Why is Spark code taking 2 hrs to load 2 million records into a Cassandra cluster? How to optimize its...
I am trying to push my Spark-processed data to a 3-node Cassandra (C*) cluster.
I am pushing 200 million records to Cassandra, and it is taking 2 hours.
Below is my Spark cluster configuration:
- Nodes : 12
- vCores Total : 112
- Total memory : 1.5 TB.
Below are my spark-submit parameters:

I have set the number of Spark DataFrame partitions to 10, as below:
val df = df_raw.repartition(numOfPartitions) // numOfPartitions = 10
But my application is still very slow.
Can you please help me understand what I am doing wrong here?
apache-spark cassandra apache-spark-sql datastax
What's the average size of a record? Have you tried increasing output.throughput_mb_per_sec a little to see whether you're maxing that out?
– Justin Cameron
Nov 20 at 2:53
@Ramesh Maharjan sir, any help on this?
– user3252097
Nov 20 at 6:48
@Justin Cameron each row is around 850 bytes.
– user3252097
Nov 20 at 7:21
@rameshMaha I am already using the same ... df.withColumn("id", when($"id".isNull, 0).otherwise(1)).show ... but no respite.
– user3252097
Nov 20 at 7:23
Yeah, it looks like you're probably being constrained by the 5 MB/sec output throttle that you have configured. (5*1024*1024)/850 = ~6000 records/second (uncompressed). 200 million in 2 hours = ~28000 records/second (compressed). Try increasing spark.cassandra.output.throughput_mb_per_sec and see if it improves.
– Justin Cameron
Nov 20 at 22:36
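The back-of-the-envelope numbers in the comment above can be reproduced with a small sketch like the one below. It is plain Scala with no Spark dependency; the object and function names are hypothetical, and it assumes the figures quoted in this thread (a 5 MB/sec throttle, rows of roughly 850 bytes, 200 million records in 2 hours).

```scala
// Hypothetical helper, not part of Spark or the Cassandra connector.
object ThroughputEstimate {
  // Records per second that fit under a write throttle of `throttleMb` MB/s,
  // assuming uncompressed rows of `rowBytes` bytes each.
  def recordsPerSecondAtThrottle(throttleMb: Long, rowBytes: Long): Long =
    (throttleMb * 1024L * 1024L) / rowBytes

  // Average records per second actually achieved by a load of `records` rows
  // that took `hours` hours.
  def observedRecordsPerSecond(records: Long, hours: Long): Long =
    records / (hours * 3600L)

  def main(args: Array[String]): Unit = {
    // Figures taken from the thread: 5 MB/s throttle, ~850-byte rows.
    println(recordsPerSecondAtThrottle(5L, 850L))     // prints 6168
    // 200 million records loaded in 2 hours.
    println(observedRecordsPerSecond(200000000L, 2L)) // prints 27777
  }
}
```

Plugging in a different throttle value or row size gives a rough idea of whether the throttle could be the limiting factor before re-running the job.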
edited Nov 19 at 17:22
user6910411
asked Nov 19 at 17:02
user3252097