GCS to S3 transfer - improve speed
- We perform a weekly transfer from GCS to S3 using the gsutil command below: 5,000 compressed objects of ~82 MB each, ~380 GB combined. The data is exported for use by Redshift, if that's of any relevance.
- The same kind of transfer from an on-prem Hadoop cluster to S3 took under an hour; with gsutil it takes 4-5 hours.
- I'm aware that, under the hood, gsutil downloads the files from GCS and then uploads them to S3, which adds some overhead. So, hoping for faster speeds, I tried running gsutil on a Compute Engine instance in the same geographical region as the S3 and GCS buckets, but it was equally slow.
I've played with the parallel_process_count and parallel_thread_count parameters, but it made no difference (how I've been setting them is shown after the command below).
gsutil -m rsync -r -n GCS_DIR S3_DIR
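For context, I've been setting them per invocation with gsutil's -o flag (they can also live in the [GSUtil] section of the .boto config file); the values below are just one combination I tried:

gsutil -o "GSUtil:parallel_process_count=8" -o "GSUtil:parallel_thread_count=10" -m rsync -r -n GCS_DIR S3_DIR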
My questions are:
- Is there anything else I can do to speed it up?
- What combinations of parallel_process_count and parallel_thread_count would you try?
- Is there any way to find out which stage is the bottleneck (if any), i.e., the download or the upload stage?
Looking at the logs, does the output below mean that bandwidth sits at 0 for a period of time?
Copying gcs://**s3.000000004972.gz
[Content-Type=application/octet-stream]...
[4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s
Thanks in advance :)
amazon-web-services amazon-s3 google-cloud-platform google-cloud-storage
asked Nov 26 '18 at 10:58 by cherry9090, edited Nov 26 '18 at 14:16 by Mikhail Berlyant
Why don't you use Storage Transfer Service to transfer objects from GCS to S3? This is the preferred method for your use case. – Samuel N, Nov 27 '18 at 3:46
Storage Transfer Service can only be used to load from S3; it doesn't let you export data to S3. – cherry9090, Nov 27 '18 at 8:29
1 Answer
The optimal values for parallel_process_count and parallel_thread_count depend on network speed, CPU count, and available memory; it's recommended that you experiment to find the values that work best for your setup.
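As a minimal sketch of that experiment (the value grid and bucket paths are placeholders, and you'd want to time it against a small test prefix rather than the full 380 GB):

for procs in 4 8 16; do
  for threads in 5 10 25; do
    echo "procs=$procs threads=$threads"
    # -o overrides the [GSUtil] settings from the .boto config for this run only
    time gsutil -o "GSUtil:parallel_process_count=$procs" \
                -o "GSUtil:parallel_thread_count=$threads" \
                -m rsync -r GCS_TEST_DIR S3_TEST_DIR
  done
done

As a rough rule of thumb, keeping processes at or below the machine's CPU count and scaling threads up for many medium-sized objects tends to help, but measuring beats guessing here.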
You might try perfdiag to get more information about the bucket on Google Cloud's side; it's a gsutil command that runs a suite of diagnostic tests against a given bucket.
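For example, something along these lines (the test list and object sizes are assumptions matched to your ~82 MB objects; see gsutil help perfdiag for the full option set):

# read/write throughput tests with 5 objects of ~80 MB each
gsutil perfdiag -t rthru,wthru -n 5 -s 80M gs://YOUR_GCS_BUCKET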
The output you've shared indicates that no upload is happening for some period of time, perhaps due to the way gsutil chunks the uploads.
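If you want to pinpoint which leg is stalling, one rough check (my own sketch, with placeholder paths, not a gsutil feature) is to time the two stages separately on the Compute Engine instance:

# Stage 1: download from GCS to local disk only
time gsutil -m cp "gs://GCS_BUCKET/export/*.gz" /tmp/staging/

# Stage 2: upload the same files to S3 using the AWS CLI
time aws s3 cp /tmp/staging/ s3://S3_BUCKET/export/ --recursive

Whichever stage dominates the wall-clock time is your bottleneck; note you'd need roughly 380 GB of local disk (or a smaller subset) for this test.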
As a final recommendation for speeding up your transfers to Amazon, you might try using Apache Beam / Dataflow.
answered Nov 28 '18 at 9:28 by Christopher P