GCS to S3 transfer - improve speed












  • We run a weekly transfer from GCS to S3 using the gsutil command below: 5,000 compressed objects of ~82 MB each, ~380 GB in total. The data is exported for use by Redshift, in case that's relevant.

  • The same kind of transfer from an on-prem Hadoop cluster to S3 took under 1 hour. With gsutil it now takes 4-5 hours.

  • I'm aware that, under the hood, gsutil downloads the files from GCS and then uploads them to S3, which adds some overhead. Hoping for faster speeds, I tried running gsutil on Compute Engine in the same geographical location as the S3 and GCS buckets, but it was equally slow.


  • I've experimented with the parallel_process_count and parallel_thread_count parameters, but it made no difference.



    gsutil -m rsync -r -n GCS_DIR S3_DIR
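
For scale, the figures above can be turned into average throughput. This is a minimal sketch, assuming the ~380 GB total is in decimal gigabytes and treating "under 1 hour" as exactly one hour (both are assumptions, and the function name is made up for illustration):

```python
# Back-of-the-envelope throughput from the figures quoted in the question.

def average_rate_mb_s(total_gb: float, hours: float) -> float:
    """Average throughput in MB/s for total_gb (decimal GB) moved in `hours`."""
    return total_gb * 1000 / (hours * 3600)


if __name__ == "__main__":
    total_gb = 380  # approximate total size stated in the question
    for label, hours in [("gsutil, 4 h", 4), ("gsutil, 5 h", 5), ("Hadoop, 1 h", 1)]:
        print(f"{label}: {average_rate_mb_s(total_gb, hours):.1f} MB/s")
```

Under those assumptions gsutil averages roughly 21 to 26 MB/s, against about 106 MB/s or more for the Hadoop transfer.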



My questions are:




  • Is there anything else I can do to speed it up?

  • What combinations of parallel_process_count and parallel_thread_count would you try?

  • Is there any way to find out which stage is the bottleneck (if any), i.e. is it the upload or the download stage?


  • Looking at the logs, does the output below mean that throughput is at zero for a period of time?



    Copying gcs://**s3.000000004972.gz 
    [Content-Type=application/octet-stream]...
    [4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s



Thanks in advance :)










  • Why don't you use Storage Transfer Service to transfer objects from GCS to S3? This is the preferred method for your use case.

    – Samuel N
    Nov 27 '18 at 3:46











  • Storage Transfer Service can only be used to load from S3. It doesn't let you export data to S3.

    – cherry9090
    Nov 27 '18 at 8:29
















amazon-web-services amazon-s3 google-cloud-platform google-cloud-storage






edited Nov 26 '18 at 14:16
Mikhail Berlyant

asked Nov 26 '18 at 10:58
cherry9090

























1 Answer














The optimal values for parallel_process_count and parallel_thread_count depend on network speed, the number of CPUs, and available memory, so it's recommended that you experiment a bit to find the best combination.
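
One way to experiment systematically is to time a small sample transfer across a grid of settings. The sketch below is only an illustration: the bucket paths are placeholders, `build_command` and `sweep` are hypothetical helper names, and it assumes each run copies the same sample prefix to a fresh destination prefix so that rsync has real work to do every time. The overrides use gsutil's top-level `-o` option, and `-n` is left out because it makes rsync a dry run.

```python
import itertools
import subprocess
import time
from typing import Callable, Iterable, List, Tuple


def build_command(processes: int, threads: int, src: str, dst: str) -> List[str]:
    """Return a gsutil rsync command with per-run parallelism overrides."""
    return [
        "gsutil",
        # -o overrides a boto config value for this invocation only.
        "-o", f"GSUtil:parallel_process_count={processes}",
        "-o", f"GSUtil:parallel_thread_count={threads}",
        # No -n here: that flag turns rsync into a dry run that copies nothing.
        "-m", "rsync", "-r", src, dst,
    ]


def sweep(
    process_counts: Iterable[int],
    thread_counts: Iterable[int],
    src: str,
    dst_base: str,
    run: Callable[..., object] = subprocess.run,
) -> List[Tuple[int, int, float]]:
    """Time one transfer per (processes, threads) pair, fastest first."""
    results = []
    for procs, threads in itertools.product(process_counts, thread_counts):
        # A fresh destination prefix per run, so every run copies the full sample.
        dst = f"{dst_base}/p{procs}_t{threads}"
        start = time.monotonic()
        run(build_command(procs, threads, src, dst), check=True)
        results.append((procs, threads, time.monotonic() - start))
    return sorted(results, key=lambda r: r[2])
```

Calling `sweep([1, 4, 8], [4, 16, 32], "gs://SAMPLE_DIR", "s3://SCRATCH_DIR")` would return the combinations ordered from fastest to slowest. Timings from a small sample are only a rough guide to the full run.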



You might try gsutil perfdiag to get more information about the bucket on the Google Cloud side; it is a command that runs a suite of diagnostic tests against a given bucket.



The output you've shared indicates that no upload was happening for some period of time, perhaps because of the way gsutil chunks the uploads.
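
A single `0.0 B/s` reading is only a snapshot, so it may help to check how often the rate drops to zero over a whole run. This is a rough sketch that assumes every progress line has the same shape as the sample in the question; `parse_progress` and `stalled_fraction` are hypothetical helper names, and the second sample line in the usage example is invented for illustration.

```python
import re
from typing import Iterable, Optional, Tuple

UNITS = {"B": 1, "KiB": 1024, "MiB": 1024 ** 2, "GiB": 1024 ** 3, "TiB": 1024 ** 4}

# Matches the tail of a line such as:
# [4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s
PROGRESS = re.compile(
    r"\]\s*(?P<pct>\d+)% Done\s+(?P<rate>\d+(?:\.\d+)?) (?P<unit>[KMGT]iB|B)/s"
)


def parse_progress(line: str) -> Optional[Tuple[int, float]]:
    """Return (percent done, rate in bytes/s), or None for other lines."""
    match = PROGRESS.search(line)
    if match is None:
        return None
    rate = float(match.group("rate")) * UNITS[match.group("unit")]
    return int(match.group("pct")), rate


def stalled_fraction(lines: Iterable[str]) -> float:
    """Fraction of progress lines that report a rate of exactly zero."""
    rates = [p[1] for p in map(parse_progress, lines) if p is not None]
    return sum(r == 0 for r in rates) / len(rates) if rates else 0.0


if __name__ == "__main__":
    sample = [
        "[4.7k/5.0k files][360.0 GiB/381.6 GiB] 94% Done 25.3 MiB/s",  # invented
        "[4.8k/5.0k files][367.4 GiB/381.6 GiB] 96% Done 0.0 B/s",
    ]
    print(stalled_fraction(sample))
```

A high fraction across the captured log would point to sustained stalls rather than a momentary pause between objects.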



As a final recommendation for speeding up your transfers to Amazon, you might try using Apache Beam / Dataflow.
















answered Nov 28 '18 at 9:28
Christopher P































