PBS - Job exceeded reserved number of CPUs











I'm training a deep-learning network called darknet/YOLO on a remote server with NVIDIA graphics cards.



I submit a job that trains the neural network on a GPU, and the training itself works well.



The problem is the huge number of processors the job consumes. I usually submit a job requesting 8 processors, like this:



qsub -q gpu -l select=1:ncpus=8:ngpus=1:mem=15gb:gpu_cap=cuda61


but it is always killed for exceeding the reserved number of processors. Even if I request 20 processors, the limit is still exceeded.



I don't know why the job consumes so many processors on the server, even though I can run the same job on my notebook with an Intel i5 processor (it's just too slow there).
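One common cause is that darknet (when built with OpenMP) and the BLAS libraries underneath it spawn one thread per visible core, regardless of what PBS reserved. A minimal job-script sketch that caps the thread pools to the reserved count — assuming a PBS Pro-style environment where `$NCPUS` is exported to the job, and with hypothetical data/config paths:

```shell
#!/bin/bash
#PBS -q gpu
#PBS -l select=1:ncpus=8:ngpus=1:mem=15gb:gpu_cap=cuda61

# Cap the threads OpenMP and OpenBLAS may spawn to the reserved CPU count
# ($NCPUS is set by PBS Pro; fall back to 8 if your flavour doesn't export it).
export OMP_NUM_THREADS="${NCPUS:-8}"
export OPENBLAS_NUM_THREADS="${NCPUS:-8}"

# Belt-and-braces: pin the process to that many logical CPUs.
CPUSET="0-$(( ${NCPUS:-8} - 1 ))"   # e.g. ncpus=8 -> "0-7"

cd "$PBS_O_WORKDIR"
taskset -c "$CPUSET" ./darknet detector train data/obj.data cfg/yolo.cfg
```

Note that pinning to CPUs `0-7` only makes sense if the scheduler does not already confine the job to a specific cpuset; on cgroup-managed clusters the environment variables alone are usually the safer half of this sketch.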



What I've tried:



1) Setting cgroups=cpuacct, which should force the job not to use more processors than assigned, but it didn't work at all. It seems the restriction only kicks in when the server is short of resources for other jobs; as long as free processors are available, the limit is not enforced (https://drill.apache.org/docs/configuring-cgroups-to-control-cpu-usage/#cpu-limits).



2) Setting place=exclhost (exclusive host), so the job is not killed when it exceeds the assigned resources. On the other hand, with this flag it takes about 7 days for the job to even start, and I need to train the network every day.



Question:



I don't need these processors, and I don't understand why the job uses so many of them. How can I force the job not to exceed the given number of processors? Or is there some other way to solve this kind of problem?
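To see how many threads the training process actually spawns (and thus whether a thread-pool cap would help), you can inspect `/proc` on the execution host while the job is alive. A sketch — the process name `darknet` is taken from the question:

```shell
# Count the threads of the running training process via /proc (Linux only).
PID=$(pgrep -f darknet | head -n 1)
THREADS=$(ls "/proc/$PID/task" | wc -l)
echo "darknet is running $THREADS threads"
```

If the thread count far exceeds the reserved `ncpus`, the job's CPU-time accounting will likewise exceed what PBS allows, which matches the kill behaviour described above.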










      linux pbs yolo darknet






      edited Nov 20 at 6:33

























      asked Nov 19 at 15:02









      Filip Kočica

1 Answer























It is more likely a mismatch between the admin-set restrictions for that queue and your request. So ping your admin and get the details of the queues (e.g., queue1's CPU-per-node and GPU limits).
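A hedged way to check this yourself, assuming a PBS Pro-style `qstat` is available on the login node (the `show_limits` filter is just an illustration of which queue attributes to look for):

```shell
# Filter the queue attributes that typically cap a job's CPU request.
show_limits() { grep -iE 'resources_(max|default)|max_run'; }

# On the cluster, inspect the gpu queue from the question:
#   qstat -Qf gpu | show_limits

# Illustration of what a matching attribute line looks like:
echo "resources_max.ncpus = 8" | show_limits
```

If `resources_max.ncpus` (or a default) on the gpu queue is lower than what the job actually uses, that mismatch would explain the kill even when the `qsub` request itself looks generous.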






                answered Nov 19 at 23:46









                nav
