PBS - Job exceeded reserved number of CPUs
I'm training a deep learning network, darknet/YOLO, on a remote server with NVIDIA graphics cards.
I submit a job that trains the neural network on a GPU, and the training itself works well.
The problem is the large number of processors the job consumes. I usually submit the job with 8 processors, like this:
qsub -q gpu -l select=1:ncpus=8:ngpus=1:mem=15gb:gpu_cap=cuda61
but it is always killed for exceeding the reserved number of processors. Even if I request 20 processors, the limit is still exceeded.
I don't know why the job consumes so many processors on the server, since I can run the same job on my notebook with an Intel i5 processor; it is just too slow there.
What I've tried:
1) Set cgroups=cpuacct
which is supposed to force the job NOT to use more processors than assigned, but it didn't work at all. It seems the restriction only applies when the server has no spare resources for other jobs. When there are free processors, the restriction doesn't work (https://drill.apache.org/docs/configuring-cgroups-to-control-cpu-usage/#cpu-limits)
2) Set place=exclhost
which prevents the job from being killed when it exceeds its assigned resources. On the other hand, with this flag the job takes about 7 days to even start, and I have to train the network every day.
Question:
I don't need these processors, and I don't understand why the job uses so many of them. How can I force the job NOT to exceed the given number of processors? Or is there another way to solve this kind of problem?
linux pbs yolo darknet
edited Nov 20 at 6:33
asked Nov 19 at 15:02
Filip Kočica
1 Answer
More likely, this is a mismatch between the restrictions the admin has set for that queue and your request. Ping your admin and get the details of the queues (e.g. the ppn and GPU limits of queue1).
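One way to look at those queue details yourself is to read the queue's attributes from qstat -Qf. The sketch below is only an illustration: the helper name queue_limits is made up here, and which resources_* attributes show up depends on how the site configured the queue.

```shell
# Hypothetical helper: keep only the resource-limit attributes
# (resources_max.*, resources_min.*, resources_default.*) from
# `qstat -Qf` output read on standard input.
queue_limits() {
    grep -E '^[[:space:]]*resources_(max|min|default)\.' | sed 's/^[[:space:]]*//'
}

# On the cluster, with the queue name from the question:
#   qstat -Qf gpu | queue_limits
```

If the queue reports a limit such as resources_max.ncpus that is lower than the ncpus value in the request, the request and the queue disagree, and the admin can confirm which limit applies.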
answered Nov 19 at 23:46
nav
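A separate thing worth ruling out, purely as an assumption about the asker's build: if darknet or a library it links against (OpenMP, OpenBLAS, MKL) starts one thread per core of the host, those thread pools can be capped at the reserved CPU count in the job script. The function name cap_threads is hypothetical, and NCPUS is assumed to be exported by PBS inside the job. Threads the program creates on its own are not affected by these variables.

```shell
# Hypothetical helper: cap common thread pools at the reserved CPU count.
# Assumes PBS exports NCPUS inside the job; falls back to 1 if it is unset.
cap_threads() {
    n="${NCPUS:-1}"
    export OMP_NUM_THREADS="$n"       # OpenMP runtimes
    export OPENBLAS_NUM_THREADS="$n"  # OpenBLAS
    export MKL_NUM_THREADS="$n"       # Intel MKL
}

# In the job script, call cap_threads before starting the training command.
```

If the job is still killed with these caps in place, the extra CPU load is coming from somewhere else, and the queue limits are the next thing to check.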