Setting up a multi-user job scheduler for data science / ML tasks
Background
Recently my lab invested in GPU computation infrastructure: specifically, two Titan V cards installed in a standard server machine. Currently the machine runs an essentially unconfigured Windows Server; everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others, because someone accidentally occupied all available memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
- Multi-user: PhDs and students should be able to run their tasks.
- Job queue or scheduling (preferably something like time-sliced scheduling).
- Dynamic allocation of resources: if a single task is running, it is OK for it to use all the memory, but as soon as a second one starts they should share the resources.
- Easy / remote job submission: maybe a web page?
What I tried so far
I have a small test setup (a consumer PC with a GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all, I like the idea of a cluster management system, since it offers the option to extend the infrastructure in the future.
SLURM was fairly easy to set up, but I was not able to configure anything like remote submission or time-sliced scheduling.
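(For reference, time-sliced "gang" scheduling in SLURM seems to be controlled by a couple of slurm.conf settings. The following is an untested sketch based on the SLURM documentation; the node name and hardware values are placeholders for this machine:)

```ini
# slurm.conf (excerpt) -- untested sketch; node values are placeholders
# Gang scheduling: jobs sharing a node are alternately suspended/resumed
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30          ; seconds a job runs before being rotated out

# GPUs must be declared as generic resources (GRES)
GresTypes=gpu
NodeName=gpuserver Gres=gpu:2 CPUs=16 RealMemory=64000 State=UNKNOWN
```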
In the meantime I also tried Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand, and again I was not able to build something like remote submission.
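(For what it's worth, my understanding is that "remote submission" in Kubernetes boils down to applying a Job manifest from any machine that has cluster credentials. A minimal untested sketch, with image and script names as placeholders:)

```yaml
# job.yaml -- minimal GPU batch job (image and command are placeholders)
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job
spec:
  template:
    spec:
      containers:
        - name: train
          image: tensorflow/tensorflow:latest-gpu   # placeholder image
          command: ["python", "train.py"]           # placeholder script
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```

This would be submitted from any workstation with `kubectl apply -f job.yaml`.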
My question
Has anyone faced the same problem and can share their solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks,
Tim
kubernetes gpu cluster-computing slurm docker-datacenter
Can you specify in more detail what exactly your users normally do with the machine? Do they mostly use machine-learning frameworks such as TensorFlow? Do they run experiments in Jupyter notebooks, or mostly run stable Python scripts where they have a rough idea of each run's duration in advance? What are the reasons you are considering adding K8s as an extra layer?
– Dmitri Chubarov
Nov 28 '18 at 15:44
Hi Dmitri, they will use only machine-learning frameworks. Jupyter notebooks are a nice-to-have feature. Maybe I got the idea behind K8s wrong?
– Tim J.
Dec 3 '18 at 13:46
edited Nov 23 '18 at 11:53 by damienfrancois
asked Nov 23 '18 at 10:33 by Tim J.
1 Answer
As far as I know, Kubernetes does not support GPU sharing, which was asked about here.
There is an ongoing discussion: Is sharing GPU to multiple containers feasible? #52757
I was able to find a Docker image with examples which "support share GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.
It can be used in the following way:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

That would expose 2 GPUs inside each container to run your job in, and also lock those GPUs from further use until the job ends.
I'm not sure how you would scale this for multiple users, other than limiting the maximum number of GPUs used per job.
You can also read about Schedule GPUs, which is still experimental.
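One way to cap per-user GPU consumption might be to give each user their own namespace with a ResourceQuota on the extended GPU resource. A sketch (the namespace name is a placeholder, and I have not verified this against your exact setup):

```yaml
# Limit how many GPUs one user's pods may request at a time
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: user-alice            # one namespace per user (placeholder)
spec:
  hard:
    requests.nvidia.com/gpu: "1"   # at most 1 GPU requested concurrently
```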
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53445020%2fsetting-up-a-multi-user-job-scheduler-for-data-science-ml-tasks%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
answered Nov 23 '18 at 16:09 by Crou