Setting up a multi-user job scheduler for data science / ML tasks
Background
Recently my lab invested in GPU computation infrastructure: specifically, two Titan V cards installed in a standard server machine. Currently the machine runs an essentially unconfigured Windows Server; everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others, because someone accidentally occupied all available memory.
Since ML is growing here, I am looking for a better way to make use of our infrastructure.
Requirements
- Multi-user: PhDs and students should be able to run their tasks.
- Job queue or scheduling (preferably something like time-sliced scheduling).
- Dynamic allocation of resources: if a single task is running, it is OK for it to use all the memory, but as soon as a second one starts they should share the resources.
- Easy / remote job submission: maybe a web page?
What I tried so far
I have a small test setup (a consumer PC with a GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.
First of all, I like the idea of a cluster management system, since it offers the option to extend the infrastructure in the future.
SLURM was fairly easy to set up, but I was not able to configure anything like remote submission or time-sliced scheduling.
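(For reference, time-sliced "gang" scheduling in SLURM seems to be controlled by a couple of slurm.conf settings. The following is an untested sketch based on the SLURM documentation; the node name and hardware values are placeholders for this machine:)

```ini
# slurm.conf (excerpt) -- untested sketch; node values are placeholders
# Gang scheduling: jobs sharing a node are alternately suspended/resumed
PreemptMode=SUSPEND,GANG
SchedulerTimeSlice=30          ; seconds a job runs before being rotated out

# GPUs must be declared as generic resources (GRES)
GresTypes=gpu
NodeName=gpuserver Gres=gpu:2 CPUs=16 RealMemory=64000 State=UNKNOWN
```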
In the meantime I also tried Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand, and again I was not able to build something like remote submission.
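(For what it's worth, my understanding is that "remote submission" in Kubernetes boils down to applying a Job manifest from any machine that has cluster credentials. A minimal untested sketch, with image and script names as placeholders:)

```yaml
# job.yaml -- minimal GPU batch job (image and command are placeholders)
apiVersion: batch/v1
kind: Job
metadata:
  name: train-job
spec:
  template:
    spec:
      containers:
        - name: train
          image: tensorflow/tensorflow:latest-gpu   # placeholder image
          command: ["python", "train.py"]           # placeholder script
          resources:
            limits:
              nvidia.com/gpu: 1
      restartPolicy: Never
```

This would be submitted from any workstation with `kubectl apply -f job.yaml`.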
My question
Has anyone faced the same problem and can share their solution? I have the feeling that Kubernetes is better prepared for the future.
If you need more information, let me know.
Thanks,
Tim
kubernetes gpu cluster-computing slurm docker-datacenter
Can you specify in more detail what exactly your users normally do with the machine? Do they mostly use machine-learning frameworks such as TensorFlow? Do they run experiments in Jupyter notebooks, or mostly run stable Python scripts where they have a rough idea of each run's duration in advance? What are the reasons you are considering adding K8s as an extra layer?
– Dmitri Chubarov
Nov 28 '18 at 15:44
Hi Dmitri, they will use only machine-learning frameworks. Jupyter notebooks are a nice-to-have feature. Maybe I got the idea behind K8s wrong?
– Tim J.
Dec 3 '18 at 13:46
edited Nov 23 '18 at 11:53 by damienfrancois
asked Nov 23 '18 at 10:33 by Tim J.
1 Answer
As far as I know, Kubernetes does not support GPU sharing, which was asked about here.
There is an ongoing discussion: Is sharing GPU to multiple containers feasible? #52757
I was able to find a Docker image with examples which "support share GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.
It can be used in the following way:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
    - name: cuda-container
      image: nvidia/cuda:9.0-devel
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs
    - name: digits-container
      image: nvidia/digits:6.0
      resources:
        limits:
          nvidia.com/gpu: 2 # requesting 2 GPUs

That would expose 2 GPUs inside each container to run your job in, and also lock those GPUs from further use until the job ends.
I'm not sure how you would scale this for multiple users, other than limiting the maximum number of GPUs used per job.
You can also read about Schedule GPUs, which is still experimental.
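One way to cap per-user GPU consumption might be to give each user their own namespace with a ResourceQuota on the extended GPU resource. A sketch (the namespace name is a placeholder, and I have not verified this against your exact setup):

```yaml
# Limit how many GPUs one user's pods may request at a time
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: user-alice            # one namespace per user (placeholder)
spec:
  hard:
    requests.nvidia.com/gpu: "1"   # at most 1 GPU requested concurrently
```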
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53445020%2fsetting-up-a-multi-user-job-scheduler-for-data-science-ml-tasks%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
answered Nov 23 '18 at 16:09 by Crou