Setting up a multi-user job scheduler for data science / ML tasks












Background



Recently my lab invested in GPU computation infrastructure, more specifically two Titan V cards installed in a standard server machine. Currently the machine runs an essentially unconfigured Windows Server; everyone in my lab can log in and do whatever they want. From time to time the machine becomes completely unusable for others, because someone accidentally occupied all the available memory.



Since ML is growing here, I am looking for a better way to make use of our infrastructure.



Requirements




  • Multi-user: PhDs and students should be able to run their tasks.

  • Job queue or scheduling (preferably something like time-sliced scheduling).

  • Dynamic allocation of resources: if a single task is running, it is fine for it to use all the memory, but as soon as a second one starts they should share the resources.

  • Easy / remote job submission: maybe a web page?
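To make the queueing requirement concrete, here is a minimal toy sketch in plain Python (no real scheduler involved): a single worker drains a FIFO queue, so submissions from different users are serialized instead of colliding on the GPU. The user names and "jobs" are invented placeholders, not part of any actual setup.

```python
import queue
import threading

# Toy sketch: one worker drains a FIFO job queue, so only one
# "GPU job" runs at a time, regardless of who submitted it.
job_queue = queue.Queue()
results = []

def worker():
    while True:
        user, job = job_queue.get()
        if job is None:          # sentinel: stop the worker
            break
        results.append((user, job()))  # run the job exclusively
        job_queue.task_done()

t = threading.Thread(target=worker)
t.start()

# Two users submit jobs; they are run in submission order.
job_queue.put(("alice", lambda: "trained model A"))
job_queue.put(("bob", lambda: "trained model B"))
job_queue.put((None, None))      # sentinel
t.join()

print(results)  # [('alice', 'trained model A'), ('bob', 'trained model B')]
```

A real scheduler adds exactly what this sketch lacks: per-user accounting, preemption/time slicing, and resource limits.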


What I tried so far



I have a small test setup (consumer PC with GTX 1070) for experimenting. My internet research pointed me to SLURM and Kubernetes.



First of all, I like the idea of a cluster-management system, since it offers the option to extend the infrastructure in the future.



SLURM was fairly easy to set up, but I was not able to set up anything like remote submission or time-sliced scheduling.
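For what it's worth, remote submission with SLURM usually amounts to ssh plus sbatch, provided the GPUs are declared as generic resources (gres.conf). The script below is only a sketch with placeholder names and values (train.py, the job name, the limits), not a tested configuration:

```bash
#!/bin/bash
#SBATCH --job-name=train-model   # placeholder job name
#SBATCH --gres=gpu:1             # request one GPU; SLURM queues the job if none is free
#SBATCH --mem=16G                # cap host memory so one job cannot starve others
#SBATCH --time=04:00:00          # wall-time limit enables fairer sharing

python train.py
```

Submission from a user's laptop would then look like `ssh gpu-server sbatch train.sbatch` (hostname and filename are placeholders). True time slicing of a running GPU job is not something SLURM provides; what it gives you is queueing plus wall-time limits.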



In the meantime I also tried Kubernetes. To me it offers far more interesting features, above all containerization. However, all these features make it more complicated to set up and understand, and again I was not able to build anything like remote submission.



My question



Has anyone faced the same problem and can report their solution? I have the feeling that Kubernetes is better prepared for the future.



If you need more information, let me know.



Thanks,
Tim










  • Can you specify in more detail what exactly your users normally do with the machine? Do they mostly use machine-learning frameworks such as TensorFlow? Do they run experiments in Jupyter notebooks, or mostly run stable Python scripts where they have a rough idea of the duration of each run in advance? What are your reasons for considering K8s as an extra layer?

    – Dmitri Chubarov
    Nov 28 '18 at 15:44











  • Hi Dmitri, they will use only machine-learning frameworks. Jupyter notebooks are a nice-to-have feature. Maybe I got the idea behind K8s wrong?

    – Tim J.
    Dec 3 '18 at 13:46
















kubernetes gpu cluster-computing slurm docker-datacenter






edited Nov 23 '18 at 11:53 – damienfrancois

asked Nov 23 '18 at 10:33 – Tim J.

1 Answer
As far as I know, Kubernetes does not support GPU sharing, which is what was asked here.



There is an ongoing discussion: Is sharing GPU to multiple containers feasible? #52757



I was able to find a Docker image with examples that "support share GPUs unofficially", available here: cvaldit/nvidia-k8s-device-plugin.



It can be used in the following way:



apiVersion: v1
kind: Pod
metadata:
  name: gpu-pod
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:9.0-devel
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs
  - name: digits-container
    image: nvidia/digits:6.0
    resources:
      limits:
        nvidia.com/gpu: 2 # requesting 2 GPUs



That would expose 2 GPUs inside each container to run your job, and also lock those 2 GPUs from further use until the job ends.



I'm not sure how you would scale this for multiple users, other than limiting the maximum number of GPUs used per job.
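One possible way to cap per-user consumption (a sketch, assuming each user or team is given their own namespace; the names below are placeholders) is a ResourceQuota on the extended GPU resource:

```yaml
apiVersion: v1
kind: ResourceQuota
metadata:
  name: gpu-quota
  namespace: team-a              # hypothetical per-team namespace
spec:
  hard:
    requests.nvidia.com/gpu: "1" # at most one GPU requested at a time in this namespace
```

Pods in that namespace that would exceed the quota are rejected at admission rather than queued, so this limits usage but does not by itself provide scheduling.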



You can also read about Schedule GPUs, which is still experimental.






answered Nov 23 '18 at 16:09 – Crou































