PyTorch DataLoader with multiple data sources

I am trying to use the PyTorch DataLoader with my own dataset, but I am not sure how to load multiple data sources.

My current code:

    import json

    import torch
    from torch.utils.data import Dataset

    class MultipleSourceDataSet(Dataset):
        def __init__(self, json_file, root_dir, transform=None):
            # only the first block is loaded so far
            with open(root_dir + 'block0.json') as f:
                self.result = torch.Tensor(json.load(f))

            self.root_dir = root_dir
            self.transform = transform

        def __len__(self):
            return len(self.result[0])

        def __getitem__(self, idx):
            return None  # not implemented yet

The data source is 50 blocks under root_dir = ~/Documents/blocks/

I split the data into blocks and avoided combining them beforehand because this is a very big dataset.

How can I load them into a single DataLoader?










Tags: python-3.x image-processing machine-learning deep-learning pytorch

asked Nov 26 '18 at 9:14 by sealpuppy, edited Nov 27 '18 at 8:11 by Shai
























          2 Answers

A DataLoader needs a single Dataset. Your problem is that you have multiple JSON files and you only know how to create a Dataset from each JSON file separately.

What you can do in this case is use ConcatDataset, which wraps all the single-JSON datasets you create:

    import os
    import torch.utils.data as data

    class SingleJsonDataset(data.Dataset):
        # implement a single json dataset here...
        ...

    # root_dir is the directory that holds the block files, as in the question
    list_of_datasets = []
    for j in os.listdir(root_dir):
        if not j.endswith('.json'):
            continue  # skip non-json files
        list_of_datasets.append(SingleJsonDataset(json_file=j, root_dir=root_dir, transform=None))
    # once all single json datasets are created you can concat them into a single one:
    multiple_json_dataset = data.ConcatDataset(list_of_datasets)

Now you can feed the concatenated dataset into data.DataLoader.
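
To make the idea concrete, here is a minimal plain-Python sketch of what concatenating datasets amounts to: a global index is mapped to a block and a local index through running totals of the block lengths, so the blocks themselves are never merged. The class name ConcatView and the toy blocks are hypothetical and only for illustration. This is not PyTorch's actual implementation, although data.ConcatDataset behaves the same way with respect to length and indexing.

```python
import bisect
from itertools import accumulate


class ConcatView:
    """Presents several indexable blocks as one long sequence without copying them.

    Hypothetical illustration only; not the PyTorch class.
    """

    def __init__(self, blocks):
        self.blocks = list(blocks)
        # Running totals of block lengths, e.g. lengths 3, 2, 4 -> [3, 5, 9].
        self.cumulative_sizes = list(accumulate(len(b) for b in self.blocks))

    def __len__(self):
        return self.cumulative_sizes[-1] if self.cumulative_sizes else 0

    def __getitem__(self, idx):
        if not 0 <= idx < len(self):
            raise IndexError(idx)
        # The first block whose running total exceeds idx holds the sample.
        block_idx = bisect.bisect_right(self.cumulative_sizes, idx)
        offset = self.cumulative_sizes[block_idx - 1] if block_idx > 0 else 0
        return self.blocks[block_idx][idx - offset]


# Toy stand-ins for three single-JSON datasets.
view = ConcatView([["a0", "a1", "a2"], ["b0", "b1"], ["c0", "c1", "c2", "c3"]])
print(len(view), view[0], view[3], view[8])  # 9 a0 b0 c3
```

Because each block is only indexed and never copied, the memory cost depends on how each single-JSON dataset stores its data, not on the concatenation step.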






Thank you. This is a very detailed explanation. My problem is that if I concatenate all the .json files, the result will become so big that it may eventually crash. However, I will still try this solution. Thanks a lot! – sealpuppy, Nov 26 '18 at 15:11



















I should split my question into two sub-questions:

1. How can I deal with large datasets in PyTorch without running into memory errors?
2. If I split a large dataset into small chunks, how can I load the multiple mini-datasets?

For question 1:

The PyTorch DataLoader can help avoid this issue by creating mini-batches, so that only one batch of samples is processed at a time. Here you can find further explanations.

For question 2:

Please refer to Shai's answer above.
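
As a rough illustration of keeping memory use down when the data is split into blocks, the sketch below defers reading a JSON block until a sample from it is first requested, instead of loading it in __init__. It is plain Python and makes several assumptions that are not in the question: each file holds a JSON list of samples, the number of samples per block is known in advance, and the names LazyJsonBlock and release are hypothetical. In real use the class would presumably subclass torch.utils.data.Dataset and convert each sample to a tensor.

```python
import json
import os
import tempfile


class LazyJsonBlock:
    """Reads one JSON block from disk on first access rather than in __init__.

    Hypothetical sketch: assumes the file holds a JSON list of samples and
    that the number of samples is known without opening the file.
    """

    def __init__(self, path, length):
        self.path = path
        self.length = length  # assumed known up front
        self._rows = None     # nothing is loaded yet

    def __len__(self):
        return self.length

    def __getitem__(self, idx):
        if self._rows is None:
            # First access to this block: load it now and keep it for later calls.
            with open(self.path) as f:
                self._rows = json.load(f)
        return self._rows[idx]

    def release(self):
        """Drop the loaded rows so the memory can be reclaimed."""
        self._rows = None


# Small demonstration with a temporary file standing in for block0.json.
with tempfile.TemporaryDirectory() as root_dir:
    path = os.path.join(root_dir, "block0.json")
    with open(path, "w") as f:
        json.dump([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], f)

    block = LazyJsonBlock(path, length=3)
    loaded_before = block._rows is not None  # False: constructor did not read the file
    sample = block[1]                        # triggers the read
    loaded_after = block._rows is not None   # True: rows are now cached
    block.release()

print(loaded_before, sample, loaded_after)
```

With shuffled access across many blocks, every block may end up loaded at some point, so calling release (or caching only a few blocks at a time) is what actually bounds memory in this sketch.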








Answer by Shai, answered Nov 26 '18 at 13:17








Answer by sealpuppy, answered Nov 28 '18 at 6:50





























