Efficient random sampling in R











up vote
3
down vote

favorite













From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?



library(tidyverse)

mtcars %>%
group_by(cyl) %>%
nest() %>%
crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

data %>%
sample_n(n, replace = TRUE) %>%
summarise(mean_mpg = mean(mpg)) %>%
pull(mean_mpg)

}))
#> # A tibble: 240 x 5
#> cyl data n_random_sample n_replicate res
#> <dbl> <list> <int> <int> <dbl>
#> 1 6 <tibble [7 × 10]> 1 1 17.8
#> 2 6 <tibble [7 × 10]> 1 2 21
#> 3 6 <tibble [7 × 10]> 1 3 19.2
#> 4 6 <tibble [7 × 10]> 1 4 18.1
#> 5 6 <tibble [7 × 10]> 2 1 19.6
#> 6 6 <tibble [7 × 10]> 2 2 19.4
#> 7 6 <tibble [7 × 10]> 2 3 19.6
#> 8 6 <tibble [7 × 10]> 2 4 20.4
#> 9 6 <tibble [7 × 10]> 3 1 20.1
#> 10 6 <tibble [7 × 10]> 3 2 18.9
#> # ... with 230 more rows


Created on 2018-11-19 by the reprex package (v0.2.1)










share|improve this question


















  • 1




    Just sample row-indices and use a subset of your data each time rather than copying the data.
    – Gregor
    Nov 19 at 14:36















up vote
3
down vote

favorite













From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?



library(tidyverse)

mtcars %>%
group_by(cyl) %>%
nest() %>%
crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

data %>%
sample_n(n, replace = TRUE) %>%
summarise(mean_mpg = mean(mpg)) %>%
pull(mean_mpg)

}))
#> # A tibble: 240 x 5
#> cyl data n_random_sample n_replicate res
#> <dbl> <list> <int> <int> <dbl>
#> 1 6 <tibble [7 × 10]> 1 1 17.8
#> 2 6 <tibble [7 × 10]> 1 2 21
#> 3 6 <tibble [7 × 10]> 1 3 19.2
#> 4 6 <tibble [7 × 10]> 1 4 18.1
#> 5 6 <tibble [7 × 10]> 2 1 19.6
#> 6 6 <tibble [7 × 10]> 2 2 19.4
#> 7 6 <tibble [7 × 10]> 2 3 19.6
#> 8 6 <tibble [7 × 10]> 2 4 20.4
#> 9 6 <tibble [7 × 10]> 3 1 20.1
#> 10 6 <tibble [7 × 10]> 3 2 18.9
#> # ... with 230 more rows


Created on 2018-11-19 by the reprex package (v0.2.1)










share|improve this question


















  • 1




    Just sample row-indices and use a subset of your data each time rather than copying the data.
    – Gregor
    Nov 19 at 14:36













up vote
3
down vote

favorite









up vote
3
down vote

favorite












From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?



library(tidyverse)

mtcars %>%
group_by(cyl) %>%
nest() %>%
crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

data %>%
sample_n(n, replace = TRUE) %>%
summarise(mean_mpg = mean(mpg)) %>%
pull(mean_mpg)

}))
#> # A tibble: 240 x 5
#> cyl data n_random_sample n_replicate res
#> <dbl> <list> <int> <int> <dbl>
#> 1 6 <tibble [7 × 10]> 1 1 17.8
#> 2 6 <tibble [7 × 10]> 1 2 21
#> 3 6 <tibble [7 × 10]> 1 3 19.2
#> 4 6 <tibble [7 × 10]> 1 4 18.1
#> 5 6 <tibble [7 × 10]> 2 1 19.6
#> 6 6 <tibble [7 × 10]> 2 2 19.4
#> 7 6 <tibble [7 × 10]> 2 3 19.6
#> 8 6 <tibble [7 × 10]> 2 4 20.4
#> 9 6 <tibble [7 × 10]> 3 1 20.1
#> 10 6 <tibble [7 × 10]> 3 2 18.9
#> # ... with 230 more rows


Created on 2018-11-19 by the reprex package (v0.2.1)










share|improve this question














From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?



library(tidyverse)

mtcars %>%
group_by(cyl) %>%
nest() %>%
crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
mutate(res = map2_dbl(data, n_random_sample, function(data, n) {

data %>%
sample_n(n, replace = TRUE) %>%
summarise(mean_mpg = mean(mpg)) %>%
pull(mean_mpg)

}))
#> # A tibble: 240 x 5
#> cyl data n_random_sample n_replicate res
#> <dbl> <list> <int> <int> <dbl>
#> 1 6 <tibble [7 × 10]> 1 1 17.8
#> 2 6 <tibble [7 × 10]> 1 2 21
#> 3 6 <tibble [7 × 10]> 1 3 19.2
#> 4 6 <tibble [7 × 10]> 1 4 18.1
#> 5 6 <tibble [7 × 10]> 2 1 19.6
#> 6 6 <tibble [7 × 10]> 2 2 19.4
#> 7 6 <tibble [7 × 10]> 2 3 19.6
#> 8 6 <tibble [7 × 10]> 2 4 20.4
#> 9 6 <tibble [7 × 10]> 3 1 20.1
#> 10 6 <tibble [7 × 10]> 3 2 18.9
#> # ... with 230 more rows


Created on 2018-11-19 by the reprex package (v0.2.1)







r data.table tidyverse






share|improve this question













share|improve this question











share|improve this question




share|improve this question










asked Nov 19 at 14:18









Philippe Massicotte

1979




1979








  • 1




    Just sample row-indices and use a subset of your data each time rather than copying the data.
    – Gregor
    Nov 19 at 14:36














  • 1




    Just sample row-indices and use a subset of your data each time rather than copying the data.
    – Gregor
    Nov 19 at 14:36








1




1




Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36




Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36












2 Answers
2






active

oldest

votes

















up vote
3
down vote



accepted










This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:



library(tidyverse)

# create function to sample rows
f = function(c, n) {
mtcars %>%
filter(cyl == c) %>%
sample_n(n, replace = TRUE) %>%
summarise(mean_mpg = mean(mpg)) %>%
pull(mean_mpg)
}

# vectorise function
f = Vectorize(f)

# set seed for reproducibility
set.seed(11)

tbl_df(mtcars) %>%
distinct(cyl) %>%
crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
mutate(res = f(cyl, n_random_sample))

# # A tibble: 240 x 4
# cyl n_random_sample n_replicate res
# <dbl> <int> <int> <dbl>
# 1 6 1 1 21
# 2 6 1 2 21
# 3 6 1 3 18.1
# 4 6 1 4 21
# 5 6 2 1 20.4
# 6 6 2 2 21.2
# 7 6 2 3 20.4
# 8 6 2 4 19.6
# 9 6 3 1 18.4
#10 6 3 2 19.6
# # ... with 230 more rows





share|improve this answer





















  • Works as intended, thank you.
    – Philippe Massicotte
    Nov 19 at 15:47


















up vote
1
down vote













mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)


This will give you a list of tables of nrows=1:20, each 4 times.



You can follow up with this to name the elements of the list:



names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))


Result:



head(mm,5)
$`sample.1-1`
mpg cyl disp hp drat wt qsec vs am gear carb
Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

$`sample.2-1`
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

$`sample.3-1`
mpg cyl disp hp drat wt qsec vs am gear carb
Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

$`sample.4-1`
mpg cyl disp hp drat wt qsec vs am gear carb
Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1

$`sample.1-2`
mpg cyl disp hp drat wt qsec vs am gear carb
Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2





share|improve this answer





















    Your Answer






    StackExchange.ifUsing("editor", function () {
    StackExchange.using("externalEditor", function () {
    StackExchange.using("snippets", function () {
    StackExchange.snippets.init();
    });
    });
    }, "code-snippets");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "1"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    convertImagesToLinks: true,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: 10,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














     

    draft saved


    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53376574%2fefficient-random-sampling-in-r%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    2 Answers
    2






    active

    oldest

    votes








    2 Answers
    2






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes








    up vote
    3
    down vote



    accepted










    This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:



    library(tidyverse)

    # create function to sample rows
    f = function(c, n) {
    mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
    }

    # vectorise function
    f = Vectorize(f)

    # set seed for reproducibility
    set.seed(11)

    tbl_df(mtcars) %>%
    distinct(cyl) %>%
    crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
    mutate(res = f(cyl, n_random_sample))

    # # A tibble: 240 x 4
    # cyl n_random_sample n_replicate res
    # <dbl> <int> <int> <dbl>
    # 1 6 1 1 21
    # 2 6 1 2 21
    # 3 6 1 3 18.1
    # 4 6 1 4 21
    # 5 6 2 1 20.4
    # 6 6 2 2 21.2
    # 7 6 2 3 20.4
    # 8 6 2 4 19.6
    # 9 6 3 1 18.4
    #10 6 3 2 19.6
    # # ... with 230 more rows





    share|improve this answer





















    • Works as intended, thank you.
      – Philippe Massicotte
      Nov 19 at 15:47















    up vote
    3
    down vote



    accepted










    This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:



    library(tidyverse)

    # create function to sample rows
    f = function(c, n) {
    mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
    }

    # vectorise function
    f = Vectorize(f)

    # set seed for reproducibility
    set.seed(11)

    tbl_df(mtcars) %>%
    distinct(cyl) %>%
    crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
    mutate(res = f(cyl, n_random_sample))

    # # A tibble: 240 x 4
    # cyl n_random_sample n_replicate res
    # <dbl> <int> <int> <dbl>
    # 1 6 1 1 21
    # 2 6 1 2 21
    # 3 6 1 3 18.1
    # 4 6 1 4 21
    # 5 6 2 1 20.4
    # 6 6 2 2 21.2
    # 7 6 2 3 20.4
    # 8 6 2 4 19.6
    # 9 6 3 1 18.4
    #10 6 3 2 19.6
    # # ... with 230 more rows





    share|improve this answer





















    • Works as intended, thank you.
      – Philippe Massicotte
      Nov 19 at 15:47













    up vote
    3
    down vote



    accepted







    up vote
    3
    down vote



    accepted






    This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:



    library(tidyverse)

    # create function to sample rows
    f = function(c, n) {
    mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
    }

    # vectorise function
    f = Vectorize(f)

    # set seed for reproducibility
    set.seed(11)

    tbl_df(mtcars) %>%
    distinct(cyl) %>%
    crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
    mutate(res = f(cyl, n_random_sample))

    # # A tibble: 240 x 4
    # cyl n_random_sample n_replicate res
    # <dbl> <int> <int> <dbl>
    # 1 6 1 1 21
    # 2 6 1 2 21
    # 3 6 1 3 18.1
    # 4 6 1 4 21
    # 5 6 2 1 20.4
    # 6 6 2 2 21.2
    # 7 6 2 3 20.4
    # 8 6 2 4 19.6
    # 9 6 3 1 18.4
    #10 6 3 2 19.6
    # # ... with 230 more rows





    share|improve this answer












    This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:



    library(tidyverse)

    # create function to sample rows
    f = function(c, n) {
    mtcars %>%
    filter(cyl == c) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
    }

    # vectorise function
    f = Vectorize(f)

    # set seed for reproducibility
    set.seed(11)

    tbl_df(mtcars) %>%
    distinct(cyl) %>%
    crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
    mutate(res = f(cyl, n_random_sample))

    # # A tibble: 240 x 4
    # cyl n_random_sample n_replicate res
    # <dbl> <int> <int> <dbl>
    # 1 6 1 1 21
    # 2 6 1 2 21
    # 3 6 1 3 18.1
    # 4 6 1 4 21
    # 5 6 2 1 20.4
    # 6 6 2 2 21.2
    # 7 6 2 3 20.4
    # 8 6 2 4 19.6
    # 9 6 3 1 18.4
    #10 6 3 2 19.6
    # # ... with 230 more rows






    share|improve this answer












    share|improve this answer



    share|improve this answer










    answered Nov 19 at 14:51









    AntoniosK

    12.1k1822




    12.1k1822












    • Works as intended, thank you.
      – Philippe Massicotte
      Nov 19 at 15:47


















    • Works as intended, thank you.
      – Philippe Massicotte
      Nov 19 at 15:47
















    Works as intended, thank you.
    – Philippe Massicotte
    Nov 19 at 15:47




    Works as intended, thank you.
    – Philippe Massicotte
    Nov 19 at 15:47












    up vote
    1
    down vote













    mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)


    This will give you a list of tables of nrows=1:20, each 4 times.



    You can follow up with this to name the elements of the list:



    names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))


    Result:



    head(mm,5)
    $`sample.1-1`
    mpg cyl disp hp drat wt qsec vs am gear carb
    Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

    $`sample.2-1`
    mpg cyl disp hp drat wt qsec vs am gear carb
    Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

    $`sample.3-1`
    mpg cyl disp hp drat wt qsec vs am gear carb
    Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

    $`sample.4-1`
    mpg cyl disp hp drat wt qsec vs am gear carb
    Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1

    $`sample.1-2`
    mpg cyl disp hp drat wt qsec vs am gear carb
    Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
    Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2





    share|improve this answer

























      up vote
      1
      down vote













      mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)


      This will give you a list of tables of nrows=1:20, each 4 times.



      You can follow up with this to name the elements of the list:



      names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))


      Result:



      head(mm,5)
      $`sample.1-1`
      mpg cyl disp hp drat wt qsec vs am gear carb
      Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

      $`sample.2-1`
      mpg cyl disp hp drat wt qsec vs am gear carb
      Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

      $`sample.3-1`
      mpg cyl disp hp drat wt qsec vs am gear carb
      Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

      $`sample.4-1`
      mpg cyl disp hp drat wt qsec vs am gear carb
      Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1

      $`sample.1-2`
      mpg cyl disp hp drat wt qsec vs am gear carb
      Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
      Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2





      share|improve this answer























        up vote
        1
        down vote










        up vote
        1
        down vote









        mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)


        This will give you a list of tables of nrows=1:20, each 4 times.



        You can follow up with this to name the elements of the list:



        names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))


        Result:



        head(mm,5)
        $`sample.1-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

        $`sample.2-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

        $`sample.3-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

        $`sample.4-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1

        $`sample.1-2`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
        Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2





        share|improve this answer












        mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)


        This will give you a list of tables of nrows=1:20, each 4 times.



        You can follow up with this to name the elements of the list:



        names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))


        Result:



        head(mm,5)
        $`sample.1-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2

        $`sample.2-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6

        $`sample.3-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

        $`sample.4-1`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Toyota Corona 21.5 4 120.1 97 3.7 2.465 20.01 1 0 3 1

        $`sample.1-2`
        mpg cyl disp hp drat wt qsec vs am gear carb
        Ferrari Dino 19.7 6 145 175 3.62 2.77 15.5 0 1 5 6
        Volvo 142E 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2






        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Nov 19 at 14:39









        iod

        2,9671619




        2,9671619






























             

            draft saved


            draft discarded



















































             


            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53376574%2fefficient-random-sampling-in-r%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            Wiesbaden

            Marschland

            Dieringhausen