Efficient random sampling in R

up vote
3
down vote

favorite

From a data frame, I am trying randomly sample 1:20 observations where for
each number of observation I would like to replicate the process 4 times. I
came up with this working solution, but it is very slow since it is
involving coping many times a large data frame because of the crossing()
function. Anyone can point me toward a more efficient solution?

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36

add a comment |

up vote
3
down vote

favorite

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36

add a comment |

up vote
3
down vote

favorite

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

asked Nov 19 at 14:18

Philippe Massicotte

1979

library(tidyverse)



mtcars %>% 

  group_by(cyl) %>% 

  nest() %>% 

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>% 

  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {



    data %>%

      sample_n(n, replace = TRUE) %>%

      summarise(mean_mpg = mean(mpg)) %>%

      pull(mean_mpg)



  }))

#> # A tibble: 240 x 5

#>      cyl data              n_random_sample n_replicate   res

#>    <dbl> <list>                      <int>       <int> <dbl>

#>  1     6 <tibble [7 × 10]>               1           1  17.8

#>  2     6 <tibble [7 × 10]>               1           2  21  

#>  3     6 <tibble [7 × 10]>               1           3  19.2

#>  4     6 <tibble [7 × 10]>               1           4  18.1

#>  5     6 <tibble [7 × 10]>               2           1  19.6

#>  6     6 <tibble [7 × 10]>               2           2  19.4

#>  7     6 <tibble [7 × 10]>               2           3  19.6

#>  8     6 <tibble [7 × 10]>               2           4  20.4

#>  9     6 <tibble [7 × 10]>               3           1  20.1

#> 10     6 <tibble [7 × 10]>               3           2  18.9

#> # ... with 230 more rows

^{Created on 2018-11-19 by the reprex package (v0.2.1)}

r data.table tidyverse

asked Nov 19 at 14:18

Philippe Massicotte

1979

asked Nov 19 at 14:18

Philippe Massicotte

1979

asked Nov 19 at 14:18

Philippe Massicotte

1979

asked Nov 19 at 14:18

Philippe Massicotte

1979

asked Nov 19 at 14:18

Philippe Massicotte

1979

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36

add a comment |

1

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36

Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36

add a comment |

2 Answers
2

active

oldest

votes

up vote
3
down vote

accepted

This is an alternative solution, which subsets your original dataset and picks a sample of rows using a function, instead of using nest to create the sub-datasets and store them as a list variable and then pick a sample using map:

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 at 14:51

AntoniosK

12.1k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

add a comment |

up vote
1
down vote

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 at 14:39

iod

2,9671619

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53376574%2fefficient-random-sampling-in-r%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

2 Answers
2

active

oldest

votes

2 Answers
2

active

oldest

votes

up vote
3
down vote

accepted

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 at 14:51

AntoniosK

12.1k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

add a comment |

up vote
3
down vote

accepted

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 at 14:51

AntoniosK

12.1k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

add a comment |

up vote
3
down vote

accepted

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 at 14:51

AntoniosK

12.1k1822

library(tidyverse)



# create function to sample rows

f = function(c, n) {

  mtcars %>%

    filter(cyl == c) %>%

    sample_n(n, replace = TRUE) %>%

    summarise(mean_mpg = mean(mpg)) %>%

    pull(mean_mpg)

}



# vectorise function

f = Vectorize(f)



# set seed for reproducibility

set.seed(11)



tbl_df(mtcars) %>%

  distinct(cyl) %>%

  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%

  mutate(res = f(cyl, n_random_sample))



# # A tibble: 240 x 4

#     cyl n_random_sample n_replicate   res

#   <dbl>           <int>       <int> <dbl>

# 1     6               1           1  21  

# 2     6               1           2  21  

# 3     6               1           3  18.1

# 4     6               1           4  21  

# 5     6               2           1  20.4

# 6     6               2           2  21.2

# 7     6               2           3  20.4

# 8     6               2           4  19.6

# 9     6               3           1  18.4

#10     6               3           2  19.6

# # ... with 230 more rows

answered Nov 19 at 14:51

AntoniosK

12.1k1822

answered Nov 19 at 14:51

AntoniosK

12.1k1822

answered Nov 19 at 14:51

AntoniosK

12.1k1822

answered Nov 19 at 14:51

AntoniosK

12.1k1822

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

add a comment |

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47

add a comment |

up vote
1
down vote

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 at 14:39

iod

2,9671619

add a comment |

up vote
1
down vote

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 at 14:39

iod

2,9671619

add a comment |

up vote
1
down vote

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 at 14:39

iod

2,9671619

mm<-lapply(rep(1:20, each=4), sample_n, tbl=mtcars)

This will give you a list of tables of nrows=1:20, each 4 times.

You can follow up with this to name the elements of the list:

names(mm)<-paste0("sample.",apply(expand.grid(1:4,1:20),1,paste,collapse="-"))

Result:

head(mm,5)

$`sample.1-1`

              mpg cyl disp  hp drat    wt qsec vs am gear carb

Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2



$`sample.2-1`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6



$`sample.3-1`

             mpg cyl disp hp drat    wt  qsec vs am gear carb

Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2



$`sample.4-1`

               mpg cyl  disp hp drat    wt  qsec vs am gear carb

Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1



$`sample.1-2`

              mpg cyl disp  hp drat   wt qsec vs am gear carb

Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6

Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2

answered Nov 19 at 14:39

iod

2,9671619

answered Nov 19 at 14:39

iod

2,9671619

answered Nov 19 at 14:39

iod

2,9671619

answered Nov 19 at 14:39

iod

2,9671619

add a comment |

draft saved

draft discarded

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

DZytJiJ 8U6fhuFFkDA4S,7T 977rFCbOvnQaZ1HsU4jk1yzYVUt,qh3,I

搜尋此網誌

Ytukyg