Efficient random sampling in R
From a data frame, I am trying to randomly sample 1 to 20 observations, and
for each sample size I would like to replicate the process 4 times. I came
up with this working solution, but it is very slow because the crossing()
function copies a large data frame many times. Can anyone point me toward a
more efficient solution?
library(tidyverse)

mtcars %>%
  group_by(cyl) %>%
  nest() %>%
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
  mutate(res = map2_dbl(data, n_random_sample, function(data, n) {
    data %>%
      sample_n(n, replace = TRUE) %>%
      summarise(mean_mpg = mean(mpg)) %>%
      pull(mean_mpg)
  }))
#> # A tibble: 240 x 5
#>      cyl data              n_random_sample n_replicate   res
#>    <dbl> <list>                      <int>       <int> <dbl>
#>  1     6 <tibble [7 × 10]>               1           1  17.8
#>  2     6 <tibble [7 × 10]>               1           2  21
#>  3     6 <tibble [7 × 10]>               1           3  19.2
#>  4     6 <tibble [7 × 10]>               1           4  18.1
#>  5     6 <tibble [7 × 10]>               2           1  19.6
#>  6     6 <tibble [7 × 10]>               2           2  19.4
#>  7     6 <tibble [7 × 10]>               2           3  19.6
#>  8     6 <tibble [7 × 10]>               2           4  20.4
#>  9     6 <tibble [7 × 10]>               3           1  20.1
#> 10     6 <tibble [7 × 10]>               3           2  18.9
#> # ... with 230 more rows
Created on 2018-11-19 by the reprex package (v0.2.1)
r data.table tidyverse
asked Nov 19 at 14:18
Philippe Massicotte
Just sample row-indices and use a subset of your data each time rather than copying the data.
– Gregor
Nov 19 at 14:36
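One way to read the comment above is sketched below in base R. This is only an illustration under assumptions, not the commenter's own code: the row numbers of mtcars are split by cyl once, and each draw samples from those integer indices instead of from a copied data frame. The names rows_by_cyl, grid and sample_mean_mpg are made up for this example, and the seed is arbitrary.

```r
# Sketch: sample row indices per cyl group instead of copying nested data.
set.seed(42)  # arbitrary seed, only so the run is repeatable

# Split the row numbers of mtcars by cyl once; only integer indices are stored.
rows_by_cyl <- split(seq_len(nrow(mtcars)), mtcars$cyl)

# One row per (cyl, sample size, replicate) combination: 3 * 20 * 4 = 240.
grid <- expand.grid(
  n_replicate = 1:4,
  n_random_sample = 1:20,
  cyl = as.numeric(names(rows_by_cyl))
)

# Draw n indices with replacement from one group and average mpg over them.
sample_mean_mpg <- function(cyl, n) {
  rows <- rows_by_cyl[[as.character(cyl)]]
  idx <- rows[sample.int(length(rows), n, replace = TRUE)]
  mean(mtcars$mpg[idx])
}

grid$res <- mapply(sample_mean_mpg, grid$cyl, grid$n_random_sample)
head(grid)
```

sample.int() is used on the group length rather than sample() on the index vector, so a group with a single row would still be handled correctly. Whether this is actually faster on your real data is something to benchmark rather than assume.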
2 Answers
Accepted answer (score 3):
This is an alternative solution that filters the original dataset and samples rows inside a function, instead of using nest() to create the sub-datasets, store them in a list column, and then sample from them with map():
library(tidyverse)

# sample n rows (with replacement) from one cyl group and return the mean mpg
f <- function(cyl_value, n) {
  mtcars %>%
    filter(cyl == cyl_value) %>%
    sample_n(n, replace = TRUE) %>%
    summarise(mean_mpg = mean(mpg)) %>%
    pull(mean_mpg)
}

# vectorise the function so it accepts whole columns inside mutate()
f <- Vectorize(f)

# set seed for reproducibility
set.seed(11)

# as_tibble() replaces tbl_df(), which is deprecated in current dplyr
as_tibble(mtcars) %>%
  distinct(cyl) %>%
  crossing(n_random_sample = 1:20, n_replicate = 1:4) %>%
  mutate(res = f(cyl, n_random_sample))
# # A tibble: 240 x 4
#      cyl n_random_sample n_replicate   res
#    <dbl>           <int>       <int> <dbl>
#  1     6               1           1  21
#  2     6               1           2  21
#  3     6               1           3  18.1
#  4     6               1           4  21
#  5     6               2           1  20.4
#  6     6               2           2  21.2
#  7     6               2           3  20.4
#  8     6               2           4  19.6
#  9     6               3           1  18.4
# 10     6               3           2  19.6
# # ... with 230 more rows
answered Nov 19 at 14:51
AntoniosK
Works as intended, thank you.
– Philippe Massicotte
Nov 19 at 15:47
Answer (score 1):
mm <- lapply(rep(1:20, each = 4), sample_n, tbl = mtcars)
This gives you a list of data frames with 1 to 20 rows, with 4 samples of each size.
You can follow up with this to name the elements of the list (each name has the form sample.<replicate>-<size>):
names(mm) <- paste0("sample.", apply(expand.grid(1:4, 1:20), 1, paste, collapse = "-"))
Result:
head(mm, 5)
$`sample.1-1`
              mpg cyl disp  hp drat    wt qsec vs am gear carb
Lotus Europa 30.4   4 95.1 113 3.77 1.513 16.9  1  1    5    2
$`sample.2-1`
              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
$`sample.3-1`
             mpg cyl disp hp drat    wt  qsec vs am gear carb
Honda Civic 30.4   4 75.7 52 4.93 1.615 18.52  1  1    4    2
$`sample.4-1`
               mpg cyl  disp hp drat    wt  qsec vs am gear carb
Toyota Corona 21.5   4 120.1 97  3.7 2.465 20.01  1  0    3    1
$`sample.1-2`
              mpg cyl disp  hp drat   wt qsec vs am gear carb
Ferrari Dino 19.7   6  145 175 3.62 2.77 15.5  0  1    5    6
Volvo 142E   21.4   4  121 109 4.11 2.78 18.6  1  1    4    2
answered Nov 19 at 14:39
iod
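The list built in this answer holds whole sampled tables, while the question asks for one mean mpg per sample. A possible follow-up is sketched below, assuming dplyr is installed. The names sizes, replicates and res are made up for this example, the seed is arbitrary, and the naming line is a vectorised variant that should yield the same names as the expand.grid() call above. Like the answer itself, it samples from all of mtcars without replacement (the sample_n() default) and does not group by cyl.

```r
library(dplyr)

set.seed(1)  # arbitrary seed, only so the run is repeatable

# Sample sizes 1..20, each repeated 4 times, as in the answer above.
sizes <- rep(1:20, each = 4)
replicates <- rep(1:4, times = 20)

# One sampled data frame per (size, replicate) pair.
mm <- lapply(sizes, sample_n, tbl = mtcars)
names(mm) <- paste0("sample.", replicates, "-", sizes)

# Collapse each sampled table to its mean mpg.
res <- data.frame(
  n_random_sample = sizes,
  n_replicate = replicates,
  mean_mpg = unname(vapply(mm, function(d) mean(d$mpg), numeric(1)))
)
head(res)
```

vapply() is used instead of sapply() so that the result is guaranteed to be a numeric vector with one value per list element.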