Apply function to multiple groups using Rcpp and R function
I'm trying to apply a function to multiple groups/IDs in R using the foreach package. It's taking forever to run using parallel processing via %dopar%, so I was wondering if it's possible to run the apply or for loop portion in C++ via Rcpp or other packages to make it faster. I'm not familiar with C++ or other packages that can do this, so I'm hoping to learn if this is possible. The sample code is below. My actual function is longer, with over 20 inputs, and takes even longer to run than what I'm posting.
I appreciate the help.
EDIT:
I realized my initial question was vague, so I'll try to do a better job. I have a table with time series data by group. Each group has > 10K rows. I have written a function in C++ via Rcpp that filters the table by group and applies a function. I would like to loop through the unique groups and combine the results, like rbind does, using Rcpp so that it runs faster. See the sample code below (my actual function is longer).
library(data.table)
library(inline)
library(Rcpp)
library(stringi)
library(Runuran)
# Fake data
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
                                                 pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
df <- DT[order(Group)][
  , .(Month = seq(1, 180, 1),
      Col1 = urnorm(180, mean = 500, sd = 1, lb = 5, ub = 1000),
      Col2 = urnorm(180, mean = 1000, sd = 1, lb = 5, ub = 1000),
      Col3 = urnorm(180, mean = 300, sd = 1, lb = 5, ub = 1000)),
  by = Group
]
# Rcpp function
#include <Rcpp.h>
using namespace Rcpp;
// [[Rcpp::plugins(cpp11)]]
// [[Rcpp::export]]
DataFrame testFunc(DataFrame df, StringVector ids, double var1, double var2) {
  // Filter by group
  // (note: ind is computed here but is not applied to the columns below,
  //  so the sums are taken over all rows of df)
  using namespace std;
  StringVector sub = df["Group"];
  std::string level = Rcpp::as<std::string>(ids[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }
  // Access the columns
  CharacterVector Group = df["Group"];
  DoubleVector Month = df["Month"];
  DoubleVector Col1 = df["Col1"];
  DoubleVector Col2 = df["Col2"];
  DoubleVector Col3 = df["Col3"];
  // Create calculations
  DoubleVector Cola = Col1 * (var1 * var2);
  DoubleVector Colb = Col2 * (var1 * var2);
  DoubleVector Colc = Col3 * (var1 * var2);
  DoubleVector Cold = (Cola + Colb + Colc);
  // Result summary
  std::string Group_ID = level;
  double SumCol1 = sum(Col1);
  double SumCol2 = sum(Col2);
  double SumCol3 = sum(Col3);
  double SumColAll = sum(Cold);
  // return a new data frame
  return DataFrame::create(_["Group_ID"]= Group_ID, _["SumCol1"]= SumCol1,
                           _["SumCol2"]= SumCol2, _["SumCol3"]= SumCol3, _["SumColAll"]= SumColAll);
}
# Test function
Rcpp::sourceCpp('sample.cpp')
testFunc(df, ids = "BFTHU1315C", var1 = 24, var2 = 76) # ideally I would like to loop through all groups (unique(df$Group))
# Group_ID SumCol1 SumCol2 SumCol3 SumColAll
# 1 BFTHU1315C 899994.6 1798561 540001.6 5907129174
Thanks in advance.
r for-loop foreach rcpp rcppparallel
You write "I'm not familiar with c++" and that is fine. Just don't expect a random stranger here to write the code for you. That's not how StackOverflow works, see the tour here for more.
– Dirk Eddelbuettel
Nov 17 at 2:55
Thanks for the response @DirkEddelbuettel. I have edited my question with working C++ code.
– user2566907
Nov 20 at 2:07
edited Nov 20 at 5:08
asked Nov 17 at 2:32
user2566907
1 Answer
I would suggest rethinking your approach. Your test data set, which I assume is comparable to your real data set, has 3e8 rows. I am estimating about 10 GB of data. You seem to do the following with this data:
- Determine the list of unique IDs (about 5e5)
- Create one task per unique ID
- Each of these tasks gets the full data set and filters out all data that does not belong to the ID in question
- Each of these tasks adds some additional columns that do not depend on the ID
- Each of the tasks does a group_by(ID), but there is only one ID left in the data set
- Each of the tasks calculates some simple means
To me this appears very inefficient w.r.t. memory usage. Generally speaking, for problems like this you would want "shared memory parallelism", but foreach gives you only "process parallelism". The downside of process parallelism is that it increases the memory cost.
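As a rough illustration of the shared-memory side: data.table, one R package that uses OpenMP threads inside a single R process, lets you inspect and change its thread count with getDTthreads() and setDTthreads(). The value 2 below is an arbitrary choice for the example, and the actual count depends on your machine and on whether data.table was built with OpenMP support.

```r
library(data.table)

# Number of threads data.table currently uses within this one R process
getDTthreads()

# Arbitrary example value: ask for 2 threads.
# setDTthreads() invisibly returns the previous setting.
old <- setDTthreads(2)
getDTthreads()

# Restore the previous setting
setDTthreads(old)
```

All threads work on the same data in memory, so no per-worker copies of the table are needed.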
In addition, you are throwing away all the grouping and aggregation code that exists in base R / dplyr / data.table / SQL engines / ... It is very unlikely that you or anyone reading your question here would be able to improve on these existing code bases.
My suggestions:
- Forget about "process parallelism" (for now)
- If you have sufficient RAM, try with a simple dplyr pipe with mutate/group_by/summarize.
- If that is not fast enough, learn how aggregation works with data.table, which is known to be faster and offers "shared memory parallelism" via OpenMP.
- If your computer does not have enough memory and is swapping, then look into possibilities for out-of-memory computation. Personally I would use an (embedded) database.
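The dplyr suggestion could be sketched roughly as follows. This is a minimal example on a small made-up data set (df_small, with three hypothetical groups "A", "B", "C"), not the data from the question; var1 and var2 reuse the values from the question's test call.

```r
library(dplyr)

set.seed(42)
var1 <- 24
var2 <- 76

# Small hypothetical data set: 3 groups with 180 rows each
df_small <- data.frame(
  Group = rep(c("A", "B", "C"), each = 180),
  Col1  = rnorm(540, mean = 500),
  Col2  = rnorm(540, mean = 1000),
  Col3  = rnorm(540, mean = 300)
)

# Add the derived columns once for the whole table,
# then let dplyr do the grouping and aggregation
res_dplyr <- df_small %>%
  mutate(Cola = Col1 * (var1 * var2),
         Colb = Col2 * (var1 * var2),
         Colc = Col3 * (var1 * var2),
         Cold = Cola + Colb + Colc) %>%
  group_by(Group) %>%
  summarize(SumCol1   = sum(Col1),
            SumCol2   = sum(Col2),
            SumCol3   = sum(Col3),
            SumColAll = sum(Cold))

res_dplyr
```

The result has one row per group, and no per-group filtering of the full table is needed.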
To make this more explicit, here is a data.table-only solution:
library(data.table)
library(stringi)

# Fake data
set.seed(42)
var1 <- 24
var2 <- 76
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
                                                 pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
df <- DT[order(Group)][
  , .(Month = seq(1, 180, 1),
      Col1 = rnorm(180, mean = 500, sd = 1),
      Col2 = rnorm(180, mean = 1000, sd = 1),
      Col3 = rnorm(180, mean = 300, sd = 1)),
  by = Group
][, c("Cola", "Colb", "Colc") := .(Col1 * (var1 * var2),
                                   Col2 * (var1 * var2),
                                   Col3 * (var1 * var2))
][, Cold := Cola + Colb + Colc]
# set the key only after df has been created
setkey(df, Group)

# aggregation
df[, .(SumCol1 = sum(Col1),
       SumCol2 = sum(Col2),
       SumCol3 = sum(Col3),
       SumColAll = sum(Cold)), by = Group]
I am adding the computed columns by reference. The aggregation step uses the grouping functionality provided by data.table. In case your aggregation is more complicated, you can also use a function:
# aggregation function
mySum <- function(Col1, Col2, Col3, Cold) {
  list(SumCol1 = sum(Col1),
       SumCol2 = sum(Col2),
       SumCol3 = sum(Col3),
       SumColAll = sum(Cold))
}
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
And if the aggregation might be faster when using C++ (not the case for things like sum!), you can even use that:
# aggregation function in C++
Rcpp::cppFunction('
  Rcpp::List mySum(Rcpp::NumericVector Col1,
                   Rcpp::NumericVector Col2,
                   Rcpp::NumericVector Col3,
                   Rcpp::NumericVector Cold) {
    double SumCol1 = Rcpp::sum(Col1);
    double SumCol2 = Rcpp::sum(Col2);
    double SumCol3 = Rcpp::sum(Col3);
    double SumColAll = Rcpp::sum(Cold);
    return Rcpp::List::create(Rcpp::Named("SumCol1") = SumCol1,
                              Rcpp::Named("SumCol2") = SumCol2,
                              Rcpp::Named("SumCol3") = SumCol3,
                              Rcpp::Named("SumColAll") = SumColAll);
  }
')
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
In all these examples the grouping and looping is left to data.table, since you won't gain anything by doing this yourself.
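To see what "doing this yourself" would amount to, here is a minimal sketch of the explicit loop over the unique groups, combined rbind-style with rbindlist(), next to the equivalent by = Group call. The table dt, its three groups and the helper summarise_one() are made up for this illustration and are not taken from the question.

```r
library(data.table)

set.seed(42)
var1 <- 24
var2 <- 76

# Small hypothetical data set: 3 groups with 180 rows each
dt <- data.table(
  Group = rep(c("A", "B", "C"), each = 180),
  Col1  = rnorm(540, mean = 500),
  Col2  = rnorm(540, mean = 1000),
  Col3  = rnorm(540, mean = 300)
)
dt[, Cold := (Col1 + Col2 + Col3) * (var1 * var2)]

# Hypothetical per-group helper: filter the full table down to one group
# and return a one-row summary
summarise_one <- function(d, id) {
  sub <- d[Group == id]
  data.table(Group     = id,
             SumCol1   = sum(sub$Col1),
             SumCol2   = sum(sub$Col2),
             SumCol3   = sum(sub$Col3),
             SumColAll = sum(sub$Cold))
}

# Explicit loop over the unique groups, combined rbind-style
looped <- rbindlist(lapply(unique(dt$Group), function(id) summarise_one(dt, id)))

# The same summary with grouping and looping left to data.table
grouped <- dt[, .(SumCol1   = sum(Col1),
                  SumCol2   = sum(Col2),
                  SumCol3   = sum(Col3),
                  SumColAll = sum(Cold)), by = Group]

# Compare the two results (up to floating-point tolerance)
all.equal(looped, grouped)
```

The loop scans the whole table once per group, while the grouped call only needs to partition the rows once, which is why the explicit loop becomes more expensive as the number of groups grows.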
Thanks for the response. My initial question was vague, so I've edited it and explained what I'm trying to achieve.
– user2566907
Nov 20 at 2:09
@user2566907 Your edit did not make the question clearer to me. Let's hope that my edit made my answer clearer to you.
– Ralf Stubner
Nov 20 at 10:32
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
StackExchange.using("externalEditor", function () {
StackExchange.using("snippets", function () {
StackExchange.snippets.init();
});
});
}, "code-snippets");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "1"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
convertImagesToLinks: true,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: 10,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstackoverflow.com%2fquestions%2f53347666%2fapply-function-to-multiple-groups-using-rcpp-and-r-function%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
up vote
2
down vote
I would suggest to rethink our approach. Your test data set, which I assume is comparable to your real data set, has 3e8 rows. I am estimating about 10 GB of data. You seem to do the following with this data:
- Determine the list of unique IDs (about 5e5)
- Create one task per unique ID
- Each of these tasks gets the full data set and filters out all data that does not belong to the ID in question
- Each of these tasks adds some additional columns that do not depend on the ID
- Each of the tasks does a
group_b(ID)
, but there is only one ID left in the data set - Each of the tasks calculates some simple means
To me this appears very inefficient w.r.t. memory usage. Generally speaking for problems like this you would want "shared memory parallelism", but foreach
gives you only "process parallelism". The downside of process parallelism is that it increases the memory cost.
In addition, you are throwing away all the grouping and aggregation code that exists in base R / dplyr / data.table / SQL engines / ... It is very unlikely that you or any one reading your question here would be able to improve on these existing code bases.
My suggestions:
- Forget about "process parallelism" (for now)
- If you have sufficient RAM, try with a simple
dplyr
pipe withmutate
/group_by
/summarize
. - If that is not fast enough, learn how aggregation works with
data.table
, which is known to be faster and offers "shared memory paralleism" via OpenMP. - If your computer does not have enough memory and is swapping, then look into possibilities for out-of-memory computation. Personally I would use a (embedded) database.
To make this more explicit. Here a data.table
only solution:
library(data.table)
library(stringi)
# Fake data
set.seed(42)
var1 <- 24
var2 <- 76
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
setkey(df, Group)
df <- DT[order(Group)][
, .(Month = seq(1, 180, 1),
Col1 = rnorm(180, mean = 500, sd = 1),
Col2 = rnorm(180, mean = 1000, sd = 1),
Col3 = rnorm(180, mean = 300, sd = 1)),
by = Group
][, c("Cola", "Colb", "Colc") := .(Col1 * (var1 * var2),
Col2 * (var1 * var2),
Col3 * (var1 * var2))
][, Cold := Cola + Colb + Colc]
# aggregagation
df[, .(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold)), by = Group]
I am adding the computed columns by reference. The aggregation step uses the grouping functionality provided by data.table
. In case your aggregation is more complicated, you can also use a function:
# aggregation function
mySum <- function(Col1, Col2, Col3, Cold) {
list(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold))
}
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
And if the aggregation might be faster when using C++ (not the case for things like sum
!), you can even use that:
# aggregation function in C++
Rcpp::cppFunction('
Rcpp::List mySum(Rcpp::NumericVector Col1,
Rcpp::NumericVector Col2,
Rcpp::NumericVector Col3,
Rcpp::NumericVector Cold) {
double SumCol1 = Rcpp::sum(Col1);
double SumCol2 = Rcpp::sum(Col2);
double SumCol3 = Rcpp::sum(Col3);
double SumColAll = Rcpp::sum(Cold);
return Rcpp::List::create(Rcpp::Named("SumCol1") = SumCol1,
Rcpp::Named("SumCol2") = SumCol2,
Rcpp::Named("SumCol3") = SumCol3,
Rcpp::Named("SumColAll") = SumColAll);
}
')
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
In all these examples the groping and looping is left to data.table
, since you won't gain anything by doing this yourself.
Thanks for the response. My initial question was vague, So I've edited it and explained what I'm trying to achieve.
– user2566907
Nov 20 at 2:09
@user2566907 Your edit did not make the question clearer to me. Let's hope that my edit made my answer clearer to you.
– Ralf Stubner
Nov 20 at 10:32
add a comment |
up vote
2
down vote
I would suggest to rethink our approach. Your test data set, which I assume is comparable to your real data set, has 3e8 rows. I am estimating about 10 GB of data. You seem to do the following with this data:
- Determine the list of unique IDs (about 5e5)
- Create one task per unique ID
- Each of these tasks gets the full data set and filters out all data that does not belong to the ID in question
- Each of these tasks adds some additional columns that do not depend on the ID
- Each of the tasks does a
group_b(ID)
, but there is only one ID left in the data set - Each of the tasks calculates some simple means
To me this appears very inefficient w.r.t. memory usage. Generally speaking for problems like this you would want "shared memory parallelism", but foreach
gives you only "process parallelism". The downside of process parallelism is that it increases the memory cost.
In addition, you are throwing away all the grouping and aggregation code that exists in base R / dplyr / data.table / SQL engines / ... It is very unlikely that you or any one reading your question here would be able to improve on these existing code bases.
My suggestions:
- Forget about "process parallelism" (for now)
- If you have sufficient RAM, try with a simple
dplyr
pipe withmutate
/group_by
/summarize
. - If that is not fast enough, learn how aggregation works with
data.table
, which is known to be faster and offers "shared memory paralleism" via OpenMP. - If your computer does not have enough memory and is swapping, then look into possibilities for out-of-memory computation. Personally I would use a (embedded) database.
To make this more explicit. Here a data.table
only solution:
library(data.table)
library(stringi)
# Fake data
set.seed(42)
var1 <- 24
var2 <- 76
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
setkey(df, Group)
df <- DT[order(Group)][
, .(Month = seq(1, 180, 1),
Col1 = rnorm(180, mean = 500, sd = 1),
Col2 = rnorm(180, mean = 1000, sd = 1),
Col3 = rnorm(180, mean = 300, sd = 1)),
by = Group
][, c("Cola", "Colb", "Colc") := .(Col1 * (var1 * var2),
Col2 * (var1 * var2),
Col3 * (var1 * var2))
][, Cold := Cola + Colb + Colc]
# aggregagation
df[, .(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold)), by = Group]
I am adding the computed columns by reference. The aggregation step uses the grouping functionality provided by data.table
. In case your aggregation is more complicated, you can also use a function:
# aggregation function
mySum <- function(Col1, Col2, Col3, Cold) {
list(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold))
}
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
And if the aggregation might be faster when using C++ (not the case for things like sum
!), you can even use that:
# aggregation function in C++
Rcpp::cppFunction('
Rcpp::List mySum(Rcpp::NumericVector Col1,
Rcpp::NumericVector Col2,
Rcpp::NumericVector Col3,
Rcpp::NumericVector Cold) {
double SumCol1 = Rcpp::sum(Col1);
double SumCol2 = Rcpp::sum(Col2);
double SumCol3 = Rcpp::sum(Col3);
double SumColAll = Rcpp::sum(Cold);
return Rcpp::List::create(Rcpp::Named("SumCol1") = SumCol1,
Rcpp::Named("SumCol2") = SumCol2,
Rcpp::Named("SumCol3") = SumCol3,
Rcpp::Named("SumColAll") = SumColAll);
}
')
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
In all these examples the groping and looping is left to data.table
, since you won't gain anything by doing this yourself.
Thanks for the response. My initial question was vague, So I've edited it and explained what I'm trying to achieve.
– user2566907
Nov 20 at 2:09
@user2566907 Your edit did not make the question clearer to me. Let's hope that my edit made my answer clearer to you.
– Ralf Stubner
Nov 20 at 10:32
add a comment |
up vote
2
down vote
up vote
2
down vote
I would suggest to rethink our approach. Your test data set, which I assume is comparable to your real data set, has 3e8 rows. I am estimating about 10 GB of data. You seem to do the following with this data:
- Determine the list of unique IDs (about 5e5)
- Create one task per unique ID
- Each of these tasks gets the full data set and filters out all data that does not belong to the ID in question
- Each of these tasks adds some additional columns that do not depend on the ID
- Each of the tasks does a
group_b(ID)
, but there is only one ID left in the data set - Each of the tasks calculates some simple means
To me this appears very inefficient w.r.t. memory usage. Generally speaking for problems like this you would want "shared memory parallelism", but foreach
gives you only "process parallelism". The downside of process parallelism is that it increases the memory cost.
In addition, you are throwing away all the grouping and aggregation code that exists in base R / dplyr / data.table / SQL engines / ... It is very unlikely that you or any one reading your question here would be able to improve on these existing code bases.
My suggestions:
- Forget about "process parallelism" (for now)
- If you have sufficient RAM, try with a simple
dplyr
pipe withmutate
/group_by
/summarize
. - If that is not fast enough, learn how aggregation works with
data.table
, which is known to be faster and offers "shared memory paralleism" via OpenMP. - If your computer does not have enough memory and is swapping, then look into possibilities for out-of-memory computation. Personally I would use a (embedded) database.
To make this more explicit. Here a data.table
only solution:
library(data.table)
library(stringi)
# Fake data
set.seed(42)
var1 <- 24
var2 <- 76
DT <- data.table(Group = rep(do.call(paste0, Map(stri_rand_strings, n=10, length=c(5, 4, 1),
pattern = c('[A-Z]', '[0-9]', '[A-Z]'))), 180))
setkey(df, Group)
df <- DT[order(Group)][
, .(Month = seq(1, 180, 1),
Col1 = rnorm(180, mean = 500, sd = 1),
Col2 = rnorm(180, mean = 1000, sd = 1),
Col3 = rnorm(180, mean = 300, sd = 1)),
by = Group
][, c("Cola", "Colb", "Colc") := .(Col1 * (var1 * var2),
Col2 * (var1 * var2),
Col3 * (var1 * var2))
][, Cold := Cola + Colb + Colc]
# aggregagation
df[, .(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold)), by = Group]
I am adding the computed columns by reference. The aggregation step uses the grouping functionality provided by data.table
. In case your aggregation is more complicated, you can also use a function:
# aggregation function
mySum <- function(Col1, Col2, Col3, Cold) {
list(SumCol1 = sum(Col1),
SumCol2 = sum(Col2),
SumCol3 = sum(Col3),
SumColAll = sum(Cold))
}
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
And if the aggregation might be faster when using C++ (not the case for things like sum
!), you can even use that:
# aggregation function in C++
Rcpp::cppFunction('
Rcpp::List mySum(Rcpp::NumericVector Col1,
Rcpp::NumericVector Col2,
Rcpp::NumericVector Col3,
Rcpp::NumericVector Cold) {
double SumCol1 = Rcpp::sum(Col1);
double SumCol2 = Rcpp::sum(Col2);
double SumCol3 = Rcpp::sum(Col3);
double SumColAll = Rcpp::sum(Cold);
return Rcpp::List::create(Rcpp::Named("SumCol1") = SumCol1,
Rcpp::Named("SumCol2") = SumCol2,
Rcpp::Named("SumCol3") = SumCol3,
Rcpp::Named("SumColAll") = SumColAll);
}
')
df[, mySum(Col1, Col2, Col3, Cold), by = Group]
In all these examples the groping and looping is left to data.table
, since you won't gain anything by doing this yourself.
edited Nov 20 at 10:31
answered Nov 18 at 23:02
Ralf Stubner
Thanks for the response. My initial question was vague, so I've edited it and explained what I'm trying to achieve.
– user2566907
Nov 20 at 2:09
@user2566907 Your edit did not make the question clearer to me. Let's hope that my edit made my answer clearer to you.
– Ralf Stubner
Nov 20 at 10:32
You write "I'm not familiar with c++" and that is fine. Just don't expect a random stranger here to write the code for you. That's not how Stack Overflow works; see the tour for more.
– Dirk Eddelbuettel
Nov 17 at 2:55
Thanks for the response @DirkEddelbuettel. I have edited my question with a working c++ code.
– user2566907
Nov 20 at 2:07