undefined columns selected error in R when trying to subset using sapply

I have been tearing my hair out over this for the last hour. The following code was working perfectly a couple of hours ago, and now I have no idea why it doesn't work anymore. I have searched for other questions about the "undefined columns selected" error, and I think I have accounted for everything in those answers. I am sure there is some tiny thing I have overlooked or accidentally left in, but I can't see it!

I have a data frame with both factor and numeric variables. I want to subset it so that I keep all of the factor variables and remove the numeric variables whose column mean is < 0.1.

I found the following code in another question on Stack Overflow. Slightly modified, it worked well on my test data (a smaller sub-dataset I am using for testing before trying the code on a big 3 GB object):

meanfunction01 <- function(x) {
    if (is.numeric(x)) {
        mean(x) > 0.1
    } else {
        TRUE
    }
}

# then apply function to data table
Zdata <- Data1[, sapply(Data1, meanfunction01)]

I swear I was using this a few hours ago, but when I came back to it and tried to use it again, it stopped working and now just returns the following error:

Error in `[.data.frame`(Data1, , sapply(Data1, meanfunction01)) : 
  undefined columns selected

I was trying to modify the function so that it would loop over multiple objects (I have 54 objects I want to apply it to, and didn't want to type them all manually), but I don't think I edited the original function, and now it has stopped working.
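One way to run the same subsetting step over many objects is to collect them in a named list and use lapply(). The following is only a minimal sketch under stated assumptions: the two toy data frames and the names tissue_a, tissue_b, all_data and keep_column are hypothetical stand-ins for the 54 real objects, and the na.rm = TRUE and isTRUE() guards are an assumption added so that missing values cannot put an NA into the column mask.

```r
# Two toy data frames standing in for the 54 objects (names are hypothetical).
tissue_a <- data.frame(grp = factor(c("x", "y")),
                       g1 = c(0.5, 0.9), g2 = c(0.01, 0.02))
tissue_b <- data.frame(grp = factor(c("x", "y")),
                       g1 = c(0.0, 0.05), g2 = c(1.2, 3.4))

# Collect the objects in a named list so they can be processed together.
all_data <- list(tissue_a = tissue_a, tissue_b = tissue_b)

# Column filter: keep non-numeric columns and numeric columns with mean > 0.1.
# na.rm = TRUE and isTRUE() are assumptions that keep NA out of the mask.
keep_column <- function(x) {
  if (is.numeric(x)) isTRUE(mean(x, na.rm = TRUE) > 0.1) else TRUE
}

# Apply the same subsetting step to every data frame in the list.
filtered <- lapply(all_data, function(d) {
  d[, vapply(d, keep_column, logical(1)), drop = FALSE]
})
```

The result is a list with the same names as the input, so filtered$tissue_a is the subsetted version of tissue_a.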

A brief str() of my data:

> str(Data1[1:10])
'data.frame':   11 obs. of  10 variables:
 $ Name               : Factor w/ 11688 levels "GTEX-1117F-0226-SM-5GZZ7",..: 8186 8242 8262 8270 8343 8388 8403 8621 8689 8709 ...
 $ SEX                : Factor w/ 2 levels "Female","Male": 1 2 2 1 1 2 2 1 2 1 ...
 $ AGE                : Factor w/ 6 levels "20-29","30-39",..: 4 4 1 3 3 1 3 3 3 2 ...
 $ CIRCUMSTANCES: Factor w/ 5 levels "0","1","2","3",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ Tissue.x           : Factor w/ 53 levels "Adipose_Subcutaneous",..: 7 7 7 7 7 7 7 7 7 7 ...
 $ ENSG00000223972.4  : num  0 0.0701 0.0339 0.1149 0.0549 ...
 $ ENSG00000227232.4  : num  12.5 17.2 13.1 16 15.7 ...
 $ ENSG00000243485.2  : num  0.0717 0 0.1508 0 0.061 ...
 $ ENSG00000237613.2  : num  0 0.0654 0 0.0402 0.0768 ...
 $ ENSG00000268020.2  : num  0 0.0421 0.0611 0 0 ...

asked Nov 22 '18 at 16:05

Phil D

327

It will be very difficult to guess what the problem might be without a working example that demonstrates the issue. You may find some helpful tips in stackoverflow.com/questions/5963269/…

– Ista
Nov 22 '18 at 16:17

OK, in trying to head() my data to create an example dataset for you, I think I have narrowed down the problem. When I re-loaded the object into my environment, it seems to have categorised some of my columns as integer rather than numeric. When I run the code on the head([1:20]) subset it works fine, as the integer columns only start appearing later in the data, around column 10000 or so. Now I am trying to figure out how to recategorise these columns as numeric instead, which is a different problem entirely. Thanks anyway!

– Phil D
Nov 22 '18 at 16:52

Why just suspect that the data structure changes as you move along your multiple columns? Why not check how the columns are structured by running head() without restricting the range of columns? If it turns out that indeed some columns are integers rather than numeric then re-structure them using something along these lines: Data1[,4:33] <- lapply(Data1[,4:33], as.numeric)

– Chris Ruehlemann
Nov 22 '18 at 18:44

Well, mostly because the output of head() without restricting columns, in a data frame with >50000 columns, will be difficult to check manually! I have started restructuring using lapply, though.

– Phil D
Nov 26 '18 at 11:24
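With tens of thousands of columns, the column types can be summarised programmatically rather than read off head() or str(). The sketch below assumes a small toy data frame (df and its column names are hypothetical) and tabulates the first class of each column. It also shows that is.numeric() already returns TRUE for integer columns, so a filter based on is.numeric() treats them the same as doubles.

```r
# Toy data frame (hypothetical) mixing factor, double and integer columns.
df <- data.frame(
  grp   = factor(c("a", "b", "c")),
  expr1 = c(0.0, 0.07, 0.03),
  cnt   = c(1L, 0L, 2L)
)

# First class of every column, as a plain character vector.
col_class <- vapply(df, function(x) class(x)[1], character(1))

# Count columns per class instead of reading them one by one.
class_counts <- table(col_class)

# Positions of the integer columns, if any.
int_cols <- which(col_class == "integer")

# Integer columns already count as numeric for is.numeric().
all_numeric <- all(vapply(df[int_cols], is.numeric, logical(1)))
```

Printing class_counts gives one count per class, which stays readable however many columns the data frame has.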


1 Answer

If your only issue is changing the class of the integer variables in your data.frame, but you have many columns (>10000), you may want to consider converting your data.frame into a data.table. Your code would then look like this:

library(data.table)

Data1 <- data.table(Data1)  # or, if your data is in a CSV file, read it with fread() instead of read.csv(); fread() returns a data.table directly

Then you just need to find the integer columns using this:

which(sapply(Data1,is.integer))

Putting it all together using the data.table commands:

Data1[, which(sapply(Data1, is.integer)) := lapply(.SD, as.numeric), .SDcols = which(sapply(Data1, is.integer))]

Note that you don't need to assign the result of the above line to anything: data.table's := operator modifies the object by reference, which avoids copying and is typically much faster than the equivalent operation on a data.frame or tibble. Running the above line will therefore update your Data1 object in place. The classes of the other, non-integer columns (e.g., factors) will remain unchanged.
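For comparison, the same conversion can be sketched in base R without the data.table dependency. This is not the approach described above, only an alternative under stated assumptions: df is a hypothetical toy data frame, and unlike := this form does not update the object by reference.

```r
# Toy data frame (hypothetical) with one integer column.
df <- data.frame(
  grp = factor(c("a", "b")),
  cnt = c(1L, 2L),
  val = c(0.5, 1.5)
)

# Logical mask of integer columns; vapply() guarantees a logical vector.
# is.integer() is FALSE for factors, so they are not selected.
int_cols <- vapply(df, is.integer, logical(1))

# Convert only those columns; factors and doubles are left untouched.
df[int_cols] <- lapply(df[int_cols], as.numeric)
```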

Please update if you have further questions but this should answer your comment. Best of luck!

answered Nov 23 '18 at 0:11

Jason Johnson

965

Thanks for the advice, but I figured out the solution (to this problem at least) myself in the end. It wasn't the integers causing this problem (although they caused a different one); it was simply the presence of NA values, which I didn't realise I could ignore by using na.rm = TRUE inside the mean() function. Sorry for the very basic questions requiring very basic solutions, but I am still very new to coding and teaching myself!

– Phil D
Nov 26 '18 at 11:35
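The NA explanation in the comment above can be reproduced on a small example. The sketch below is a minimal illustration under stated assumptions: the toy data frame df and the function names keep_strict and keep_na_safe are hypothetical. When a numeric column contains NA, mean() returns NA, the comparison with 0.1 is NA, and an NA in the logical column index is what makes `[.data.frame` stop with "undefined columns selected".

```r
# Toy data frame (hypothetical); column "b" contains an NA.
df <- data.frame(
  grp = factor(c("x", "y", "z")),
  a   = c(0.5, 0.7, 0.9),
  b   = c(0.01, NA, 0.02),
  c   = c(0.01, 0.02, 0.03)
)

# Predicate in the style of the question: mean() is NA when x contains NA.
keep_strict <- function(x) if (is.numeric(x)) mean(x) > 0.1 else TRUE
mask_strict <- sapply(df, keep_strict)
# mask_strict is NA for "b", so df[, mask_strict] would stop with
# "undefined columns selected".

# NA-tolerant variant (keep_na_safe is a hypothetical name).
# isTRUE() also maps the NA produced by an all-NA column to FALSE.
keep_na_safe <- function(x) {
  if (is.numeric(x)) isTRUE(mean(x, na.rm = TRUE) > 0.1) else TRUE
}
mask_safe <- sapply(df, keep_na_safe)
result <- df[, mask_safe, drop = FALSE]
```

A quick check such as anyNA(mask_strict) before subsetting would flag this situation without having to inspect every column.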
