Subset multiple columns in R or unix
2
0
Entering edit mode
3.1 years ago

I need to extract many columns from a dataset. I have a very large csv file with thousands of columns and rows. In R for example, I can read it in using:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)


Each column is a gene name. I know how to extract specific columns from my R data.frame by using the basic code like this:

mydata[  , "GeneName1", "GeneName2"]


But my question is, how do I pull hundreds of gene names? Too many to type in? They are listed in a txt file.

I've used grep in UNIX before to pull multiple ROWS using a txt file with the list of genes I need, but I haven't been able to figure out how to do it with Columns.

subset pull columns R subset columns • 7.4k views
0
Entering edit mode

Can you transpose the data frame and extract the resulting rows?

t_mydata<-t(mydata)
subsampled_mydata <-  t_mydata[which( t_mydata\$Gene %in% geneList),]


supposing there is a column Gene in your new t_mydata data frame

0
Entering edit mode

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (text becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

1
Entering edit mode
3.1 years ago

in R, you could simply subset the data.frame that is returned by read.csv:

> test <- data.frame(A = c(1:3), B = c(3:5), C = c(6:8))
> test
A B C
1 1 3 6
2 2 4 7
3 3 5 8

## spell out the column names you're interested in
> test[, c("A","B")]
A B
1 1 3
2 2 4
3 3 5

## or use grepl
> test[, grepl("[A|B]", names(test)) ]

1
Entering edit mode
3.1 years ago

Read your genes list file and put it into a vector, then filter your dataframe using this vector

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)
genes_list <- scan("gene_list.txt", character(), quote = "")
mydata.new <- mydata[ ,genes_list]

0
Entering edit mode

Bastien, this worked, and so simple. Thank you!

0
Entering edit mode

Bastien, one more question. Your code works well, but only if every gene on the list is found in the csv file. If R comes to a gene that is not there, it will quit. Is there something I can add to that last line to skip any genes that it does not find, and run the script anyway? The error I am getting is: Error in [.data.frame(Mydata, , gene_list) : undefined columns selected

1
Entering edit mode
mydata.new <- mydata[ ,intersect(genes_list,colnames(mydata))]

0
Entering edit mode

Thank you!! That worked.