Question

Subset multiple columns in R or unix

0

6.3 years ago

julie.sawitzke • 0

I need to extract many columns from a dataset. I have a very large csv file with thousands of columns and rows. In R for example, I can read it in using:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)

Each column is a gene name. I know how to extract specific columns from my R data.frame by using the basic code like this:

mydata[, c("GeneName1", "GeneName2")]

But how do I pull hundreds of gene names, far too many to type in? They are listed in a txt file.

I've used grep in UNIX before to pull multiple ROWS using a txt file with the list of genes I need, but I haven't been able to figure out how to do it with Columns.

subset pull columns R subset columns • 9.4k views

ADD COMMENT • link updated 6.3 years ago by Ram 45k • written 6.3 years ago by julie.sawitzke • 0

0

Can you transpose the data frame and extract the resulting rows?

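t_mydata <- as.data.frame(t(mydata))  # t() returns a matrix, so convert back to a data frame
geneList <- read.table("your_geneList.txt", stringsAsFactors = FALSE)$V1  # first column as a character vector
subsampled_mydata <- t_mydata[rownames(t_mydata) %in% geneList, ]

After transposing, the gene names are the row names of t_mydata, so the filter matches on rownames(t_mydata) rather than on a Gene column.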

ADD REPLY • link 6.3 years ago by daniele.avancini ▴ 70

0

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLY • link 6.3 years ago by Ram 45k

score 1 · Answer 1 · 2019-04-03

In R, you can simply subset the data.frame that read.csv returns:

> test <- data.frame(A = c(1:3), B = c(3:5), C = c(6:8))
> test
  A B C
1 1 3 6
2 2 4 7
3 3 5 8

## spell out the column names you're interested in
> test[, c("A","B")]
  A B
1 1 3
2 2 4
3 3 5

## or use grepl with an anchored pattern, so only columns named exactly A or B match
> test[, grepl("^(A|B)$", names(test))]
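One caveat with pattern matching on column names: an unanchored pattern keeps every column whose name merely contains the pattern, which can matter with gene names that share a prefix. The sketch below uses a made-up data frame (the extra column `AB` and the vector `wanted` are hypothetical, added only for illustration) to contrast a loose `grepl` match with exact matching via `%in%`.

```r
# Toy data frame; the column "AB" is a hypothetical addition for illustration
test <- data.frame(A = 1:3, B = 3:5, C = 6:8, AB = 9:11)
wanted <- c("A", "B")

# Unanchored character class: matches any name containing an A or a B,
# so the column "AB" is selected as well
loose <- names(test)[grepl("[AB]", names(test))]

# Exact matching against the list of wanted names
exact <- names(test)[names(test) %in% wanted]

print(loose)
print(exact)

# drop = FALSE keeps the result a data frame even if only one column matches
test_exact <- test[, exact, drop = FALSE]
```

For a long list of gene names, exact matching with `%in%` is usually easier to reason about than building one large regular expression.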

score 1 · Answer 2 · 2019-04-03

1

6.3 years ago

Bastien Hervé 6.4k

Read your gene list file into a character vector, then subset your data frame with that vector:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)
genes_list <- scan("gene_list.txt", character(), quote = "")
mydata.new <- mydata[ ,genes_list]
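One thing that may be worth checking with this approach: by default `read.csv` uses `check.names = TRUE`, which rewrites column names that are not syntactically valid R names (for example, a hyphen becomes a dot). Gene names in the list would then no longer match the column names. The sketch below illustrates this with a small temporary file; the file contents and gene names are made up for the example.

```r
# Write a tiny hypothetical CSV with a hyphenated gene name
csv_path <- tempfile(fileext = ".csv")
writeLines(c("sample,HLA-A,TP53", "s1,1,2", "s2,3,4"), csv_path)

# Default behaviour: column names are passed through make.names()
mangled <- read.csv(csv_path, header = TRUE, row.names = 1)
print(colnames(mangled))

# check.names = FALSE keeps the names exactly as they appear in the file
kept <- read.csv(csv_path, header = TRUE, row.names = 1, check.names = FALSE)
print(colnames(kept))

# Subsetting by the original gene names now works
genes_list <- c("HLA-A", "TP53")
mydata.new <- kept[, genes_list, drop = FALSE]

unlink(csv_path)
```

If all gene names in the file are already valid R names, the default setting makes no difference.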

ADD COMMENT • link 6.3 years ago by Bastien Hervé 6.4k

0

Bastien, this worked, and so simple. Thank you!

ADD REPLY • link 6.3 years ago by julie.sawitzke • 0

0

Bastien, one more question. Your code works well, but only if every gene on the list is found in the csv file. If R comes to a gene that is not there, it will quit. Is there something I can add to that last line to skip any genes that it does not find, and run the script anyway? The error I am getting is: Error in `[.data.frame`(Mydata, , gene_list) : undefined columns selected

ADD REPLY • link 6.3 years ago by julie.sawitzke • 0

1

mydata.new <- mydata[ ,intersect(genes_list,colnames(mydata))]
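Since `intersect` silently drops genes that are absent, it can be useful to see which ones were skipped. A minimal sketch, assuming a toy data frame and gene list (both made up here), that uses `setdiff` to report the missing names before subsetting:

```r
# Hypothetical stand-ins for the real data and gene list
mydata <- data.frame(GeneA = 1:3, GeneB = 4:6, GeneC = 7:9,
                     row.names = c("s1", "s2", "s3"))
genes_list <- c("GeneC", "GeneX", "GeneA")

# Genes present in the data, in the order they appear in genes_list
found <- intersect(genes_list, colnames(mydata))

# Genes requested but absent from the data
not_found <- setdiff(genes_list, colnames(mydata))
if (length(not_found) > 0) {
  message("Skipping ", length(not_found), " gene(s) not in the data: ",
          paste(not_found, collapse = ", "))
}

# drop = FALSE keeps a data frame even when only one gene is found
mydata.new <- mydata[, found, drop = FALSE]
```

The columns of `mydata.new` follow the order of the gene list rather than the order in the CSV file.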