Question: Subset multiple columns in R or unix
9 months ago by
julie.sawitzke0 wrote:

I need to extract many columns from a dataset. I have a very large csv file with thousands of columns and rows. In R for example, I can read it in using:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)

Each column is a gene name. I know how to extract specific columns from my R data.frame by using the basic code like this:

mydata[, c("GeneName1", "GeneName2")]

But my question is, how do I pull hundreds of gene names, far too many to type in? They are listed in a txt file.

I've used grep in UNIX before to pull multiple ROWS using a txt file with the list of genes I need, but I haven't been able to figure out how to do it with Columns.

ADD COMMENTlink modified 9 months ago by RamRS25k • written 9 months ago by julie.sawitzke0

Can you transpose the data frame and extract the resulting rows?

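t_mydata <- t(mydata)
geneList <- read.table("your_geneList.txt", stringsAsFactors = FALSE)$V1
subsampled_mydata <- t_mydata[rownames(t_mydata) %in% geneList, ]

Note that t() returns a matrix whose row names are the gene names (the column names of mydata), so the genes are matched against rownames(t_mydata) rather than against a Gene column.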

ADD REPLYlink written 9 months ago by daniele.avancini20

Please use the formatting bar (especially the code option) to present your post better. You can use backticks for inline code (`text` becomes text), or select a chunk of text and use the highlighted button to format it as a code block. I've done it for you this time.

ADD REPLYlink written 9 months ago by RamRS25k
9 months ago by
Friederike5.2k
United States
Friederike5.2k wrote:

In R, you can simply subset the data.frame that is returned by read.csv:

> test <- data.frame(A = c(1:3), B = c(3:5), C = c(6:8))
> test
  A B C
1 1 3 6
2 2 4 7
3 3 5 8

## spell out the column names you're interested in
> test[, c("A","B")]
  A B
1 1 3
2 2 4
3 3 5

## or use grepl with a regular expression that matches the whole column name
> test[, grepl("^(A|B)$", names(test)) ]
ADD COMMENTlink written 9 months ago by Friederike5.2k
9 months ago by
Bastien Hervé4.5k
Limoges, CBRS, France
Bastien Hervé4.5k wrote:

Read your gene list file into a character vector, then use that vector to select the matching columns of your data frame:

mydata <- read.csv(file = "file.csv",header = TRUE,sep = ",",row.names = 1)
genes_list <- scan("gene_list.txt", character(), quote = "")
mydata.new <- mydata[ ,genes_list]
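
For anyone who wants to try this without the real files, here is a minimal self-contained sketch of the same idea. The toy data frame, the gene names and the temporary file are made up for illustration and stand in for file.csv and gene_list.txt; `drop = FALSE` is an addition that is not in the code above and only matters when the list contains a single gene.

```r
# Toy stand-in for file.csv: rows are samples, columns are genes (hypothetical names)
mydata <- data.frame(GeneA = 1:3, GeneB = 4:6, GeneC = 7:9,
                     row.names = c("s1", "s2", "s3"))

# Toy stand-in for gene_list.txt: one gene name per line
list_file <- tempfile(fileext = ".txt")
writeLines(c("GeneC", "GeneA"), list_file)

# Read the gene names into a character vector
genes_list <- scan(list_file, what = character(), quote = "", quiet = TRUE)

# Select the listed columns; drop = FALSE keeps the result a data.frame
# even if only one gene is listed
mydata.new <- mydata[, genes_list, drop = FALSE]
print(mydata.new)
```

The columns come back in the order given in the gene list, not in their original order in the table.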
ADD COMMENTlink modified 9 months ago • written 9 months ago by Bastien Hervé4.5k

Bastien, this worked, and so simple. Thank you!

ADD REPLYlink written 9 months ago by julie.sawitzke0

Bastien, one more question. Your code works well, but only if every gene on the list is found in the csv file. If R comes to a gene that is not there, it will quit. Is there something I can add to that last line to skip any genes that it does not find, and run the script anyway? The error I am getting is: Error in [.data.frame(Mydata, , gene_list) : undefined columns selected

ADD REPLYlink written 9 months ago by julie.sawitzke0
mydata.new <- mydata[ ,intersect(genes_list,colnames(mydata))]
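
If it helps to know which genes were skipped, the same approach can be combined with setdiff. This is only a sketch on made-up data: the toy data frame, the gene names and the variables `found` and `missing_genes` are assumptions for illustration.

```r
# Toy data: the list asks for one gene (GeneX) that is not in the table
mydata <- data.frame(GeneA = 1:3, GeneB = 4:6, GeneC = 7:9)
genes_list <- c("GeneC", "GeneX", "GeneA")

# Genes present in both the list and the table, in the order of genes_list
found <- intersect(genes_list, colnames(mydata))

# Genes that were requested but are absent from the table
missing_genes <- setdiff(genes_list, colnames(mydata))

if (length(missing_genes) > 0) {
  message("Skipped genes not found: ", paste(missing_genes, collapse = ", "))
}

# drop = FALSE keeps a data.frame even if only one gene is found
mydata.new <- mydata[, found, drop = FALSE]
```

Reporting the skipped names is a cheap way to catch typos or naming mismatches between the list and the csv header.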
ADD REPLYlink written 9 months ago by Bastien Hervé4.5k

Thank you!! That worked.

ADD REPLYlink written 9 months ago by julie.sawitzke0