Question: Collapse Probes For Same Gene
8.0 years ago by
Rituriya30 wrote:

Dear All,

If there is more than one row of expression values for the same gene, I want to collapse them into a single row for that gene, keeping the highest value (maximum) in each column.

The file consists of 15 columns of expression values for 15 tissues. It's a text file containing Affymetrix probes with the Hgu133plus2 annotation. I have tried GATExplorer, ADAPT, BrainCDF, etc., but none of them was useful to me. Can anyone suggest a solution? I also tried GenePattern, but my file is too large for it to accept.

8.0 years ago by
Sydney, Australia
Neilfws48k wrote:

This is quite easy in R.

First, you'll need an extra column with the gene names. Assuming your file is tab-delimited with column headers, read it into R:

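mydat <- read.table("myfile.txt", header = TRUE, sep = "\t", stringsAsFactors = TRUE)

Assuming that gene names are in column 16 with the header "gene", that column should now be of class factor (in R 4.0.0 and later, stringsAsFactors = TRUE is needed for this, because strings are no longer converted to factors by default):

class(mydat$gene)
[1] "factor"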
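If the file only has probe IDs and no gene column yet, one way to add it is to merge in an annotation table. This is only a rough sketch: the probe IDs, gene names, and the column names "probe", "tissue1" and "tissue2" are made up for illustration, and where the real probe-to-gene table comes from is left open.

```r
# Hypothetical expression table: one row per probe
expr <- data.frame(
  probe = c("probe_1", "probe_2", "probe_3"),
  tissue1 = c(7.1, 8.4, 5.0),
  tissue2 = c(6.9, 9.2, 5.5)
)

# Hypothetical annotation table mapping each probe to a gene
annot <- data.frame(
  probe = c("probe_1", "probe_2", "probe_3"),
  gene = c("GENE_A", "GENE_A", "GENE_B")
)

# Add the gene column by matching on probe ID
mydat <- merge(expr, annot, by = "probe")

# Drop the probe ID so only expression columns and "gene" remain
mydat$probe <- NULL
print(mydat)
```

Dropping the probe column matters for the next step, because a formula like ". ~ gene" uses every remaining column.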

You can now calculate the maximum by gene name using aggregate:

mydat.max <- aggregate(. ~ gene, data = mydat, max)

The new variable mydat.max is a data frame with gene names in the first column and one row per gene holding the maximum value of each column.

Just to show a toy example: if the data frame mydat looks like this:

a    b  gene
1   11     A
2   12     A
3   13     A
4   14     A
5   15     A
6   16     B
7   17     B
8   18     B  
9   19     B
10  20     B

After aggregate, it becomes:

gene    a  b
   A    5 15
   B   10 20
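The toy example can be reproduced without an input file. A minimal sketch that builds the same data frame by hand and runs the same aggregate call:

```r
# Toy data frame matching the example above
mydat <- data.frame(
  a = 1:10,
  b = 11:20,
  gene = factor(rep(c("A", "B"), each = 5))
)

# Column-wise maximum per gene; "." stands for all columns other than gene
mydat.max <- aggregate(. ~ gene, data = mydat, FUN = max)
print(mydat.max)
```

Two things to watch with real data: the formula interface of aggregate drops any row containing an NA by default (na.action = na.omit), and "." also pulls in non-numeric columns such as a probe ID, so those should be removed first.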

Thank you so much neilfws! This is exactly what I wanted. Thank you once again.

written 8.0 years ago by Rituriya30
8.0 years ago by
The Netherlands
ALchEmiXt1.9k wrote:

Why should you want to do that in the first place?

Duplicate probes for a gene on an array are useful as controls, but they usually also allow you to detect differentially expressed variants (including possible splice variants). So just combining them into a single gene value loses analysis resolution and is probably dangerous as well.


It's quite common to collapse probes down to the gene level in some applications, and often a very simple metric such as the median is used. You can argue that information is lost, but so is noise.

written 8.0 years ago by Neilfws48k
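Collapsing by median instead of maximum only requires swapping the summary function passed to aggregate. A small sketch with made-up values (the columns "a" and "b" stand in for tissue columns):

```r
# Made-up example: three probes for gene A, two for gene B
mydat <- data.frame(
  a = c(1, 2, 9, 4, 6),
  b = c(10, 30, 20, 5, 7),
  gene = c("A", "A", "A", "B", "B")
)

# Column-wise median per gene
mydat.median <- aggregate(. ~ gene, data = mydat, FUN = median)
print(mydat.median)
```

The median is less sensitive than the maximum to a single outlying probe, which is the sense in which noise is reduced.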

@neilfws I agree on the data reduction and possibly the noise. But taking the highest value....?

written 8.0 years ago by ALchEmiXt1.9k
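One concern with a column-wise maximum is that the collapsed row can mix values from different probes. A possible alternative, sketched here as an assumption and not something proposed in this thread, is to keep one whole probe per gene, for example the probe with the highest mean expression across tissues. The column names "t1" and "t2" and the values are made up.

```r
# Made-up example: two probes per gene, two tissue columns
mydat <- data.frame(
  t1 = c(5, 9, 2, 7),
  t2 = c(6, 1, 3, 8),
  gene = c("A", "A", "B", "B")
)

# Mean expression of each probe across the tissue columns
expr.cols <- setdiff(names(mydat), "gene")
probe.mean <- rowMeans(mydat[, expr.cols])

# Sort probes by decreasing mean, then keep the first row seen for each gene
sorted <- mydat[order(probe.mean, decreasing = TRUE), ]
mydat.top <- sorted[!duplicated(sorted$gene), ]
print(mydat.top)
```

For gene A this keeps the probe with values (5, 6), whereas the column-wise maximum would report (9, 6), taking t1 from one probe and t2 from the other.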

Perhaps the OP meant to select the row with the lowest P-value (least likely to occur by chance)? An approach is suggested here.

written 3.9 years ago by alexvpickering40
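Keeping the row with the lowest P-value per gene could look roughly like the sketch below. It is not necessarily the approach referred to above; it assumes a results table with one row per probe, and the column names "probe", "gene", "logFC" and "P.Value" as well as all values are made up for illustration.

```r
# Hypothetical per-probe results table
res <- data.frame(
  probe = c("p1", "p2", "p3", "p4", "p5"),
  gene = c("A", "A", "A", "B", "B"),
  logFC = c(0.2, 1.5, -0.3, 0.8, 2.1),
  P.Value = c(0.40, 0.001, 0.20, 0.03, 0.01),
  stringsAsFactors = FALSE
)

# Sort by increasing P-value, then keep the first row seen for each gene
res.sorted <- res[order(res$P.Value), ]
res.best <- res.sorted[!duplicated(res.sorted$gene), ]
print(res.best)
```

Here probe p2 is kept for gene A and p5 for gene B, so each gene is represented by a single real probe.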
8.0 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith18k wrote:

If you have the .CEL file for your array and you wish to summarize it from the probe level to the gene level, you might try Aroma, Expression Console, Affy Power Tools, RMAExpress, or one of many other tools.

Or perhaps you already have a processed file that contains gene expression values but still has some cases where the same gene has multiple values. In that case, many people use a scripting language such as R, Perl, Awk, or Python for that kind of file manipulation. If that is what you mean, you can try posting a slice of your file and someone may provide examples...


Perhaps a small snippet/example using any of the packages in your first paragraph would be more informative.

written 8.0 years ago by brentp23k