7.4 years ago
sgblackpearl ▴ 10
If you want to do a correlation analysis on RNA-seq data, how do you analyze gene A and gene B when gene A has 5 transcripts and gene B has, let's say, 6 transcripts? Should I take the mean expression across the transcripts of gene A and of gene B respectively?
Instead of averaging (which might lump together isoforms with widely different expression patterns), use the APPRIS database. Assuming your data is from one of the well-studied model organisms, you can look up the principal isoform for any given gene there, for example on the page for TP53. If a gene has more than one principal isoform, I guess you could choose either.
The APPRIS annotations are also available as part of Ensembl BioMart.
Then take only that isoform and do comparisons.
Thanks, Amit, for your suggestion. Unfortunately my data is not from a model organism; it is from an unsequenced genome. My DGE list contains around 5k transcripts belonging to around 1000 genes. What should I do? Should I go for the highest-scoring transcript?
There is not an "accepted" way to do this. Depending on your use, though, choosing the highest-scoring transcript seems reasonable. If this is RNA-seq, though, why not just summarize to gene to begin with?
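For what it's worth, here is a minimal Python sketch of the "pick one transcript per gene" idea. It assumes that "highest-scoring" means highest mean expression across samples, which is only one possible reading; the transcript IDs, gene IDs, and values are made-up toy data, and `top_transcript_per_gene` is a hypothetical helper, not part of any package.

```python
# Toy data (hypothetical): transcript id -> (gene id, expression per sample)
transcripts = {
    "A.t1": ("geneA", [5.0, 6.0, 7.0]),
    "A.t2": ("geneA", [50.0, 40.0, 60.0]),
    "B.t1": ("geneB", [1.0, 2.0, 3.0]),
    "B.t2": ("geneB", [0.5, 0.5, 0.5]),
}


def top_transcript_per_gene(table):
    """Return {gene: transcript id} keeping the transcript with the
    highest mean expression across samples (assumed scoring rule)."""
    best = {}  # gene -> (transcript id, mean expression)
    for tid, (gene, values) in table.items():
        mean_expr = sum(values) / len(values)
        if gene not in best or mean_expr > best[gene][1]:
            best[gene] = (tid, mean_expr)
    return {gene: tid for gene, (tid, _) in best.items()}


chosen = top_transcript_per_gene(transcripts)
print(chosen)  # {'geneA': 'A.t2', 'geneB': 'B.t1'}
```

If you have some other per-transcript score (assembly support, ORF length, etc.), you would swap it in for `mean_expr`.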
You say it's an unsequenced genome and yet you have isoform info available. I'm not sure if you know which of the isoforms are protein-coding. If that info is available, then you could select based on that.
In my experience with RNA-seq data (human only, though), I have often seen a transcript isoform with a very high expression level whose biotype turns out to be a nonsense-mediated decay (NMD) candidate, a 'processed transcript', or a similar ncRNA variant.
Hence I am not comfortable with going for the highest-expressed isoform. But again, I am not sure how comprehensive the gene annotation is for your organism. If it contains mostly protein-coding variants and not many ncRNAs, then you could summarize the isoforms for each gene, as Sean already suggested.
Exactly, Amit. The annotation is not comprehensive for this organism. What exactly do you mean by summarizing the isoforms for each gene? Taking the average?
What I meant was you could average over the isoforms for each gene. This should be ok if most isoforms are protein-coding variants and not many ncRNA isoforms are present.
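To make the averaging concrete, here is a rough Python sketch that collapses isoform-level values to one value per gene per sample. The data and the `summarize_to_gene` helper are hypothetical; the `how="sum"` option is included only as an assumption about what a gene-level summary could also look like, since summing isoform abundances is another common choice.

```python
# Toy data (hypothetical): transcript id -> (gene id, expression per sample)
isoforms = {
    "A.t1": ("geneA", [1.0, 2.0, 3.0]),
    "A.t2": ("geneA", [3.0, 4.0, 5.0]),
    "B.t1": ("geneB", [10.0, 20.0, 30.0]),
}


def summarize_to_gene(table, how="mean"):
    """Collapse isoforms to one expression vector per gene.

    how="mean" averages the isoforms sample by sample;
    how="sum" adds them up instead.
    """
    per_gene = {}  # gene -> list of per-isoform expression vectors
    for gene, values in table.values():
        per_gene.setdefault(gene, []).append(values)

    summary = {}
    for gene, rows in per_gene.items():
        columns = list(zip(*rows))  # one tuple of isoform values per sample
        if how == "sum":
            summary[gene] = [sum(col) for col in columns]
        else:
            summary[gene] = [sum(col) / len(col) for col in columns]
    return summary


gene_means = summarize_to_gene(isoforms)
print(gene_means["geneA"])  # [2.0, 3.0, 4.0]
```

Keep in mind the caveat above: the average is only meaningful if the isoforms being combined are mostly protein-coding and behave similarly.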
Amit, do you have a script for calculating the Pearson correlation between gene A and gene B, each having expression data from multiple timepoints?
I have used R for calculating correlation coefficients, using the cor.test command. I must say I haven't done much specifically with gene expression. I tried MINE on a large set of Illumina bead arrays and also tried an R package that also employs non-parametric distance measures. This was exploratory work, and I left it halfway after I ran into memory problems (>100 arrays, >40k genes). I can't help you much more than that.
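In case a script helps, here is a small, dependency-free Python sketch of the Pearson coefficient between two genes measured at the same timepoints. It only computes r; unlike R's cor.test it does not give a p-value or confidence interval. The expression values and the `pearson` function name are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sum of cross-products and sums of squares of the centered values
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical gene-level expression at five timepoints
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0]
print(pearson(gene_a, gene_b))  # close to 1.0 for this perfectly linear toy pair
```

With only a handful of timepoints, r is very noisy, so treat any single gene pair with caution.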