Question: How To Find Differentially Expressed Transcripts Across Tissues?
6.2 years ago, daattali wrote:

Hello fellow bioinformaticians,

This may well be an easy and solved problem, but I didn't find a standard solution for this. I'm also extremely new to the field, so please excuse me :)

I have expression data for different transcripts from 386 proteins in 25 different tissues (from GTEx - yes, the one that was getting all the bad rep recently...). I'm trying to find out whether any proteins have transcripts that are differentially expressed across tissues. I know that the transcripts themselves will be expressed at very different levels, but I want to find out which transcripts have a different expression pattern.

What I'm doing right now is:
For each protein:
- Get the RPKM values for each transcript in each tissue
- Sort the transcripts based on total RPKM across all tissues (so that the "reference" transcript is the one that's expressed the most)
- Fit a linear model in R: rpkm ~ tissue * transcript
- At this point I wasn't sure exactly what to do to pick out the important ones. I tried just performing ANOVA, but that seems to report that ALL proteins are significant. I also tried looking at the model summary for each protein and picking out the coefficients with a low p-value for a tissue-transcript term, but that didn't seem to give correct results either.
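
One way to look at just the tissue-by-transcript interaction, rather than the overall ANOVA, is to fit the additive part (tissue effect plus transcript effect) on the log scale and inspect what is left over. The sketch below is in Python rather than R purely for illustration. The log2(RPKM) values and the name `interaction_residuals` are made up, and it assumes a single value per tissue-transcript cell (for example a per-tissue mean), which is a simplification of the real data.

```python
from statistics import mean

# Hypothetical log2(RPKM) values: tissue -> transcript -> value.
# Y in t4 is set 3 log2 units below what an additive pattern would give.
log_rpkm = {
    "t1": {"X": 4.0, "Y": 2.0, "Z": 1.0},
    "t2": {"X": 5.0, "Y": 3.0, "Z": 2.0},
    "t3": {"X": 3.0, "Y": 1.0, "Z": 0.0},
    "t4": {"X": 6.0, "Y": 1.0, "Z": 3.0},
}


def interaction_residuals(table):
    """Residuals left after removing tissue and transcript main effects."""
    tissues = list(table)
    transcripts = list(table[tissues[0]])
    grand = mean(table[ti][tr] for ti in tissues for tr in transcripts)
    tissue_mean = {ti: mean(table[ti].values()) for ti in tissues}
    transcript_mean = {
        tr: mean(table[ti][tr] for ti in tissues) for tr in transcripts
    }
    return {
        (ti, tr): table[ti][tr] - tissue_mean[ti] - transcript_mean[tr] + grand
        for ti in tissues
        for tr in transcripts
    }


residuals = interaction_residuals(log_rpkm)
# The cell with the largest absolute residual departs most from the shared pattern.
top_cell = max(residuals, key=lambda cell: abs(residuals[cell]))
print(top_cell, round(residuals[top_cell], 2))
```

Working on the log scale matters here: a transcript that is always a fixed fraction of another is additive in log2(RPKM) but not in raw RPKM, so a model on raw values can show an "interaction" even when the proportions never change.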

So in short, I'm just wondering if there's a standard tool or pipeline for determining whether different transcripts of the same gene have different expression patterns across tissues.

modified 6.2 years ago by lkmklsmn • written 6.2 years ago by daattali

That's an interesting question, although I couldn't manage to figure out exactly what you are after. The notion of "pattern" would require some definition, I think. If the reference transcript of a protein is highly expressed in a single tissue, and not expressed at all in the others, would it be a hit? Since you use the word "pattern", I immediately thought of bi-clustering. Wouldn't that give you groups of transcripts having similar expression profiles across a wide range of samples?

modified 6.2 years ago • written 6.2 years ago by Hayssam

To simplify matters, I would just define the expression of the "reference" transcript as the "pattern" to compare other transcripts against. I realize that this is not perfect, but I figure it's a fair enough starting point. So, given the expression of the reference transcript, which transcripts deviate from that expression pattern in some tissue? I generated heatmaps to see this visually, and it looks like for many proteins, most transcripts follow a very similar expression pattern across tissues, but there are interesting cases where one transcript has a spike in a specific tissue that the other transcripts don't. This is the kind of data I want to find systematically rather than visually.

written 6.2 years ago by daattali

The word "pattern" is used a bit too liberally for me to grasp what you mean, sorry :( But I think I understood what you're after: not differential expression per se, but rather answering questions of the type "Is transcript A the reference transcript in all tissues?", where the reference transcript is defined as the transcript with the maximum RPKM level. Similarly, "by reference to transcript A, is transcript B always second across all tissues?". If that's the case, I think I found something for you.

modified 6.2 years ago • written 6.2 years ago by Hayssam

Sort of :)
Let's say this is my data (column names are transcripts, row names are tissue types):

             |   A  |   B  |   C
      brain  |  10  |   5  |   1
      liver  |  20  |  10  |   2
      lung   |   5  |   2  |  0.4
      kidney |  30  |   3  |  3.5

I'll define A as the reference transcript simply because it has the highest total RPKM (10+20+5+30).
For transcript C, you can see that while the transcript itself is expressed much less, it follows the same "pattern" - all the values are roughly 1/10 of A. For B, based on brain, liver and lung it seems like B is expressed at about half the level of A, but kidney doesn't follow that pattern - it is way underexpressed (RPKM of only 3 instead of the expected ~15).

So in this case, I would want to mathematically learn that from this dataset, transcript B at kidney is an interesting observation.
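
A minimal sketch of one way to flag that observation, in Python for illustration: take the log2 ratio of each transcript to the reference in every tissue, and flag tissues where that ratio departs from the transcript's own median ratio. The function name `flag_deviations`, the `threshold` of 1 log2 unit and the `pseudocount` parameter are assumptions made for this example, not an established method, and the sketch uses one value per tissue rather than the full set of samples.

```python
import math
from statistics import median

# Toy RPKM values from the table above (tissue -> transcript -> RPKM).
rpkm = {
    "brain":  {"A": 10, "B": 5,  "C": 1},
    "liver":  {"A": 20, "B": 10, "C": 2},
    "lung":   {"A": 5,  "B": 2,  "C": 0.4},
    "kidney": {"A": 30, "B": 3,  "C": 3.5},
}


def flag_deviations(table, threshold=1.0, pseudocount=0.0):
    """Flag (transcript, tissue) pairs whose log2 ratio to the reference
    differs from that transcript's median log2 ratio by more than threshold."""
    transcripts = sorted(next(iter(table.values())))
    totals = {t: sum(row[t] for row in table.values()) for t in transcripts}
    reference = max(totals, key=totals.get)  # highest total RPKM
    hits = []
    for t in transcripts:
        if t == reference:
            continue
        # A non-zero pseudocount is needed if any RPKM value is 0.
        log_ratio = {
            tissue: math.log2((row[t] + pseudocount) / (row[reference] + pseudocount))
            for tissue, row in table.items()
        }
        centre = median(log_ratio.values())
        for tissue, value in log_ratio.items():
            deviation = value - centre
            if abs(deviation) > threshold:
                hits.append((t, tissue, round(deviation, 2)))
    return reference, hits


reference, hits = flag_deviations(rpkm)
print(reference, hits)  # A [('B', 'kidney', -2.16)]
```

With these numbers only B in kidney is flagged, while C passes because its ratio to A stays close to 1/10 everywhere. The median is used as the "expected" ratio so that a single odd tissue does not pull the baseline toward itself.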

Hopefully this makes it a little clearer. If not, don't worry too much about it, I'll figure it out :)

modified 6.2 years ago • written 6.2 years ago by daattali

Yep it does make things clearer! Do you have replicates? Or can you group the tissues so as to have more degrees of freedom?

written 6.2 years ago by Hayssam

There are multiple samples for each transcript (coming from multiple people). However, the number of samples per tissue is not consistent. For example, there are over 300 samples from brain, but 50-100 for most other tissues.
So does the tool you mentioned help with this?

modified 6.2 years ago • written 6.2 years ago by daattali

6.2 years ago, lkmklsmn (United States) wrote:

I am assuming you performed a general ANOVA (e.g. the aov function in R). In this case your null hypothesis would be that there are no changes in expression for a given gene between any of the 25 cell types. You would expect a given gene to differ between at least 2 cell types, and therefore you end up with a high number of significant genes. You need to adjust your ANOVA in order to ask more specific questions about differential expression between cell type A and cell type B. This could simply be done with an ordinary t-test. However, you lose power that way, and you should instead incorporate the information from the other cell types in a linear contrast t-test. In general, I think what you are after is the right set of contrasts to ask specific questions about certain cell types or sets of cell types. I would encourage you to google something like "linear model contrasts" (there is an R package called 'contrast' for this). Having said this, you must be careful to transform and normalize your data appropriately before using this approach.
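
To make the idea of a linear contrast concrete, here is a minimal sketch in Python of the textbook contrast t statistic for a one-way layout. It is not the API of the R 'contrast' package; the function name `contrast_t` and the example values are made up. It assumes roughly equal variance across groups on a suitably transformed scale (e.g. log), and it copes with unequal group sizes through the 1/n terms.

```python
import math
from statistics import mean


def contrast_t(groups, weights):
    """t statistic and degrees of freedom for a linear contrast of group
    means, using the pooled within-group variance from all groups."""
    if abs(sum(weights.values())) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    means = {g: mean(values) for g, values in groups.items()}
    n_total = sum(len(values) for values in groups.values())
    df = n_total - len(groups)
    ss_within = sum(
        (x - means[g]) ** 2 for g, values in groups.items() for x in values
    )
    pooled_var = ss_within / df
    estimate = sum(weights.get(g, 0.0) * means[g] for g in groups)
    std_err = math.sqrt(
        pooled_var
        * sum(weights.get(g, 0.0) ** 2 / len(groups[g]) for g in groups)
    )
    return estimate / std_err, df


# Hypothetical log-scale expression values for one transcript.
groups = {
    "brain": [1.0, 2.0, 3.0],
    "liver": [2.0, 3.0, 4.0],
    "kidney": [6.0, 7.0, 8.0],
}
# Contrast: kidney versus the average of brain and liver.
weights = {"kidney": 1.0, "brain": -0.5, "liver": -0.5}
t_stat, df = contrast_t(groups, weights)
print(round(t_stat, 3), df)
```

Groups with weight 0 still contribute to the pooled variance, which is where the extra power over a plain two-group t-test comes from. Turning the statistic into a p-value needs the t distribution with `df` degrees of freedom, which is left out here.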

written 6.2 years ago by lkmklsmn

6.2 years ago, Damian Kao wrote:

edgeR and DESeq are the two popular packages for determining differential expression. You need raw tag counts rather than RPKM, though.
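
As a small illustration of why RPKM is not a substitute for counts, here is the usual RPKM definition (reads per kilobase of transcript per million mapped reads) sketched in Python. The function name and the numbers are made up for the example.

```python
def rpkm_from_counts(read_count, transcript_length_bp, total_mapped_reads):
    """RPKM = reads * 1e9 / (transcript length in bp * total mapped reads)."""
    return read_count * 1e9 / (transcript_length_bp * total_mapped_reads)


# Two hypothetical libraries of different depth give the same RPKM
# for a 2 kb transcript, although one rests on 10x more reads.
shallow = rpkm_from_counts(100, 2000, 10_000_000)
deep = rpkm_from_counts(1000, 2000, 100_000_000)
print(shallow, deep)
```

Both calls give the same RPKM, so the number of reads behind the value, which count-based tools rely on to model sampling noise, cannot be recovered from RPKM alone.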

More specifically, your question is how to find distinct patterns across multiple tissue types (more than 2). I am not aware of any standard tools to do this. While the significance of differential expression can be assessed with the R packages above, the magnitude of the differential expression might not be comparable among the various pair-wise differential expression tests.

written 6.2 years ago by Damian Kao

I think the real difficulty here is in the inter-library normalization step (this will apply regardless of whether you use raw counts or RPKM). The normal methods (TMM, etc.) assume that most genes aren't differentially expressed between groups. I'm not sure how true that is between some tissues, where you might imagine that actually everything is very different. I wonder if the only reliable method would be to use spike-ins and normalize to those.

modified 6.2 years ago • written 6.2 years ago by Devon Ryan

You are correct in that almost all the transcripts I'm looking at (for each given protein) have very different expression levels. For example: transcript A could have an average RPKM value of 4, while transcript B would have an average RPKM of 0.7, but they could have the same "pattern" across tissues. One thing I can think of is maybe normalizing all transcripts to have the same mean, maybe that will produce better results.

written 6.2 years ago by daattali

My "very different" comment was in reference to all transcripts of interest having, for example, a 2x difference in expression between a given set of tissues. I don't know of tissue differences like this (I've never looked), but I have seen mutation papers (something with c-myc?) where there was just transcriptional amplification between groups that would be completely missed by the normal methods. I should note that you needn't have this degree of effect for the normalization methods to fail.

modified 6.2 years ago • written 6.2 years ago by Devon Ryan

Ah, sorry for misunderstanding. The effect you're describing is also very apparent in the data though - for every protein I can clearly see that there are tissues where all the transcripts are expressed fairly high, whereas there are other tissues where all the transcripts have a proportionally lower expression.

written 6.2 years ago by daattali

Correct. I would suggest you standardize your data so that each gene has mean 0 and variance 1 across all cell types. This will definitely make your heatmap look a lot better.
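
A minimal sketch of that standardization in Python, for illustration only; the function name `standardize` and the two profiles are made up, and it uses the sample standard deviation.

```python
from statistics import mean, stdev


def standardize(values):
    """Scale one expression profile to mean 0 and sample variance 1."""
    centre = mean(values)
    spread = stdev(values)
    if spread == 0:
        # A flat profile has no variation to scale.
        return [0.0 for _ in values]
    return [(v - centre) / spread for v in values]


# Two hypothetical profiles across four tissues; the second is exactly
# one tenth of the first.
high = [10.0, 20.0, 5.0, 30.0]
low = [1.0, 2.0, 0.5, 3.0]
z_high = standardize(high)
z_low = standardize(low)
print([round(z, 3) for z in z_high])
print([round(z, 3) for z in z_low])
```

Profiles that differ only by a constant factor become identical after this step, which is why the heatmap rows line up. The same step also rescales the amplitude of every profile, so a transcript that barely varies and one that varies a lot end up on the same scale.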

written 6.2 years ago by lkmklsmn

I initially thought about also making the variances match, but I came to the conclusion that doing that will be destructive since I want to try to find differences in the variation across tissues. I could try it though, and see what it gives me

written 6.2 years ago by daattali

I agree. The simple solution is to have spike-ins. I think some of the GTEx data have spike-ins?

written 6.2 years ago by Damian Kao