Question: A very short gene with very high TPM
0
gravatar for corend
2.4 years ago by
corend70
corend70 wrote:

I assembled my RNA-seq data using cufflinks. This computed FPKM for each gene assembled, and I converted it manually to TPM using: [FPKM/sum(FPKM)]*1e6

I have a gene in all replicates that reaches 600k TPM, meaning that 60% of my transcripts are coming from this gene. I checked his length, 64nt in all of my replicates. And the number of reads matching it (20), in the first replicate.

Is it possible that this transcript is creating a biais in my data? Is it normal that only few reads can create a so high TPM? Maybe miss-mapped reads are creating a huge biais?

EDIT: Some more informations about my data:

RNA-seq, polyA library, unstranded, 3 ovary replicates, 3 testis replicates from my favorite species.

I cleaned my data, removed adapters, bad quality reads etc... The FastQC report is nice.

I just blasted my short sequence and I have 2 hits, only on my species. No predicted genes from ensembl or NCBI at this position. In my 3 ovary replicates I have this overTPMed gene (600k, 400k, 400k). And it is not in my testis replicates. When I check my bams with IGV, I see reads mapping this region in testes replicates too (still fewer than in ovaries), even if they are not assembled in the cufflinks output.

I just ordered my TPM table, I saw many other genes with anormal high TPM. I filtered out assembled genes with < 200nt. And rebuild my TPM table. Now the highest gene is 17k TPM. Is it a good thing to do in that case ?

rna-seq assembly • 1.5k views
ADD COMMENTlink modified 2.4 years ago by Friederike5.7k • written 2.4 years ago by corend70

bad quality reads etc... The FastQC report is nice

What criteria was used to define "bad" quality? It is not necessary to have nice FastQC report to do further analysis. One can be overzealous in the "cleanup" that can introduce some other bias in the data.

Since you are working with reproductive organs perhaps the gene (fragment) you are seeing may really be overexpressed?

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by genomax84k

I did not asked myself those questions, I used fastQC on raw reads, seeing that quality is not optimal and that I had adapters.

Then I used a cleaner (UrQt), and running fastQC on those cleaned reads I had no low quality bases, and no adapters anymore. I assumed that my data was ok for further analyses.

I expect overexpressed genes. But a genes producing 50% of my mRNA. It is too much in my opinion. Checking the real number of reads, which was pretty low, I came here to understand what's going on.

ADD REPLYlink written 2.4 years ago by corend70
4
gravatar for Matt Shirley
2.4 years ago by
Matt Shirley9.3k
Cambridge, MA
Matt Shirley9.3k wrote:

Your observation of very short transcripts receiving a disproportionate amount of the TPM is consistent with my observations. It's clear why this is the case. TPM is roughly:

(feature counts / feature length)
/ (sum of [all feature counts / feature lengths]) * 1e6

and so for features with a small length you will get a much larger numerator even will small counts.

One way to mitigate this issue is by using TMM scaling on your matrix of TPM values to remove the contribution of extreme values, maintaining the hypothesis that most genes are not differentially expressed in an experiment.

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Matt Shirley9.3k

What I am reading right now is that TMM scaling is usable on expected read count and not on TPM values. You think that I can use TMM on my TPM matrix?

TMM should be computed in terms of expected read counts.

I tried TMM with edgeR:

Input TPM Table: o

            o1         o2      o3         t1       t2       t3
XLOC_000001 14.290500  8.78600 14.069866  1.376451 3.24118  2.40424
XLOC_000002  5.644613 25.43497 13.819135 13.292400 9.45567  9.58778
XLOC_000003 10.140201  9.85379  7.777245  9.308872 8.46203  8.29616

d = DGEList(counts=o)
d = calcNormFactors(d,method="TMM")
counts.per.m <- as.data.frame(cpm(d, normalized.lib.sizes=TRUE))

Output TMM Table: counts.per.m

            o1         o2       o3        t1        t2       t3
XLOC_000001 17.001266  8.05838  7.289297  1.285580  4.84603  3.0483
XLOC_000002  6.715847 23.32895  7.159485 12.413405 14.12459 12.1217
XLOC_000003 12.072991  9.03748  4.029581  8.692491 12.66794 10.5005

I checked if I still have those superhigh TPM values, and yes, my anormal TPM values are still anormal in TMM table.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by corend70
1

I should have been more clear. The issue I solve by using TMM scaling is that because TPM is a unity measure (the sum of all values must be 1e6), highly expressed transcripts/genes in a single sample will cause all other genes to appear down regulated, when in fact they are not. TMM normalization scaling factors will actually make your issue worse, by scaling highly expressed genes to an even higher value. In your case it may be that features smaller than the insert size of your library cannot be accurately measured using TPM, and you may have better luck using counts data as input for differential expression analysis. From the paper I linked above:

The number of tags expected to map to a gene is not only dependent on the expression level and length of the gene, but also the composition of the RNA population that is being sampled. Thus, if a large number of genes are unique to, or highly expressed in, one experimental condition, the sequencing 'real estate' available for the remaining genes in that sample is decreased. If not adjusted for, this sampling artifact can force the DE analysis to be skewed towards one experimental condition.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Matt Shirley9.3k
2
gravatar for Friederike
2.4 years ago by
Friederike5.7k
United States
Friederike5.7k wrote:

Does your favorite species have an annotated transcriptome?

The reason I ask is because I think I remember that Cufflinks has a problem with single-exon, short transcripts (see, for example, this paper, from which this quote was taken:

All transcripts with length less than 200nt were extracted, and their corresponding accuracy metrics were calculated using counts and TPM values respectively (...) The Pearson coefficient was negative for Cufflinks (...).

(from the legend of Supplemental Figure 4))

So, maybe it's worth taking the generally more recommend route of kallisto/salmon + tximport.

In addition, I highly recommend you visually check the reads that align to that locus, e.g. using IGV

ADD COMMENTlink modified 2.4 years ago • written 2.4 years ago by Friederike5.7k

My species does have an annotated transcriptome. But it is not comprehensive. I would to discover new transcript, and get the TPM expression for all of them. And on the one hand, use kallisto + tximport to find DE genes on my new transcriptome. And on the other hand have the TPM values for all genes to assess which one are the more expressed.

Is there another way to get TPM values from those newly assembled transcripts without using cufflinks?

I checked my BAMS with IGV, and there are few reads mapping those short regions (around 20reads for 60bp).

ADD REPLYlink written 2.4 years ago by corend70

Is there another way to get TPM values from those newly assembled transcripts without using cufflinks?

Does cufflinks return the gene models, e.g. in gtf file format that you could use to build a transcriptome index for kallisto or to use with featureCounts?

Apart from that, there seems to be no shortage of transcriptome assemblers, I vaguely remember that RSEM tends to do well in benchmark papers.

I have no experience with transcriptome assembly, but I assume that the good old rule "the more the better" holds true as well, so you may want to consider using all the reads from all your samples and generate a transcriptome model that you then use with the usual reference-based methods. There may be good reasons not to do that, so reading up a bit more on the intricacies of transcriptome assembly may be warranted.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Friederike5.7k
1
gravatar for h.mon
2.4 years ago by
h.mon29k
Brazil
h.mon29k wrote:

Is it possible that this transcript is creating a biais in my data?

Yes, it is possible. Due to its high counts, the other transcripts will be less abundant in your sequencing, and there will be less power to estimate their abundance. And if this transcript is present in some samples and not others, this may create a bias which will affect differential expression analyses as well.

Is it normal that only few reads can create a so high TPM?

It is possible, though not common. However, we know nothing about your experiment, so we don't have enough information to give reasonable answers.

Maybe miss-mapped reads are creating a huge biais?

Blast the short contig to see if you can find its origin. Maybe it is adapters? Maybe ribosomal RNA? You didn't tell us what pre-processing steps you performed, nor the kind of sample / tissue / library you have.

ADD COMMENTlink written 2.4 years ago by h.mon29k

Just edited my post, thanks

ADD REPLYlink written 2.4 years ago by corend70
0
gravatar for harold.smith.tarheel
2.4 years ago by
United States
harold.smith.tarheel4.5k wrote:

The only short sequence string that's likely to present in 60% of reads is the adapter.

ADD COMMENTlink written 2.4 years ago by harold.smith.tarheel4.5k
1

I don't understand. 60% of my reads are not mapping to this gene. Only few of them are. But after TPM calculation it is 600k.

Am I wrong?

ADD REPLYlink written 2.4 years ago by corend70
1

You're correct. The TPM feature length scaling is taking a few reads and assigning most of the TPM space to them.

ADD REPLYlink written 2.4 years ago by Matt Shirley9.3k

I'm not convinced it's solely a problem of TPM since that should take care of the relative abundance issue. 20 reads for 60 bp does not seem like it should yield a massive enrichment, but my gut feeling could, of course, be completely off (since the result does depend on the rest of the transcriptome, sequencing depth etc.)

In fact, you could calculate the TPM yourself based on the number of reads that you see in IGV etc. (e.g., following Harold Pimentel's R code)

I think the issue is that Cufflinks does some magic when calculating its FPKMs which seems to yield a weird result for the described cases.

ADD REPLYlink modified 2.4 years ago • written 2.4 years ago by Friederike5.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1583 users visited in the last hour