Question: Is transcriptome assembly a necessary step in RNA-seq?
10 months ago by
tianshenbio70 wrote:

In my RNA-seq analysis, I use HISAT2 to map my clean RNA-seq reads to the genome. Then I use featureCounts to get the read count matrix for the 'gene' features in my gff file, and perform DE analysis using DESeq2.

I noticed that some people use StringTie to assemble the transcripts after HISAT2, and then perform DE analysis. I feel that:

1. If I skipped transcriptome assembly, what I am actually counting would be the abundance of the PE reads/fragments mapped to the 'gene' in the gff file. 
2. If I perform transcriptome assembly, what I actually count would be the abundance of the assembled transcripts derived from the 'gene' in the gff file.

Well, it seems that what I've been doing is wrong... I feel that the abundance of reads/fragments does not reveal the true level of gene expression, since it also depends on gene length. For instance, suppose I have a gene with a single exon that is expressed only once, and assume the RNA-seq read length is 150 bp:

Case 1: If the gene is 150 bp, I would have one read/fragment mapped to it.
Case 2: If the gene is 300 bp, I would have two reads/fragments mapped to it.

The read count would be two in Case 2, but it's actually only one transcript, just longer than the one in Case 1. If transcripts are assembled, the gene would be counted as one in both cases.
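
The arithmetic in the two cases can be written out as a minimal sketch. The non-overlapping tiling model, the function names, and the copy numbers below are illustrative assumptions, not real sequencing behaviour. It shows that raw counts do scale with gene length when comparing different genes, but that the length term is identical on both sides when the same gene is compared between two samples, which is the comparison a DE analysis makes.

```python
# Toy model of the two cases above; all numbers are illustrative assumptions.
READ_LEN = 150

def expected_fragments(gene_len_bp, n_transcripts, read_len=READ_LEN):
    """Fragments expected if each transcript copy is tiled by non-overlapping reads."""
    return n_transcripts * (gene_len_bp // read_len)

def per_kb(count, gene_len_bp):
    """Length-normalised count (reads per kilobase of gene)."""
    return count * 1000 / gene_len_bp

# Between genes: raw counts differ although each gene has one transcript copy.
case1 = expected_fragments(150, 1)
case2 = expected_fragments(300, 1)

# Same gene in two samples: the length factor is in both counts and cancels.
control = expected_fragments(300, 10)
treated = expected_fragments(300, 20)
fold_change = treated / control

print(case1, case2, per_kb(case1, 150), per_kb(case2, 300), fold_change)
```

Under these assumptions the per-kilobase values of the two cases agree, and the fold change for the 300 bp gene reflects only the change in copy number.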

But I do see a lot of people using the same pipeline as I do (HISAT2, featureCounts, DESeq2). I wonder if anyone could help clarify this?

modified 10 months ago by Istvan Albert ♦♦ 86k • written 10 months ago by tianshenbio

If you are working with a well-annotated model organism and you are not specifically interested in new transcripts, then no, it is not necessary. I think it is not even recommended, as you probably have a higher chance of getting false positives than actual new and meaningful transcripts.

modified 10 months ago • written 10 months ago by ATpoint

Hi. But like I mentioned, if assembly is not done, the count I get would be the abundance of reads mapped to the gene feature, right? Then that would be length dependent, since longer transcripts would simply have more reads regardless of expression level.

modified 10 months ago • written 10 months ago by tianshenbio
10 months ago by
Istvan Albert ♦♦ 86k
University Park, USA
Istvan Albert ♦♦ 86k wrote:

It all depends on the quality of your existing annotations and the goals of the project.

Transcriptome assembly is far from being a reliable process. Depending on the situation, you could get a large number of either incorrectly assembled or missed transcripts.

Note that when you use HISAT2 you are (or should be) mapping against the genome, not the transcriptome, so neither of your enumerations will capture reality. What you would actually be counting is reads mapping to the genome, out of which you still have to figure out which transcript each one belongs to.

featureCounts is not the best tool to redistribute reads over transcripts. If you want to work with transcripts, you should classify with salmon or another tool that classifies or aligns reads against the transcripts directly; no GTF should be required.
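
To illustrate what "redistributing reads over transcripts" involves, here is a minimal sketch of a textbook expectation-maximisation loop over read compatibility sets. It is not salmon's actual implementation: the function name, transcript names, and read data are hypothetical, and effective-length and bias corrections are deliberately left out.

```python
def em_abundance(read_compat, transcripts, n_iter=100):
    """Estimate relative transcript abundances from read compatibility sets.

    read_compat: list of sets; each set holds the transcripts a read could come from.
    Effective-length correction is omitted to keep the sketch short.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read among its compatible transcripts in proportion to theta.
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the fractional counts.
        n = sum(counts.values())
        theta = {t: counts[t] / n for t in transcripts}
    return theta

# Hypothetical data: 6 reads unique to tx1, 2 unique to tx2, 8 compatible with both.
reads = [{"tx1"}] * 6 + [{"tx2"}] * 2 + [{"tx1", "tx2"}] * 8
est = em_abundance(reads, ["tx1", "tx2"])
print(est)
```

The ambiguous reads end up shared according to the evidence from the uniquely assignable reads, which is something a simple overlap counter does not attempt.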

written 10 months ago by Istvan Albert ♦♦ 86k

Hi, I am working on a non-model organism, but the genome is quite well annotated. Actually, I don't care about different transcripts. What I hope to achieve is simply to examine the overall expression level of all genes registered in my gff file, not the individual transcripts. In that case, I don't need to assemble transcripts prior to featureCounts, right?

written 10 months ago by tianshenbio

I don't consider featureCounts to be the right tool for transcript-level analysis in the first place.

It does not have the ability to properly distinguish between transcripts, so your reads will always end up counted incorrectly.

featureCounts should be used for gene-level analysis, in which case, especially since the genome is well annotated as you say, there is no need to assemble transcripts. If the genome is already well annotated, it is unlikely that you would discover a brand new transcript that makes use of previously unannotated exons.
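
The difference between gene-level and transcript-level counting can be sketched with a toy gene. The coordinates, isoform names, and reads are hypothetical, and reads are treated as simple unspliced intervals. This is a simplified overlap test, not featureCounts' actual assignment logic.

```python
# Hypothetical gene with two isoforms; coordinates are half-open (start, end).
TRANSCRIPTS = {
    "tx1": [(100, 200), (300, 400)],              # exons 1 and 2
    "tx2": [(100, 200), (300, 400), (500, 600)],  # exons 1, 2 and 3
}

def overlaps(read, exon):
    """True if two half-open intervals share at least one base."""
    return read[0] < exon[1] and exon[0] < read[1]

def compatible_transcripts(read):
    """Isoforms with at least one exon overlapping the read."""
    return {tx for tx, exons in TRANSCRIPTS.items()
            if any(overlaps(read, e) for e in exons)}

reads = [(120, 170), (310, 360), (330, 380), (520, 570)]

# Gene level: a read counts once if it overlaps any exon of the gene.
gene_count = sum(1 for r in reads if compatible_transcripts(r))
# Transcript level: reads in shared exons cannot be assigned by overlap alone.
ambiguous = sum(1 for r in reads if len(compatible_transcripts(r)) > 1)

print(gene_count, ambiguous)
```

Every read gets a well-defined gene-level assignment here, while most of them are ambiguous between the two isoforms.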

written 10 months ago by Istvan Albert ♦♦ 86k

Thank you for your clarification!

written 10 months ago by tianshenbio