Question

Is transcriptome assembly a necessary step in RNA-seq?

1

4.0 years ago

tianshenbio ▴ 170

In my RNA-seq analysis, I use HISAT2 to map my clean RNA-seq reads to the genome. Then I use featureCounts to get the read count matrix for the 'gene' features in my GFF file, and I perform DE analysis with DESeq2.

I noticed that some people use StringTie to assemble the transcripts after HISAT2 and then perform DE analysis. I feel that:

1. If I skip transcriptome assembly, what I am actually counting is the abundance of the PE reads/fragments mapped to the 'gene' in the GFF file.
2. If I perform transcriptome assembly, what I am actually counting is the abundance of the assembled transcripts derived from the 'gene' in the GFF file.

Well, it seems that what I've been doing is wrong... I feel that the abundance of reads/fragments does not reveal the true level of gene expression, since it also depends on gene length. For instance, suppose I have a gene with a single exon that is expressed only once, and the RNA-seq read length is 150 bp.

Case 1: If the gene is 150 bp, I would have one read/fragment mapped to it.
Case 2: If the gene is 300 bp, I would have two mapped to it.

The read count would be two in case 2, but there is actually only one transcript, just a longer one than in the first case. If transcripts are assembled, the gene would be counted as one in both cases.
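The arithmetic in the two cases can be made concrete with a small sketch. It assumes an idealized model in which the expected fragment count is proportional to the number of transcript copies times the gene length; the function names and numbers are hypothetical and not taken from any tool. Under that assumption, raw counts of different genes are not comparable without length normalization, while the ratio of counts for the same gene between two conditions does not depend on its length, and that within-gene comparison is what a count-based DE analysis such as DESeq2 makes.

```python
FRAGMENT_LENGTH_BP = 150  # read/fragment length used in the example above


def expected_fragments(copies, gene_length_bp):
    """Idealized fragment count: proportional to copies x gene length."""
    return copies * gene_length_bp / FRAGMENT_LENGTH_BP


def fragments_per_kb(count, gene_length_bp):
    """Length-normalized count (fragments per kilobase of gene)."""
    return count * 1000 / gene_length_bp


# Hypothetical genes: same expression (1 copy), different lengths.
short_count = expected_fragments(copies=1, gene_length_bp=150)  # 1.0
long_count = expected_fragments(copies=1, gene_length_bp=300)   # 2.0

# Across genes: raw counts differ, length-normalized values agree.
short_norm = fragments_per_kb(short_count, 150)
long_norm = fragments_per_kb(long_count, 300)

# Within a gene: tripling expression triples the count at any length,
# so the between-condition ratio is independent of gene length.
short_ratio = expected_fragments(3, 150) / short_count  # 3.0
long_ratio = expected_fragments(3, 300) / long_count    # 3.0
```

In this toy model the length factor cancels in the within-gene ratio, so length only matters when comparing expression between different genes.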

But I do see a lot of people using the same pipeline as I do (HISAT2, featureCounts, DESeq2). I wonder if anyone could help clarify this?

RNA-Seq Assembly genome sequencing alignment • 1.7k views

ADD COMMENT • link updated 4.0 years ago by Istvan Albert 100k • written 4.0 years ago by tianshenbio ▴ 170

0

If you are working with a well-annotated model organism and you are not specifically interested in new transcripts, then no, it is not necessary. I think it is not even recommended, as you probably have a higher chance of getting false positives than actual new and meaningful transcripts.

ADD REPLY • link 4.0 years ago by ATpoint 82k

0

Hi. But like I mentioned, if assembly is not done, the count I get would be the abundance of reads mapped to the gene feature, right? That would be length-dependent, since longer transcripts would simply have more reads regardless of expression level.

ADD REPLY • link 4.0 years ago by tianshenbio ▴ 170

score 3 · Accepted Answer · 2020-04-22

4.0 years ago

Istvan Albert 100k

It all depends on the quality of your existing annotations and the goals of the project.

Transcriptome assembly is far from being a reliable process; depending on the situation, you could get a large number of either incorrectly assembled or missed transcripts.

Note that when you use HISAT2 you are (or should be) mapping against the genome, not the transcriptome, so neither of your enumerations will capture reality. What you would actually be counting is reads mapping to the genome, out of which you still have to figure out which transcript they belong to.

featureCounts is not the best tool to redistribute reads over transcripts. If you want to use transcripts, you should classify with salmon or other tools that classify or align against the transcripts directly; no GTF should be required.
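To illustrate why genome-mapped reads still have to be assigned to transcripts, here is a minimal sketch with a hypothetical gene that has two isoforms sharing their first exon. The coordinates, transcript names, and the helper `compatible_transcripts` are made up for illustration and simplified to unspliced reads. The sketch only shows the ambiguity; it does not reproduce how salmon or any other tool resolves it.

```python
def compatible_transcripts(read, transcripts):
    """Return names of transcripts with an exon that fully contains the read.

    read: (start, end) genomic interval, end-exclusive.
    transcripts: dict mapping transcript name -> list of (start, end) exons.
    Spliced reads are not handled in this simplified sketch.
    """
    start, end = read
    return sorted(
        name
        for name, exons in transcripts.items()
        if any(ex_start <= start and end <= ex_end for ex_start, ex_end in exons)
    )


# Hypothetical gene: two isoforms that share the first exon.
transcripts = {
    "tx1": [(100, 300), (500, 700)],
    "tx2": [(100, 300), (900, 1100)],
}

shared_read = (150, 250)  # falls in the exon common to both isoforms
unique_read = (550, 650)  # falls in an exon present only in tx1

shared_hits = compatible_transcripts(shared_read, transcripts)  # ['tx1', 'tx2']
unique_hits = compatible_transcripts(unique_read, transcripts)  # ['tx1']
```

The read in the shared exon is compatible with both isoforms, so a simple overlap count cannot say which transcript it came from. At the gene level both reads belong to the same gene, so the ambiguity does not arise.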

ADD COMMENT • link 4.0 years ago by Istvan Albert 100k

0

Hi, I am working on a non-model organism, but the genome is quite well annotated. Actually, I don't care about different transcripts. What I hope to achieve is simply to examine the overall expression level of all genes registered in my GFF file, not the individual transcripts. In that case, I don't need to assemble transcripts prior to featureCounts, right?

ADD REPLY • link 4.0 years ago by tianshenbio ▴ 170

0

I don't consider featureCounts to be the right tool for transcript-level analysis in the first place.

It cannot properly distinguish between transcripts, so your transcript-level counts will always end up incorrect.

featureCounts should be used for gene-level analysis, in which case, especially since, as you say, the genome is well annotated, there is no need to assemble transcripts. If the genome is already well annotated, it is unlikely that you would discover a brand-new transcript that makes use of previously unannotated exons.