Question

Splice-Aware Aligners Versus Genomic Aligners To Transcriptome

10.4 years ago

dfernan ▴ 760

Hi,

I have a question regarding the alignment of RNA-Seq data.

Let us consider the following two strategies:

(1) Genomic (unspliced) aligner to the transcriptome. E.g., use Bowtie to align the reads to a FASTA file containing all transcript sequences. This is the strategy used by several quantification and DE tools: BitSeq, RSEM, MMSEQ, etc.

(2) Splice-aware aligner. E.g., TopHat, STAR, etc.

Which is the better approach for aligning RNA-Seq data, (1) or (2)? Approach (2) is probably more principled, but if that's the case, what are the hazards and limitations of approach (1)? I can think of a trivial case for total RNA, since the reads won't come ONLY from the transcriptome, but I am more interested in polyA+ experiments, where most reads should come from the transcriptome.

I am looking for practical experience with the data as well as any theoretical considerations.

alignment • 6.2k views

ADD COMMENT • link updated 9.9 years ago by Charles Warden 8.2k • written 10.4 years ago by dfernan ▴ 760

Do you care about novel splice variants or new transcripts that may be represented? If the answer is yes, then you need a splice-aware aligner. If no, then it still depends on how reliable you believe your reference transcriptome is.

ADD REPLY • link 10.4 years ago by Chris Miller 22k

@Chris, I am using the mouse transcriptome (Ensembl, RefSeq, GENCODE, any annotation one can think of); as for its reliability, I have the feeling that is still a matter of debate... I do not care about novel splice variants, but I am trying to assess the pitfalls (or lack of pitfalls) of approach (1) vs. (2). One way to do it is to try both, then intersect the genomic alignments with the transcriptome GTF and see whether the alignments agree... To my surprise, if I do that for short single-end data, the agreement is quite weak. These are preliminary results, though... maybe someone has done a more extensive study...
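The comparison described above can be sketched in a few lines. This is only a toy illustration, not the poster's actual procedure: it assumes each strategy's alignments have already been reduced to a mapping from read ID to the set of genes hit (via a transcript-to-gene table for approach (1), and via a GTF intersection for approach (2)). The function name `agreement` and the read and gene names are hypothetical.

```python
def agreement(tx_hits, genome_hits):
    """Fraction of reads aligned by both strategies whose gene sets overlap.

    tx_hits / genome_hits: dict mapping read ID -> set of gene names.
    Reads aligned by only one strategy are ignored here.
    """
    shared = tx_hits.keys() & genome_hits.keys()
    if not shared:
        return 0.0
    agree = sum(1 for read in shared if tx_hits[read] & genome_hits[read])
    return agree / len(shared)


# Hypothetical toy data: approach (1) = transcriptome, approach (2) = genome + GTF.
tx_hits = {"r1": {"GeneA"}, "r2": {"GeneB"}, "r3": {"GeneC"}, "r4": {"GeneD"}}
genome_hits = {"r1": {"GeneA"}, "r2": {"GeneB", "GeneX"}, "r3": {"GeneZ"}, "r5": {"GeneE"}}

rate = agreement(tx_hits, genome_hits)
only_tx = tx_hits.keys() - genome_hits.keys()      # aligned by approach (1) only
only_genome = genome_hits.keys() - tx_hits.keys()  # aligned by approach (2) only
print(f"agreement on shared reads: {rate:.2f}")
print(f"only transcriptome: {sorted(only_tx)}, only genome: {sorted(only_genome)}")
```

Tracking the reads that only one strategy aligns is worth doing separately, since a low agreement rate and a large one-sided set point to different problems.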

ADD REPLY • link 10.4 years ago by dfernan ▴ 760

I am not sure whether Bowtie will be able to align a read from a transcript generated from two non-adjacent exons to a transcriptome FASTA sequence in which you have concatenated the adjacent exons for mapping purposes. I may be wrong on this.

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k

@ashutosmits Well, if it's not in the transcriptome GTF you are absolutely right, but in principle the GTF contains all the alternative variants...

ADD REPLY • link 10.4 years ago by dfernan ▴ 760

OK, got it. I didn't read that you will be using tools like RSEM. I thought you would do it yourself from scratch.

ADD REPLY • link 10.4 years ago by Ashutosh Pandey 12k

With approach 1, you are going to have multiple alignments to different transcripts for a read that aligns to a shared exon. You'll need to have a plan to deal with that situation.
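One common plan is to let an expectation-maximization (EM) step split each multi-mapping read across its candidate transcripts in proportion to their current abundance estimates. The sketch below is a deliberately simplified toy version of that idea: it ignores transcript length, fragment-size distributions, and alignment quality, all of which real quantifiers model. The function name `em_abundance` and the toy reads are hypothetical.

```python
def em_abundance(read_hits, transcripts, n_iter=50):
    """Toy EM: estimate relative abundances from multi-mapping reads.

    read_hits: list of tuples, each holding the transcripts one read aligns to.
    Returns a dict transcript -> estimated fraction of reads.
    """
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        expected = {t: 0.0 for t in transcripts}
        # E-step: split each read across its hits, weighted by current theta.
        for hits in read_hits:
            total = sum(theta[t] for t in hits)
            for t in hits:
                expected[t] += theta[t] / total
        # M-step: renormalize the expected counts into fractions.
        theta = {t: expected[t] / len(read_hits) for t in transcripts}
    return theta


transcripts = ["T1", "T2"]
# 6 reads unique to T1, 2 unique to T2, 4 from an exon shared by both.
read_hits = [("T1",)] * 6 + [("T2",)] * 2 + [("T1", "T2")] * 4

# Naive counting credits every alignment, so shared-exon reads count twice.
naive = {t: sum(t in hits for hits in read_hits) for t in transcripts}
theta = em_abundance(read_hits, transcripts)

print("naive alignment counts:", naive)
print("EM fractions:", {t: round(p, 3) for t, p in theta.items()})
```

In this toy case naive counting gives T1 and T2 a 10:6 ratio, while the EM estimate settles at 0.75 and 0.25, matching the 3:1 ratio of the uniquely mapped reads.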

ADD REPLY • link 10.4 years ago by Sean Davis 26k

@sean Thanks, good point, but I plan to use the alignments for transcript/gene expression estimation with one of the many available tools such as BitSeq, eXpress, RSEM, etc. My concern is forcing the alignments into "known" annotations and what effect that has.

ADD REPLY • link 10.4 years ago by dfernan ▴ 760

That's exactly the problem @Sean Davis mentioned. You will map one read multiple times just by using a non-spliced aligner. The number of mappable reads may therefore drop (by how much, I can't say in orders of magnitude), and so you artificially change the expression of some genes, especially if all transcript variants are present.

ADD REPLY • link 10.4 years ago by Phil S. ▴ 700

Ram · Answer 1 · 2014-05-09

Looks like there is a long comment thread with good feedback.

For my two cents, if you don't care about splicing or novel genes (so, you only care about the expression of known genes), I think the result should be similar either way. However, you need to use the right strategy.

For example, I think this blog post addresses your question:

http://cdwscience.blogspot.com/2014/02/mrna-quantification-via-express.html

You can see that Bowtie + eXpress (option #1 that you describe) is pretty similar to TopHat + Cufflinks (option #2 that you describe). However, if you simply look at counts, the results will be different (presumably worse): for example, the abundances from idxstats counts are noticeably different from those of either Bowtie + eXpress or TopHat + Cufflinks.