Question

Splice aware aligner - what does it mean?

18

8.2 years ago

jonasmst ▴ 410

Hi, I'm wondering what exactly is the meaning of an aligner being "splice aware". I know it has to do with the mapping of reads spanning splice junctions, but as someone pretty new to RNA-seq and molecular biology, that's not quite enough for me to grasp the concept.

The following is my best reasoning of its meaning. RNA-seq reads are derived from mature mRNA, so there are typically no introns in the sequence. But aligners map reads against a reference genome, which does contain introns. So a read may span what are, in the actual transcript, two adjacent exons, while in the reference the first exon is followed by an intron. The aligner would then find a match for only the part of the read that lies in the first exon; the rest of the read would not match the intron in the reference, so the read can't be properly aligned. A splice-aware aligner would know not to try to align RNA-seq reads to introns, and would somehow identify possible downstream exons and try to align to those instead, ignoring introns altogether.

Is this anywhere close to the meaning of splice-aware? And if so, would a splice-unaware aligner properly align RNA-seq data, given a reference transcriptome?

RNA-Seq alignment splicing • 22k views

Updated 8.2 years ago by Chris Cole ▴ 800 • written 8.2 years ago by jonasmst ▴ 410

4

8.2 years ago

Chris Cole ▴ 800

Manuel's answer is pretty much spot on.

The only thing worth adding is that aligning purely to the transcriptome limits you to what has been annotated as a gene/transcript. You will not find any new splice sites or unannotated genes. This is less of a concern in well-studied model organisms (although see this paper). For anything that isn't human, mouse, yeast, fly or worm, I wouldn't recommend aligning to the transcriptome only. Despite this, the newer pseudo-alignment methods (e.g. Kallisto and Salmon) only work with the transcriptome.

The choice of RNA-seq library also matters. With ribominus libraries you get many intronic reads generated from pre-mRNA molecules in your samples.

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by Chris Cole ▴ 800

Ram · Accepted Answer · 2016-02-04

16

8.2 years ago

Manuel Landesfeind ★ 1.4k

I think that pretty much hits the point. :) The major problem is that introns not only vary in length but can also be very long. A DNA-DNA aligner (i.e., splice-unaware) would have to introduce a long gap in the mapping of a read to span an intron. This is not desired for DNA read mapping and might lead to false mappings. Without transcript information/restriction, it is quite likely that the aligner is able to find a genome sequence that matches the remaining read sequence. But that can be at any position downstream...
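To make the gap problem concrete, here is a minimal toy sketch in Python. It is not how any real aligner works: the sequences, the function names and the `min_anchor` parameter are all made up for illustration, and only exact matches are considered. It shows a junction-spanning read that cannot be placed as one ungapped block on the genome, but can be placed once the read is allowed to split across an intron. The intron is written with the `N` (skipped region) operator of a SAM CIGAR string.

```python
# Toy genome: exon1 + intron + exon2 (all sequences are invented).
EXON1 = "ACGTACGGAT"
INTRON = "GTAAGTTTTTCCCCTTTTAG"
EXON2 = "CCATGGCATT"
genome = EXON1 + INTRON + EXON2

# A read from the mature mRNA: last 5 bases of exon1 + first 5 of exon2.
read = EXON1[-5:] + EXON2[:5]


def find_contiguous(genome: str, read: str) -> int:
    """Splice-unaware: 0-based start of an exact ungapped match, or -1."""
    return genome.find(read)


def find_spliced(genome: str, read: str, min_anchor: int = 3):
    """Splice-aware (toy): try every split point of the read and look for
    the left part followed, further downstream, by the right part.
    Returns (0-based start, CIGAR) for the first hit, or None."""
    for split in range(min_anchor, len(read) - min_anchor + 1):
        left, right = read[:split], read[split:]
        left_start = genome.find(left)
        while left_start != -1:
            left_end = left_start + len(left)
            # Require at least one skipped base between the two parts.
            right_start = genome.find(right, left_end + 1)
            if right_start != -1:
                intron_len = right_start - left_end
                return left_start, f"{len(left)}M{intron_len}N{len(right)}M"
            left_start = genome.find(left, left_start + 1)
    return None


print(find_contiguous(genome, read))  # no ungapped placement exists
print(find_spliced(genome, read))     # placement with a 20-base skip
```

Note that the toy version simply takes the first downstream match for the second half of the read, which is exactly the "any position downstream" risk described above. Real splice-aware aligners constrain this with things like splice-site signals, maximum intron length and annotation.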

Regarding your second question, the splice-aware Tophat internally uses the splice-unaware Bowtie. So yes, in principle it is possible to use a splice-unaware aligner for mapping RNA-Seq reads to a transcriptome. However, using a dedicated RNA-Seq read mapper or RNA-Seq read-mapping workflow might give you better results by taking care of possible caveats etc. (read the papers for this ;-) ).

PS: See also "Is it ok to map RNA-seq reads on prokaryotic reference genome with bowtie2?"

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by Manuel Landesfeind ★ 1.4k

8

To extend this...

Splice-aware aligners are not necessary when aligning to a transcriptome, only when aligning to a genome. A "splice-unaware" aligner will do a perfectly fine job of aligning to a transcriptome, with one caveat:

Transcriptomes of alternatively-spliced organisms (basically, Eukaryota) are both incomplete (since not all transcripts have been identified), and highly redundant (since transcripts have multiple isoforms). Both of these cause problems with all aligners. It's only one caveat, though, because splice-aware aligners encounter the same problems.
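A small Python sketch of both halves of that caveat, using an invented three-exon gene with two annotated isoforms (all names and sequences are hypothetical, and plain substring search stands in for an ungapped aligner). Reads from a shared exon hit several transcripts (redundancy), while a read from something unannotated hits nothing at all (incompleteness).

```python
# Invented exons and a toy "annotated transcriptome" built from them.
exons = {"e1": "ACGTACGGAT", "e2": "TTGACCTGAA", "e3": "CCATGGCATT"}
transcripts = {
    "isoform_A": exons["e1"] + exons["e2"] + exons["e3"],
    "isoform_B": exons["e1"] + exons["e3"],  # skips e2
}


def hits(read: str, transcripts: dict) -> list:
    """Names of all transcripts that contain the read as an exact,
    ungapped match (what a splice-unaware aligner can do)."""
    return sorted(name for name, seq in transcripts.items() if read in seq)


shared_read = "GTACGG"        # lies entirely inside e1
junction_read = "CGGATCCATG"  # spans the e1-e3 junction
novel_read = "GGGTTTAAAC"     # from a transcript that is not annotated

print(hits(shared_read, transcripts))    # matches both isoforms
print(hits(junction_read, transcripts))  # matches isoform_B only
print(hits(novel_read, transcripts))     # matches nothing
```

No gapped alignment is needed for the junction read here, because the junction is already "pre-spliced" in the transcript sequence. That is why a splice-unaware aligner is enough for a transcriptome.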

If you align to a genome, which I always recommend, splice-aware aligners are required. The main advantage of aligning to a transcriptome is speed; genome alignment is much more scientifically valuable, as it starts with fewer assumptions.

Note that I say this as someone who has developed a high-speed tool for quantifying transcript expression (Seal). It is probably 100x faster than BBMap (a splice-aware aligner) in most cases, and it does a very good job at quantifying expression differences. But, it presumes that your transcriptome is accurate, which it never is. Essentially, it forces your data into a mold that you know is wrong, while BBMap would actually allow you to discover new things, assuming that the genome is correct. Genomes are far more complete and accurate than transcriptomes.

If all you want to know is whether gene A or B is more upregulated in your experiment, then mapping to a transcriptome using any aligner is fine... but you could accomplish the same thing faster and probably more accurately using a kmer-matching tool like Seal. However, if you want to seriously study what is going on and care about differential splicing, you need to map to the full genome using a splice-aware aligner.
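For readers unfamiliar with the idea of k-mer matching, a rough Python sketch follows. This is not Seal's actual algorithm or interface; the function names, the two toy "genes", the reads and the choice of k = 5 are all assumptions made for illustration. Each read is assigned to whichever reference sequence shares the most k-mers with it, without computing any alignment.

```python
from collections import Counter, defaultdict


def kmers(seq: str, k: int):
    """Yield every overlapping k-mer of seq."""
    return (seq[i:i + k] for i in range(len(seq) - k + 1))


def build_index(transcripts: dict, k: int) -> dict:
    """Map each k-mer to the set of transcripts that contain it."""
    index = defaultdict(set)
    for name, seq in transcripts.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def assign(read: str, index: dict, k: int) -> str:
    """Assign a read to the transcript sharing the most k-mers with it.
    Ties are reported as 'ambiguous', no shared k-mer as 'unassigned'."""
    votes = Counter()
    for km in kmers(read, k):
        for name in index.get(km, ()):
            votes[name] += 1
    if not votes:
        return "unassigned"
    best = max(votes.values())
    winners = [name for name, count in votes.items() if count == best]
    return winners[0] if len(winners) == 1 else "ambiguous"


# Invented reference sequences and reads.
transcripts = {
    "gene_A": "ACGTACGGATCCATGGCATT",
    "gene_B": "TTGACCTGAAGGCTAGCTAA",
}
reads = ["GTACGGATCC", "CATGGCATT", "CCTGAAGGCT", "GGGTTTAAAC"]

K = 5
index = build_index(transcripts, K)
counts = Counter(assign(r, index, K) for r in reads)
print(dict(counts))
```

The last read shares no k-mer with either reference, so it is simply dropped from the counts. That is the "mold" problem in miniature: anything absent from the transcriptome is invisible to this kind of tool.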

Updated 4.3 years ago by Ram 43k • written 8.2 years ago by Brian Bushnell 20k