Question

Does Tophat Use The Library-Type Information For Mapping, Or Just For The Xs Flag?

4

12.2 years ago

gaelgarcia05 ▴ 280

When I specify --library-type to TopHat (fr-firststrand, fr-secondstrand, or fr-unstranded), TopHat sets the XS:A tag of each alignment to + or -, which is useful for subsequent analyses such as annotation.

However, does this information actually influence the "mappability" of reads, or is this unaffected?

My thinking is that the information would be used when mapping reads to the transcriptome, if a GTF file is supplied with -G.

In that case, read pairs should be concordant with the transcript strand. For example, if --library-type fr-firststrand was specified and transcript A lies at coordinates X to Y on the + strand, mate 1 of a pair should map as the reverse complement of the transcript, towards its 3' end, and mate 2 should map towards the 5' end, on the same strand as the transcript sequence.

However, if no GTF is supplied with -G, or in the later stage where reads that did not map to the transcriptome are mapped to the whole genome, TopHat should make no use of library-type information, right?
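To make the expected orientation concrete, here is a minimal Python sketch of the rule described above. It is only an illustration of the library-type conventions (the function name `expected_mate_strands` is made up for this example); it is not taken from TopHat's source.

```python
# Illustrative only: expected alignment strand of each mate in a proper pair,
# given the strand of the transcript and the declared library type.
# This is NOT TopHat code; the function name is hypothetical.

FLIP = {"+": "-", "-": "+"}

def expected_mate_strands(library_type, transcript_strand):
    """Return (mate1_strand, mate2_strand), or None if the library is unstranded."""
    if library_type == "fr-firststrand":
        # Mate 1 is antisense to the transcript, mate 2 is sense.
        return FLIP[transcript_strand], transcript_strand
    if library_type == "fr-secondstrand":
        # Mate 1 is sense to the transcript, mate 2 is antisense.
        return transcript_strand, FLIP[transcript_strand]
    if library_type == "fr-unstranded":
        # No constraint: either orientation is acceptable.
        return None
    raise ValueError("unknown library type: " + library_type)

print(expected_mate_strands("fr-firststrand", "+"))
```

For a + strand transcript and fr-firststrand, this gives mate 1 on the - strand and mate 2 on the + strand, which matches the expectation in the paragraph above.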

    --library-type     

TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol.

From the TopHat manual:

Since the splice junction finding algorithm of TopHat makes use of library-type information (if provided), one of the two TopHat runs would result in many more splice junctions than the other one. You can then use the library type that gives more junctions. If this is not the case TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types.
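The comparison suggested by the manual could be scripted roughly as follows. This is a sketch under assumptions: it presumes each TopHat run wrote a `junctions.bed` file with one junction per line plus an optional `track` header line, and the helper names and example paths are hypothetical.

```python
# Sketch: compare the number of junctions reported by two TopHat runs that
# differ only in --library-type. Helper names and paths are hypothetical.

def count_junctions(bed_path):
    """Count junction records in a BED file, skipping blank and 'track' lines."""
    n = 0
    with open(bed_path) as handle:
        for line in handle:
            if line.strip() and not line.startswith("track"):
                n += 1
    return n

def pick_library_type(junction_files):
    """junction_files maps a library-type label to the path of its junctions.bed."""
    counts = {lib: count_junctions(path) for lib, path in junction_files.items()}
    best = max(counts, key=counts.get)
    return best, counts

# Example call (paths are placeholders for your own output directories):
# best, counts = pick_library_type({
#     "fr-firststrand": "run_firststrand/junctions.bed",
#     "fr-secondstrand": "run_secondstrand/junctions.bed",
# })
```

A large difference between the two counts is what the manual describes as the expected outcome; a small difference would fall under its "TopHat might not work well with your sequencing protocol" caveat.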

So this indicates that the strandedness argument does influence the mapping algorithm. But how does TopHat use library-type information in its splice junction finding algorithm, if it has to be unbiased about which strand the actual transcripts come from?

tophat rna-seq rnaseq mapping reads • 5.8k views

ADD COMMENT • link updated 11.9 years ago by Kanne ▴ 450 • written 12.2 years ago by gaelgarcia05 ▴ 280

score 1 · Answer 1 · 2013-05-16

1

12.2 years ago

Ashutosh Pandey 12k

Specifying the correct library type will ensure that the paired reads are mapped correctly, and it should increase mappability (if by that you mean the ability to align the reads).

0

Thanks, ashutosmits! Yes, that is what I meant. However, I can't see how this would influence the ability to align reads. Does it have to do with the GTF file supplied with -G? Otherwise, I don't see how TopHat could make assumptions about what should map to the + or - strand.

ADD REPLY • link 12.2 years ago by gaelgarcia05 ▴ 280

score 1 · Answer 2 · 2013-09-09

I understand your confusion, and this thought just occurred to me. I haven't spent very long thinking about it, so maybe I'm forgetting something, but here's a suggestion anyway:

TopHat is a spliced read aligner. If you do not supply -G, it will still align reads across splice junctions; it will just have to work out the splice junctions de novo. TopHat uses the canonical donor/acceptor sequences when it defines splice sites. Hence, if you specify that your libraries are strand-specific, it would make sense for TopHat to look for the canonical donor/acceptor sequences only in the read that represents the RNA transcript, and for the reverse complement of the canonical donor/acceptor sites in the other read, and to ignore any canonical splice sequences on the biologically irrelevant strand.

If you specify that your library is not strand-specific, it will need to look for the donor/acceptor sites and their reverse complements in both reads, since it can't be sure which strand the transcript originated from. If my thought is correct, then with a strand-specific library that is specified as such you would end up with fewer opportunities for false positive junctions, and presumably a faster run time too.
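Here is a toy Python sketch of that idea, just to show why strand information would shrink the search space. It is my own illustration of the hypothesis, not how TopHat is actually implemented: it scans a genomic sequence for intervals flanked by the canonical GT...AG motif (transcript on the + strand) or its reverse complement CT...AC (transcript on the - strand). The function name, length limits, and sequence are all made up.

```python
# Toy illustration of strand-aware canonical splice-site search.
# NOT TopHat's algorithm; names, limits, and the sequence are hypothetical.

# Intron boundaries as seen on the forward genomic strand:
#   transcript on '+': intron starts with GT and ends with AG
#   transcript on '-': intron starts with CT and ends with AC (reverse complement)
MOTIFS = {"+": ("GT", "AG"), "-": ("CT", "AC")}

def candidate_introns(genome, strands, min_len=6, max_len=50):
    """Return (start, end, strand) for every interval genome[start:end]
    bounded by the canonical motif of one of the allowed transcript strands."""
    hits = []
    for strand in strands:
        left, right = MOTIFS[strand]
        for start in range(len(genome) - 1):
            if genome[start:start + 2] != left:
                continue
            last_end = min(len(genome), start + max_len)
            for end in range(start + min_len, last_end + 1):
                if genome[end - 2:end] == right:
                    hits.append((start, end, strand))
    return hits

toy = "AAAGTCCCCAGTTTCTGGGGACAAA"
unstranded = candidate_introns(toy, strands=("+", "-"))  # must consider both
stranded = candidate_introns(toy, strands=("+",))        # strand known to be '+'
print(len(unstranded), len(stranded))
```

With the strand known, only one motif pair has to be searched, so the set of candidate junctions (and with it the room for false positives) is smaller than in the unstranded case.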