Question: Does Tophat Use The Library-Type Information For Mapping, Or Just For The Xs Flag?
gaelgarcia05 (UK) wrote, 5.5 years ago:

When I specify --library-type to TopHat (fr-firststrand, fr-secondstrand, or fr-unstranded), TopHat sets the XS:A tag of each alignment to + or -, which is useful for subsequent analyses, such as annotation.

However, does this information actually influence the "mappability" of reads, or is this unaffected?

My thinking is that the information would be considered when mapping reads to the transcripts in the GTF file, if one is supplied with -G.

In that case, read pairs should be concordant with the transcript strand; i.e., if --library-type fr-firststrand was specified, and transcript A spans coordinates X to Y on the + strand, mate 1 of a pair should map as the reverse complement, toward the 3' end of transcript A, and mate 2 should map toward the 5' end, on the same strand as the transcript sequence.
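To make the expected orientations concrete, here is a minimal sketch that derives the transcript strand implied by a single alignment from its SAM flag and the library type. The function name `implied_transcript_strand` is made up for illustration; this is not TopHat's code, only the orientation rule described above (and its mirror image for fr-secondstrand), with unpaired reads treated like mate 1 as an assumption.

```python
# SAM flag bits, as defined in the SAM specification.
FLAG_REVERSE = 0x10  # read is aligned to the reverse strand
FLAG_MATE2 = 0x80    # read is the second mate of a pair


def implied_transcript_strand(flag, library_type):
    """Return '+', '-', or None for the transcript strand implied by one read.

    Hypothetical helper for illustration only; not part of TopHat.
    """
    if library_type == "fr-unstranded":
        return None  # read orientation carries no strand information
    if library_type not in ("fr-firststrand", "fr-secondstrand"):
        raise ValueError(f"unknown library type: {library_type}")

    read_strand = "-" if flag & FLAG_REVERSE else "+"
    # Assumption: unpaired reads behave like mate 1.
    is_mate1 = not (flag & FLAG_MATE2)

    # fr-secondstrand: mate 1 has the transcript's orientation, mate 2 the opposite.
    # fr-firststrand: mate 2 has the transcript's orientation, mate 1 the opposite.
    same_as_transcript = is_mate1 == (library_type == "fr-secondstrand")
    if same_as_transcript:
        return read_strand
    return "+" if read_strand == "-" else "-"
```

For example, with fr-firststrand a reverse-strand mate 1 (flag 83) and a forward-strand mate 2 (flag 163) both imply a transcript on the + strand, which matches the transcript A scenario above.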

However, if no GTF is supplied with -G, or in the subsequent stage, where reads that didn't map to the transcriptome are mapped to the whole genome, TopHat should make no use of library-type information, right?

    --library-type     

TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol.

From the TopHat manual:

Since the splice junction finding algorithm of TopHat makes use of library-type information (if provided), one of the two TopHat runs would result in many more splice junctions than the other one. You can then use the library type that gives more junctions. If this is not the case TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types.
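The comparison the manual suggests can be sketched in a few lines, assuming each TopHat run wrote a junctions.bed file into its own output directory. The function names and any paths passed in are hypothetical; the script only counts BED records and does nothing TopHat-specific.

```python
def count_junctions(bed_path):
    """Count junction records in a BED file, skipping blank, track, and comment lines."""
    n = 0
    with open(bed_path) as handle:
        for line in handle:
            if line.strip() and not line.startswith(("track", "#")):
                n += 1
    return n


def library_type_with_most_junctions(bed_by_library_type):
    """Given {library_type: path_to_junctions_bed}, return (best_type, counts)."""
    counts = {lt: count_junctions(path) for lt, path in bed_by_library_type.items()}
    best = max(counts, key=counts.get)
    return best, counts
```

It would be called with something like `{"fr-firststrand": "run_first/junctions.bed", "fr-secondstrand": "run_second/junctions.bed"}`, where the directory names are placeholders for your own output directories.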

So this indicates that the strandedness argument does influence the mapping algorithm. But HOW does TopHat use library-type information in its splice junction finding algorithm, if it has to be unbiased about which strand the actual transcripts are on?

rnaseq reads tophat mapping rna-seq • 3.8k views
Ashutosh Pandey (Philadelphia) wrote, 5.5 years ago:

Specifying the correct library type will ensure that the paired reads are mapped correctly, and should increase the mappability (if you meant the ability to align the reads).


Thanks, ashutosmits! Yes, that is what I meant. However, I can't see the way this would influence the ability to align reads. Does it have to do with the GTF file supplied in case of selecting -G? Otherwise, I don't see how TopHat would make assumptions about what should map to the + or - strand....

Reply written 5.5 years ago by gaelgarcia05
Kanne (Australia) wrote, 5.2 years ago:

I understand your confusion, and this thought just occurred to me. I haven't spent very long thinking about it, so maybe I'm forgetting something, but here's a suggestion anyway:

TopHat is a spliced read aligner. If you do not supply -G, it will still align reads across splice junctions; it will just work out the splice junctions de novo. TopHat uses the canonical donor/acceptor sequences when it defines splice sites. Hence, if you specify that your libraries are strand-specific, it would make sense for TopHat to look for the canonical donor/acceptor sequences only in the read that represents the RNA transcript, and for the reverse complement of the canonical donor/acceptor sites in the other read, and to ignore any canonical splice sequences on the biologically irrelevant strand. If you specify that your library is not strand-specific, it will need to look for the donor/acceptor sites and their reverse complements in both reads, since it can't be sure which strand the transcript originated from. If my thought is correct, then with a strand-specific library that is specified as such, you would end up with fewer opportunities for false-positive junctions, and presumably a faster run time too.
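The idea can be illustrated with a toy sketch. It is only a picture of the reasoning above, not TopHat's actual junction finder, and it assumes a simple GT-AG check; the function names are hypothetical. A candidate intron on the reference is accepted for the + strand if it reads GT...AG, and for the - strand if its reverse complement does (i.e. it reads CT...AC on the reference). Restricting the allowed strands, as a stranded library would, removes half of the candidates.

```python
def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGTacgt", "TGCAtgca"))[::-1]


def intron_strands(genome, start, end, allowed=("+", "-")):
    """Return the strands on which genome[start:end] looks like a GT-AG intron.

    `allowed` models the library type: both strands for an unstranded
    library, a single strand when read orientation fixes the transcript strand.
    Toy illustration only; not TopHat's algorithm.
    """
    intron = genome[start:end].upper()
    if len(intron) < 4:
        return []
    strands = []
    if "+" in allowed and intron.startswith("GT") and intron.endswith("AG"):
        strands.append("+")
    # A minus-strand intron reads GT...AG only after reverse complementing.
    rc = reverse_complement(intron)
    if "-" in allowed and rc.startswith("GT") and rc.endswith("AG"):
        strands.append("-")
    return strands
```

With `allowed=("+", "-")` a CT...AC gap is reported as a minus-strand junction candidate; with `allowed=("+",)` the same gap is ignored, which is the reduction in false-positive opportunities suggested above.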
