When I specify --library-type to TopHat, i.e., first-strand, second-strand, or unstranded, TopHat appends a value of + or - to the XS:A tag, which is useful for subsequent analyses, such as annotation.
However, does this information actually influence the "mappability" of reads, or is this unaffected?
My thinking is that the information would be considered when mapping reads to the transcripts in the GTF file, if one is supplied with -G. In that dataset, read pairs should be concordant with the transcript strand; i.e., if --library-type first-strand was indicated, and transcript A is at coords X to Y on the + strand, mate 1 of a pair should map to the reverse complement of the 3' end of transcript A, and mate 2 of the pair should map to the 5' end, in the same strand as the transcript sequence.
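To make that expectation concrete, here is a minimal Python sketch of the rule as I understand it. It is not TopHat's code: the function name and arguments are made up for illustration, and I am assuming the library-type names from the TopHat manual (fr-unstranded, fr-firststrand, fr-secondstrand), with fr-firststrand meaning mate 1 aligns opposite to the transcript and mate 2 aligns to the same strand.

```python
from typing import Optional


def implied_transcript_strand(library_type: str,
                              is_mate1: bool,
                              is_reverse: bool) -> Optional[str]:
    """Return the transcript strand ('+' or '-') implied by one aligned mate.

    Hypothetical helper, not part of TopHat. Returns None for unstranded
    libraries, where the read orientation says nothing about the strand.
    """
    if library_type == "fr-unstranded":
        return None
    if library_type not in ("fr-firststrand", "fr-secondstrand"):
        raise ValueError(f"unsupported library type: {library_type}")

    # Assumption: in fr-secondstrand, mate 1 lies on the transcript strand;
    # in fr-firststrand, mate 2 does. The other mate lies on the opposite strand.
    if library_type == "fr-secondstrand":
        mate_on_transcript_strand = is_mate1
    else:
        mate_on_transcript_strand = not is_mate1

    read_on_plus = not is_reverse
    transcript_on_plus = (read_on_plus if mate_on_transcript_strand
                          else not read_on_plus)
    return "+" if transcript_on_plus else "-"
```

With fr-firststrand and a transcript on the + strand, this reproduces the case above: mate 1 aligned in reverse and mate 2 aligned forward both imply +.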
However, if no GTF is supplied with -G, or in the subsequent stage of mapping reads that didn't map to the transcriptome, now to the whole genome, then TopHat should make no use of --library-type information, right?
--library-type TopHat will treat the reads as strand specific. Every read alignment will have an XS attribute tag. Consider supplying library type options below to select the correct RNA-seq protocol.
From TH Manual:
Since the splice junction finding algorithm of TopHat makes use of library-type information (if provided), one of the two TopHat runs would result in many more splice junctions than the other one. You can then use the library type that gives more junctions. If this is not the case TopHat might not work well with your sequencing protocol. Please let us know more details about your protocol so we can add support for new library types.
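The comparison the manual suggests could be scripted roughly like this. It is only a sketch: I am assuming each run wrote a BED-style junctions file with one junction per line plus optional track/browser/comment header lines, and the function names are hypothetical.

```python
from typing import Dict, Iterable


def count_junctions(bed_lines: Iterable[str]) -> int:
    """Count junction records, skipping blank lines and BED header lines."""
    count = 0
    for line in bed_lines:
        line = line.strip()
        if not line or line.startswith(("track", "browser", "#")):
            continue
        count += 1
    return count


def library_type_with_most_junctions(counts: Dict[str, int]) -> str:
    """Return the library type whose run reported the most junctions."""
    return max(counts, key=counts.get)
```

Usage would be to open the junctions file from each run, pass the file object to count_junctions, and collect the results in a dict keyed by library type. This only says which setting yields more junctions; it does not explain how TopHat uses the setting internally.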
So this indicates that the strandedness argument does influence the mapping algorithm. But HOW does TopHat use library-type information in its splice junction finding algorithm, if it has to be unbiased as to which strand the actual transcripts lie on?