Hello

I am new to RNA-seq. I am using TopHat2 to map single-end reads to mm9, with the following command:

tophat -p 10 --max-multihits 1 -G genes_mm9.gff -o output genome_mm9 reads.fastq

With --max-multihits 1, I assume I will get one alignment per read. If so, the total number of reads that TopHat2 uses for mapping (19196075) should equal the number of alignments in the accepted_hits.bam file (8797938) plus the number of reads in the unmapped.bam file (7538885). But that is not the case: 2859252 reads are missing. Is my reasoning correct?
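The bookkeeping above can be written out as a quick shell check. This is only a sketch of the arithmetic: the counts are the ones quoted in this post, and the variable names are made up for illustration.

```shell
# Counts quoted in the post; variable names are illustrative only.
total_used=19196075   # reads TopHat2 reports using for mapping
aligned=8797938       # alignments in accepted_hits.bam
unmapped=7538885      # reads in unmapped.bam

# Reads accounted for if every mapped read has exactly one alignment.
accounted=$((aligned + unmapped))

# Reads that appear in neither file under that assumption.
missing=$((total_used - accounted))

echo "accounted for: $accounted"
echo "missing:       $missing"
```

The "missing" value reproduces the 2859252 figure, so the gap is not an addition mistake.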

Samuel

Were all the reads of the same length? TopHat2 will filter out reads that are too short.

19196075 is the number of reads used for mapping.

reads_in = 19197489
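As a rough check, assuming reads_in is the number of input reads and 19196075 is the number kept for mapping (that relationship is an assumption, as are the variable names), the difference gives the number of reads dropped before mapping:

```shell
# Both counts are quoted in this thread; variable names are illustrative only.
reads_in=19197489     # reads_in value reported above
reads_used=19196075   # reads used for mapping

# Reads removed before mapping, under the assumption stated above.
filtered=$((reads_in - reads_used))

echo "filtered before mapping: $filtered"
```

That comes to 1414 reads, far fewer than the 2859252 in question.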


Thanks Ashtosh, that was my first thought. But the TopHat manual is not really clear to me on this point. According to the manual:

-g/--max-multihits <int> Instructs TopHat to allow up to this many alignments to the
reference for a given read, and choose the alignments based on
their alignment scores if there are more than this number. The
default is 20 for read mapping. Unless you use
--report-secondary-alignments, TopHat will report the
alignments with the best alignment score. If there are more
alignments with the same score than this number, TopHat will
randomly report only this many alignments. In case of using
--report-secondary-alignments, TopHat will try to report
alignments up to this option value, and TopHat may randomly
output some of the alignments with the same score to meet
this number.


My understanding is that with -g/--max-multihits 1, TopHat will report the best alignment for each read, or randomly select one alignment if several have the same score.

Maybe my interpretation is wrong.

Thanks

