7.1 years ago by
Washington University School of Medicine, St. Louis, USA
FASTQ files do not contain alignments, only the raw read sequences to be aligned. You may therefore have reads in your FASTQ file that are not represented in accepted_hits.bam because they did not align. You will also have reads in your FASTQ file that correspond to more than one alignment in accepted_hits.bam when the placement of the read in your genome is ambiguous, i.e. the sequence matches equally well to multiple places. By default, TopHat allows up to 20 such 'multi-hits' per read. You can control this behavior with the option '-g/--max-multihits <int>'. From the docs:
Instructs TopHat to allow up to this many alignments to the reference for a given read, and choose the alignments based on their alignment scores if there are more than this number. The default is 20 for read mapping. Unless you use --report-secondary-alignments, TopHat will report the alignments with the best alignment score. If there are more alignments with the same score than this number, TopHat will randomly report only this many alignments. In case of using --report-secondary-alignments, TopHat will try to report alignments up to this option value, and TopHat may randomly output some of the alignments with the same score to meet this number.
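If you want to see how the two files relate for your own data, a rough sketch along the following lines may help. It counts FASTQ records and tallies how many alignment records each read name has in SAM text. The function names and the toy data are made up for illustration and are not part of TopHat. It assumes single-end reads, a standard 4-line-per-record FASTQ, and SAM text rather than binary BAM; with paired-end data both mates can share a read name, so the tallies would need adjusting.

```python
from collections import Counter


def count_fastq_reads(lines):
    """Count records in a FASTQ stream, assuming exactly 4 lines per record."""
    n = sum(1 for line in lines if line.strip())
    if n % 4 != 0:
        raise ValueError("FASTQ line count is not a multiple of 4")
    return n // 4


def alignments_per_read(sam_lines):
    """Tally alignment records per read name from SAM text lines."""
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 11:
            continue  # not a complete SAM record
        if int(fields[1]) & 0x4:
            continue  # FLAG bit 0x4 set: read is unmapped
        counts[fields[0]] += 1
    return counts


def summarize(fastq_lines, sam_lines):
    """Compare the number of input reads with the reported alignments."""
    total = count_fastq_reads(fastq_lines)
    per_read = alignments_per_read(sam_lines)
    return {
        "fastq_reads": total,
        "aligned_reads": len(per_read),
        "unaligned_reads": total - len(per_read),
        "multi_hit_reads": sum(1 for c in per_read.values() if c > 1),
        "alignment_records": sum(per_read.values()),
    }


# Toy data (hypothetical): r1 aligns once, r2 aligns twice, r3 does not align.
TOY_FASTQ = [
    "@r1", "ACGT", "+", "IIII",
    "@r2", "GGCC", "+", "IIII",
    "@r3", "TTTT", "+", "IIII",
]
TOY_SAM = [
    "@HD\tVN:1.0\tSO:coordinate",
    "r1\t0\tchr1\t100\t50\t4M\t*\t0\t0\tACGT\tIIII\tNH:i:1",
    "r2\t0\tchr1\t200\t3\t4M\t*\t0\t0\tGGCC\tIIII\tNH:i:2",
    "r2\t256\tchr2\t500\t3\t4M\t*\t0\t0\tGGCC\tIIII\tNH:i:2",
]

if __name__ == "__main__":
    print(summarize(TOY_FASTQ, TOY_SAM))
```

On real data you could feed the function an open FASTQ file handle and the text output of `samtools view accepted_hits.bam`. The number of alignment records can exceed the number of aligned reads whenever multi-hits are reported, which is the discrepancy described above.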