I am using TopHat to align 100bp single-end RNA-seq reads to the human transcriptome (using hg19). I have noticed a large difference between the number of reads reported in the prep_reads step and the number reported in the align_summary step.
As an example, here is the prep_reads.info file from one of my samples:
min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224
And here is the align summary:
Reads:
    Input     :   2599620
     Mapped   :   1557662 (59.9% of input)
      of these:    125871 ( 8.1%) have multiple alignments (80 have >20)
59.9% overall read mapping rate.
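Why is the number of reads in prep_reads.info (about 22.5 million for both reads_in and reads_out) so much higher than the number of reads listed as Input when calculating the alignment rate (about 2.6 million)? My understanding is that prep_reads is the step that filters out reads, so I expected the alignment input to be close to reads_out.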
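To make the discrepancy concrete, here is a minimal Python sketch that compares the two sets of counts. The numbers are copied from the output above. The helper names (parse_prep_reads, parse_align_summary) are hypothetical, and the sketch assumes the files use the key=value and "Label : count" layouts shown in this post; it is not part of TopHat.

```python
import re

# Values copied from this post. In practice they would be read from the
# prep_reads.info and align summary files in the TopHat output directory.
PREP_READS_INFO = """\
min_read_len=101
max_read_len=101
reads_in =22536887
reads_out=22535224
"""

ALIGN_SUMMARY = """\
Reads:
    Input     :   2599620
     Mapped   :   1557662 (59.9% of input)
      of these:    125871 ( 8.1%) have multiple alignments (80 have >20)
59.9% overall read mapping rate.
"""


def parse_prep_reads(text):
    """Parse key=value lines, tolerating spaces around '=' (hypothetical helper)."""
    counts = {}
    for line in text.splitlines():
        if "=" in line:
            key, value = line.split("=", 1)
            counts[key.strip()] = int(value.strip())
    return counts


def parse_align_summary(text):
    """Extract the 'Input' and 'Mapped' read counts (hypothetical helper)."""
    counts = {}
    for label in ("Input", "Mapped"):
        match = re.search(rf"{label}\s*:\s*(\d+)", text)
        if match:
            counts[label.lower()] = int(match.group(1))
    return counts


prep = parse_prep_reads(PREP_READS_INFO)
summary = parse_align_summary(ALIGN_SUMMARY)

# Reads removed by the prep_reads filtering step.
dropped_by_prep = prep["reads_in"] - prep["reads_out"]
# Fraction of the prepared reads that appear as alignment input.
input_vs_prep = summary["input"] / prep["reads_out"]
# Mapping rate as computed from the alignment summary counts.
mapped_rate = summary["mapped"] / summary["input"]

print(f"dropped by prep_reads:       {dropped_by_prep}")
print(f"align input / reads_out:     {input_vs_prep:.1%}")
print(f"mapped / align input:        {mapped_rate:.1%}")
```

With the values above, prep_reads drops only 1,663 reads, while the alignment Input is roughly 11.5% of reads_out, so the filtering in prep_reads does not account for the gap.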