I'm trying to understand a BAM file (how many reads are mapped, etc.) made from aligning a partial RNA seq library. Partial here means I only took the first 100,000 FASTQ records from the total paired end-reads (so 2 file of 100,000 reads each). Of these 200,000 I did some filtering that trims down the records to 174,904, then fed them into the mapper (I used GSNAP) to obtain the BAM file, and then sorts it using
samtools sort. Before mapping, I've made sure all read pairs are intact (no singletons were present).
Now, when doing a
samtools flagstat on the sorted BAM file, here's the output (with my comments afterwards):
200794 + 0 in total (QC-passed reads + QC-failed reads) 0 + 0 duplicates 196324 + 0 mapped (97.77%:-nan%) # call this mapped 200794 + 0 paired in sequencing # call this paired 100187 + 0 read1 # this is read1 100607 + 0 read2 # this is read2 193396 + 0 properly paired (96.32%:-nan%) # call this properly_paired 194097 + 0 with itself and mate mapped # this is mate_mapped 2227 + 0 singletons (1.11%:-nan%) # singletons 500 + 0 with mate mapped to a different chr #and this is mate_weird 328 + 0 with mate mapped to a different chr (mapQ>=5)
I then tried several
samtools view command with the
-c and -f flags to see if I understand the outputs correct, e.g.:
$ samtools view -c -f 0x2 file.bam # selects for properly paired reads 193396 (checks out with flagstat) $ samtools view -c -f 0x40 file.bam # selects for first read 100187 (checks out with flagstat read1) $ samtools view -c -f 0x80 file.bam # selects for last read 100607 (checks out with flagstat read2) $ samtools view -c -f 0x42 file.bam # selects for properly paired first read 96702 $ samtools view -c -f 0x82 file.bam # selects for properly paired last read 96694
Now, my questions are:
Of the initial 174,904 FASTQ records fed into the mapper, why does flagstat show 200,794 reads? I thought perhaps this is because there may be split reads (reads mapped to different exons), but I'm not sure..
How come there are different numbers of mapped of
read2both before and after I select for properly paired reads? Aren't they supposed to be the same?
I did some counting just to figure out if the numbers check out, but I think I may have missed some. In particular, I don't understand these:
mate_mappedis 6697, while
singletonsis 2227. Why the discrepancy (4470) and what are those sequences?
mate_weirdis 193,597, while
properly_pairedis 193,396 (201 reads discrepancy). Why the discrepancy and what are those sequences?
Thanks in advance for the help :)!