Question: (Closed) How To Interpret/Reconcile The Counts Obtained Via Samtools Flagstat
3
gravatar for bow
6.2 years ago by
bow780
Netherlands
bow780 wrote:

I'm trying to understand a BAM file (how many reads are mapped, etc.) made from aligning a partial RNA seq library. Partial here means I only took the first 100,000 FASTQ records from the total paired end-reads (so 2 file of 100,000 reads each). Of these 200,000 I did some filtering that trims down the records to 174,904, then fed them into the mapper (I used GSNAP) to obtain the BAM file, and then sorts it using samtools sort. Before mapping, I've made sure all read pairs are intact (no singletons were present).

Now, when doing a samtools flagstat on the sorted BAM file, here's the output (with my comments afterwards):

200794 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
196324 + 0 mapped (97.77%:-nan%)  # call this mapped
200794 + 0 paired in sequencing # call this paired
100187 + 0 read1 # this is read1
100607 + 0 read2 # this is read2
193396 + 0 properly paired (96.32%:-nan%) # call this properly_paired
194097 + 0 with itself and mate mapped # this is mate_mapped
2227 + 0 singletons (1.11%:-nan%) # singletons
500 + 0 with mate mapped to a different chr #and this is mate_weird
328 + 0 with mate mapped to a different chr (mapQ>=5)

I then tried several samtools view command with the -c and -f flags to see if I understand the outputs correct, e.g.:

$ samtools view -c -f 0x2 file.bam # selects for properly paired reads
193396 (checks out with flagstat)

$ samtools view -c -f 0x40 file.bam # selects for first read
100187 (checks out with flagstat read1)

$ samtools view -c -f 0x80 file.bam # selects for last read
100607 (checks out with flagstat read2)

$ samtools view -c -f 0x42 file.bam # selects for properly paired first read
96702

$ samtools view -c -f 0x82 file.bam # selects for properly paired last read
96694

Now, my questions are:

  1. Of the initial 174,904 FASTQ records fed into the mapper, why does flagstat show 200,794 reads? I thought perhaps this is because there may be split reads (reads mapped to different exons), but I'm not sure..

  2. How come there are different numbers of mapped of read1 and read2 both before and after I select for properly paired reads? Aren't they supposed to be the same?

  3. I did some counting just to figure out if the numbers check out, but I think I may have missed some. In particular, I don't understand these:

  4. total - mate_mapped is 6697, while singletons is 2227. Why the discrepancy (4470) and what are those sequences?

  5. mate_mapped - mate_weird is 193,597, while properly_paired is 193,396 (201 reads discrepancy). Why the discrepancy and what are those sequences?

Thanks in advance for the help :)!

bam samtools • 5.5k views
ADD COMMENTlink modified 6.2 years ago by Jeremy Leipzig18k • written 6.2 years ago by bow780

what did you tell gsnap to do with mulitple hits?

ADD REPLYlink written 6.2 years ago by Jeremy Leipzig18k

Hmm..nothing specific? I only used the -t, -N, -B 4, -A sam, -D and -d flags. (The output was indeed SAM, later compressed to BAM).

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by bow780
3
gravatar for Jeremy Leipzig
6.2 years ago by
Philadelphia, PA
Jeremy Leipzig18k wrote:

flagstat does not count reads, it counts alignments. The reporting verbiage could probably be better.

I suspect the gsnap multiple hit default is not -n 1, so you are seeing multiple hits.

ADD COMMENTlink written 6.2 years ago by Jeremy Leipzig18k
1

"flagstat does not count reads, it counts alignments. The reporting verbiage could probably be better."

This is what probably tripped me in the first place. After checking the code and looking deeper into the BAM file, I finally reconciled my numbers :). Thanks for pushing in the right direction!

ADD REPLYlink written 6.2 years ago by bow780
1
gravatar for Istvan Albert
6.2 years ago by
Istvan Albert ♦♦ 80k
University Park, USA
Istvan Albert ♦♦ 80k wrote:

All of these seeming discrepancies are just a reflection that the words singleton, proper pair may have have different definitions. As a rule I found that the categories corresponding to flags may not be mutually exclusive (disjoint readsets). For example does mate_mapped imply that the current read is also mapped?

Overall it looks like each read may be represented more than once in the dataset. Then I would recommend searching this site for the word flagstat, it will produce a number of helpful discussions.

ADD COMMENTlink written 6.2 years ago by Istvan Albert ♦♦ 80k

Indeed. I finally reconciled the numbers after playing further with different flags.

ADD REPLYlink written 6.2 years ago by bow780

Could you give more info about your reconciled numbers? Thanks a lot!

ADD REPLYlink written 4.6 years ago by xiaofeiwang19826620
Please log in to add an answer.
The thread is closed. No new answers may be added.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour