What Does Samtools Flagstat Results Mean?
5
23
Entering edit mode
10.2 years ago
Sakti ▴ 480

Hi,

Once again, I'm bothering the knowledgeable people from this site :) I'm currently analyzing some RNAseq data from mouse, and after running TopHat on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref mouse genome annotation.

After performing the samtools flagstat on the accepted hits bam file, I get the following:

20968800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
20968800 + 0 mapped (100.00%:nan%)
20968800 + 0 paired in sequencing
11431237 + 0 read1
9537563 + 0 read2
71098 + 0 properly paired (0.34%:nan%)
6633306 + 0 with itself and mate mapped
14335494 + 0 singletons (68.37%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I've been trying to look on how to interpret these numbers, and what does that properly paired (0.34%:nan%) means, but so far I haven't been lucky. I don't really know how to interpret the final flagstat output, and whether my alignment conditions worked at all.

I'd appreciate any piece of information on this issue, thanks!!!!

rna samtools mouse • 75k views
ADD COMMENT
0
Entering edit mode

There are question marks about your stats for me. Your number of forward and reverse reads are very disparate and only a tiny fraction of your reads map properly in pairs.

ADD REPLY
32
Entering edit mode
10.2 years ago

Here is the C code in samtools:

typedef struct {
    long long n_reads[2], n_mapped[2], n_pair_all[2], n_pair_map[2], n_pair_good[2];
    long long n_sgltn[2], n_read1[2], n_read2[2];
    long long n_dup[2];
    long long n_diffchr[2], n_diffhigh[2];
} bam_flagstat_t;

 (...)
    printf("%lld + %lld in total (QC-passed reads + QC-failed reads)\n", s->n_reads[0], s->n_reads[1]);
    printf("%lld + %lld duplicates\n", s->n_dup[0], s->n_dup[1]);
    printf("%lld + %lld mapped (%.2f%%:%.2f%%)\n", s->n_mapped[0], s->n_mapped[1], (float)s->n_mapped[0] / s->n_reads[0] * 100.0, (float)s->n_mapped[1] / s->n_reads[1] * 100.0);
    printf("%lld + %lld paired in sequencing\n", s->n_pair_all[0], s->n_pair_all[1]);
    printf("%lld + %lld read1\n", s->n_read1[0], s->n_read1[1]);
    printf("%lld + %lld read2\n", s->n_read2[0], s->n_read2[1]);
    printf("%lld + %lld properly paired (%.2f%%:%.2f%%)\n", s->n_pair_good[0], s->n_pair_good[1], (float)s->n_pair_good[0] / s->n_pair_all[0] * 100.0, (float)s->n_pair_good[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with itself and mate mapped\n", s->n_pair_map[0], s->n_pair_map[1]);
    printf("%lld + %lld singletons (%.2f%%:%.2f%%)\n", s->n_sgltn[0], s->n_sgltn[1], (float)s->n_sgltn[0] / s->n_pair_all[0] * 100.0, (float)s->n_sgltn[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with mate mapped to a different chr\n", s->n_diffchr[0], s->n_diffchr[1]);
    printf("%lld + %lld with mate mapped to a different chr (mapQ>=5)\n", s->n_diffhigh[0], s->n_diffhigh[1]);
(...)
        ++(s)->n_reads[w];                                                \
        if ((c)->flag & BAM_FPAIRED) {                                    \
            ++(s)->n_pair_all[w];                                        \
            if ((c)->flag & BAM_FPROPER_PAIR) ++(s)->n_pair_good[w];    \
            if ((c)->flag & BAM_FREAD1) ++(s)->n_read1[w];                \
            if ((c)->flag & BAM_FREAD2) ++(s)->n_read2[w];                \
            if (((c)->flag & BAM_FMUNMAP) && !((c)->flag & BAM_FUNMAP)) ++(s)->n_sgltn[w];    \
            if (!((c)->flag & BAM_FUNMAP) && !((c)->flag & BAM_FMUNMAP)) { \
                ++(s)->n_pair_map[w];                                    \
                if ((c)->mtid != (c)->tid) {                            \
                    ++(s)->n_diffchr[w];                                \
                    if ((c)->qual >= 5) ++(s)->n_diffhigh[w];            \
                }
  • [2] is an array of two elements storing the number of reads and the number of 'QC-failed reads'.
  • 'NAN' means 'Not A Number' (e.g: div by 0)
  • n_reads are the total number of reads
  • n_pair_all : the read is paired in sequencing, no matter whether it is mapped in a pair
  • n_pair_good : the read is mapped in a proper pair
  • n_read1 : count read1
  • n_read2 : count read2
  • n_sgltn : the read itself is unmapped the mate is mapped
  • n_pair_map: the read itself is mapped the mate is unmapped
  • n_diffchr: number of reads with a mate mapped on a different chromosome
  • n_diffhigh: number of reads with a mate on a different chromosome having a quality greater than 5

for more information, see the spec

ADD COMMENT
3
Entering edit mode

I'm pretty sure 'total' is the total number of alignments (lines in the sam file), not total reads.

ADD REPLY
1
Entering edit mode

Hi,

are you sure about "n_pair_map: the read itself is mapped the mate is unmapped" ? I thought, this corresponds to the case where both ends are mapped

ADD REPLY
1
Entering edit mode

Why did you skip n_mapped? Specifically, what is the minimum mapping Q of these mapped reads?

Highest regards.

ADD REPLY
0
Entering edit mode

This is very useful, thanks Pierre!!!!

ADD REPLY
0
Entering edit mode

What about duplicates? If a read aligns several places. Is this indicated in the 0 + 0 duplicates line?

ADD REPLY
0
Entering edit mode

Would you be able to match your descriptions of these variables with the actual samtools flagstat output shown in the original post? Despite your thorough writeup, I still cannot clearly figure out what any of the numbers are supposed to mean. The C code is not readable at all so the variable names displayed do not help much in actually interpreting the flagstat messages. Thanks.

ADD REPLY
2
Entering edit mode
8.5 years ago
rocky.singh ▴ 50

I'm currently analyzing some RNAseq data for rice samples, and did alignment with BWA, removed PCR duplicates and later call SNPs with SAMtools on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref rice genome annotation. When running samtools flagstat I got this output. Is this a good alignment.

23209893 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23209893 + 0 mapped (100.00%:nan%)
23209893 + 0 paired in sequencing
11592951 + 0 read1
11616942 + 0 read2
22340050 + 0 properly paired (96.25%:nan%)
22636581 + 0 with itself and mate mapped
573312 + 0 singletons (2.47%:nan%)
20032 + 0 with mate mapped to a different chr
20032 + 0 with mate mapped to a different chr (mapQ>=5)
ADD COMMENT
2
Entering edit mode

please, fix your style ( bold= SHOUT) and post this as a new question.

ADD REPLY
2
Entering edit mode
5.4 years ago
brent_wilson ▴ 130

Sometimes the cause of massive amounts of improper pairing is due to your reads being out of order in the reads files (assuming you have separate R1 and R2 files). This can easily happen if you are filtering out reads from your files and both reads of a pair are not eliminated (ex: R2 loses read #1,000 but R1 doesn't lose a read until read #9,000).

ADD COMMENT
1
Entering edit mode
10.2 years ago
Marvin ▴ 850

"Properly paired" means both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size. The overwhelming majority of your paired-end reads should be "properly paired", your 0.34% look very wrong.

Moreover, since all you reads are "paired", but the number of "first mates" doesn't match the number of "second mates", I guess you're looking at a file where the unmapped reads have been left out. The numbers might make more sense if you could get the number of raw reads (including "unmapped") from somewhere.

ADD COMMENT
0
Entering edit mode

I see, then maybe it's related to the QC processing I made, which somehow altered the number of reads left. Thanks Marvin!

ADD REPLY

Login before adding your answer.

Traffic: 2832 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6