Question

What Does Samtools Flagstat Results Mean?

27

Entering edit mode

13.1 years ago

Sakti ▴ 530

Hi,

Once again, I'm bothering the knowledgeable people from this site :) I'm currently analyzing some RNAseq data from mouse, and after running TopHat on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref mouse genome annotation.

After performing the samtools flagstat on the accepted hits bam file, I get the following:

20968800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
20968800 + 0 mapped (100.00%:nan%)
20968800 + 0 paired in sequencing
11431237 + 0 read1
9537563 + 0 read2
71098 + 0 properly paired (0.34%:nan%)
6633306 + 0 with itself and mate mapped
14335494 + 0 singletons (68.37%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I've been trying to look on how to interpret these numbers, and what does that properly paired (0.34%:nan%) means, but so far I haven't been lucky. I don't really know how to interpret the final flagstat output, and whether my alignment conditions worked at all.

I'd appreciate any piece of information on this issue, thanks!!!!

rna samtools mouse • 92k views

ADD COMMENT • link updated 2.6 years ago by Pierre Lindenbaum 164k • written 13.1 years ago by Sakti ▴ 530

0

Entering edit mode

There are question marks about your stats for me. Your number of forward and reverse reads are very disparate and only a tiny fraction of your reads map properly in pairs.

ADD REPLY • link 13.1 years ago by Travis ★ 2.8k

Ram · Answer 1 · 2011-09-24

Here is the C code in samtools:

typedef struct {
    long long n_reads[2], n_mapped[2], n_pair_all[2], n_pair_map[2], n_pair_good[2];
    long long n_sgltn[2], n_read1[2], n_read2[2];
    long long n_dup[2];
    long long n_diffchr[2], n_diffhigh[2];
} bam_flagstat_t;

 (...)
    printf("%lld + %lld in total (QC-passed reads + QC-failed reads)\n", s->n_reads[0], s->n_reads[1]);
    printf("%lld + %lld duplicates\n", s->n_dup[0], s->n_dup[1]);
    printf("%lld + %lld mapped (%.2f%%:%.2f%%)\n", s->n_mapped[0], s->n_mapped[1], (float)s->n_mapped[0] / s->n_reads[0] * 100.0, (float)s->n_mapped[1] / s->n_reads[1] * 100.0);
    printf("%lld + %lld paired in sequencing\n", s->n_pair_all[0], s->n_pair_all[1]);
    printf("%lld + %lld read1\n", s->n_read1[0], s->n_read1[1]);
    printf("%lld + %lld read2\n", s->n_read2[0], s->n_read2[1]);
    printf("%lld + %lld properly paired (%.2f%%:%.2f%%)\n", s->n_pair_good[0], s->n_pair_good[1], (float)s->n_pair_good[0] / s->n_pair_all[0] * 100.0, (float)s->n_pair_good[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with itself and mate mapped\n", s->n_pair_map[0], s->n_pair_map[1]);
    printf("%lld + %lld singletons (%.2f%%:%.2f%%)\n", s->n_sgltn[0], s->n_sgltn[1], (float)s->n_sgltn[0] / s->n_pair_all[0] * 100.0, (float)s->n_sgltn[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with mate mapped to a different chr\n", s->n_diffchr[0], s->n_diffchr[1]);
    printf("%lld + %lld with mate mapped to a different chr (mapQ>=5)\n", s->n_diffhigh[0], s->n_diffhigh[1]);
(...)
        ++(s)->n_reads[w];                                                \
        if ((c)->flag & BAM_FPAIRED) {                                    \
            ++(s)->n_pair_all[w];                                        \
            if ((c)->flag & BAM_FPROPER_PAIR) ++(s)->n_pair_good[w];    \
            if ((c)->flag & BAM_FREAD1) ++(s)->n_read1[w];                \
            if ((c)->flag & BAM_FREAD2) ++(s)->n_read2[w];                \
            if (((c)->flag & BAM_FMUNMAP) && !((c)->flag & BAM_FUNMAP)) ++(s)->n_sgltn[w];    \
            if (!((c)->flag & BAM_FUNMAP) && !((c)->flag & BAM_FMUNMAP)) { \
                ++(s)->n_pair_map[w];                                    \
                if ((c)->mtid != (c)->tid) {                            \
                    ++(s)->n_diffchr[w];                                \
                    if ((c)->qual >= 5) ++(s)->n_diffhigh[w];            \
                }

[2] is an array of two elements storing the number of reads and the number of 'QC-failed reads'.
'NAN' means 'Not A Number' (e.g: div by 0)
n_reads are the total number of reads
n_pair_all : the read is paired in sequencing, no matter whether it is mapped in a pair
n_pair_good : the read is mapped in a proper pair
n_read1 : count read1
n_read2 : count read2
n_sgltn : the read itself is unmapped the mate is mapped
n_pair_map: the read itself is mapped the mate is unmapped
n_diffchr: number of reads with a mate mapped on a different chromosome
n_diffhigh: number of reads with a mate on a different chromosome having a quality greater than 5

for more information, see the spec

score 5 · Answer 2 · 2011-09-25

5

Entering edit mode

13.1 years ago

Marvin ▴ 900

"Properly paired" means both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size. The overwhelming majority of your paired-end reads should be "properly paired", your 0.34% look very wrong.

Moreover, since all you reads are "paired", but the number of "first mates" doesn't match the number of "second mates", I guess you're looking at a file where the unmapped reads have been left out. The numbers might make more sense if you could get the number of raw reads (including "unmapped") from somewhere.

ADD COMMENT • link 13.1 years ago by Marvin ▴ 900

0

Entering edit mode

I see, then maybe it's related to the QC processing I made, which somehow altered the number of reads left. Thanks Marvin!

ADD REPLY • link 13.1 years ago by Sakti ▴ 530

score 3 · Answer 3 · 2016-07-26

Sometimes the cause of massive amounts of improper pairing is due to your reads being out of order in the reads files (assuming you have separate R1 and R2 files). This can easily happen if you are filtering out reads from your files and both reads of a pair are not eliminated (ex: R2 loses read #1,000 but R1 doesn't lose a read until read #9,000).

Ram · Answer 4 · 2013-06-03

I'm currently analyzing some RNAseq data for rice samples, and did alignment with BWA, removed PCR duplicates and later call SNPs with SAMtools on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref rice genome annotation. When running samtools flagstat I got this output. Is this a good alignment.

23209893 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23209893 + 0 mapped (100.00%:nan%)
23209893 + 0 paired in sequencing
11592951 + 0 read1
11616942 + 0 read2
22340050 + 0 properly paired (96.25%:nan%)
22636581 + 0 with itself and mate mapped
573312 + 0 singletons (2.47%:nan%)
20032 + 0 with mate mapped to a different chr
20032 + 0 with mate mapped to a different chr (mapQ>=5)