What Does Samtools Flagstat Results Mean?
5
23
Entering edit mode
10.2 years ago
Sakti ▴ 480

Hi,

Once again, I'm bothering the knowledgeable people from this site :) I'm currently analyzing some RNAseq data from mouse, and after running TopHat on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref mouse genome annotation.

After performing the samtools flagstat on the accepted hits bam file, I get the following:

20968800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
20968800 + 0 mapped (100.00%:nan%)
20968800 + 0 paired in sequencing
71098 + 0 properly paired (0.34%:nan%)
6633306 + 0 with itself and mate mapped
14335494 + 0 singletons (68.37%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)


I've been trying to look on how to interpret these numbers, and what does that properly paired (0.34%:nan%) means, but so far I haven't been lucky. I don't really know how to interpret the final flagstat output, and whether my alignment conditions worked at all.

I'd appreciate any piece of information on this issue, thanks!!!!

rna samtools mouse • 75k views
0
Entering edit mode

32
Entering edit mode
10.2 years ago

Here is the C code in samtools:

typedef struct {
long long n_reads[2], n_mapped[2], n_pair_all[2], n_pair_map[2], n_pair_good[2];
long long n_dup[2];
long long n_diffchr[2], n_diffhigh[2];
} bam_flagstat_t;

(...)
printf("%lld + %lld duplicates\n", s->n_dup[0], s->n_dup[1]);
printf("%lld + %lld mapped (%.2f%%:%.2f%%)\n", s->n_mapped[0], s->n_mapped[1], (float)s->n_mapped[0] / s->n_reads[0] * 100.0, (float)s->n_mapped[1] / s->n_reads[1] * 100.0);
printf("%lld + %lld paired in sequencing\n", s->n_pair_all[0], s->n_pair_all[1]);
printf("%lld + %lld properly paired (%.2f%%:%.2f%%)\n", s->n_pair_good[0], s->n_pair_good[1], (float)s->n_pair_good[0] / s->n_pair_all[0] * 100.0, (float)s->n_pair_good[1] / s->n_pair_all[1] * 100.0);
printf("%lld + %lld with itself and mate mapped\n", s->n_pair_map[0], s->n_pair_map[1]);
printf("%lld + %lld singletons (%.2f%%:%.2f%%)\n", s->n_sgltn[0], s->n_sgltn[1], (float)s->n_sgltn[0] / s->n_pair_all[0] * 100.0, (float)s->n_sgltn[1] / s->n_pair_all[1] * 100.0);
printf("%lld + %lld with mate mapped to a different chr\n", s->n_diffchr[0], s->n_diffchr[1]);
printf("%lld + %lld with mate mapped to a different chr (mapQ>=5)\n", s->n_diffhigh[0], s->n_diffhigh[1]);
(...)
if ((c)->flag & BAM_FPAIRED) {                                    \
++(s)->n_pair_all[w];                                        \
if ((c)->flag & BAM_FPROPER_PAIR) ++(s)->n_pair_good[w];    \
if (((c)->flag & BAM_FMUNMAP) && !((c)->flag & BAM_FUNMAP)) ++(s)->n_sgltn[w];    \
if (!((c)->flag & BAM_FUNMAP) && !((c)->flag & BAM_FMUNMAP)) { \
++(s)->n_pair_map[w];                                    \
if ((c)->mtid != (c)->tid) {                            \
++(s)->n_diffchr[w];                                \
if ((c)->qual >= 5) ++(s)->n_diffhigh[w];            \
}

• [2] is an array of two elements storing the number of reads and the number of 'QC-failed reads'.
• 'NAN' means 'Not A Number' (e.g: div by 0)
• n_reads are the total number of reads
• n_pair_all : the read is paired in sequencing, no matter whether it is mapped in a pair
• n_pair_good : the read is mapped in a proper pair
• n_read1 : count read1
• n_read2 : count read2
• n_sgltn : the read itself is unmapped the mate is mapped
• n_pair_map: the read itself is mapped the mate is unmapped
• n_diffchr: number of reads with a mate mapped on a different chromosome
• n_diffhigh: number of reads with a mate on a different chromosome having a quality greater than 5

3
Entering edit mode

I'm pretty sure 'total' is the total number of alignments (lines in the sam file), not total reads.

1
Entering edit mode

Hi,

are you sure about "n_pair_map: the read itself is mapped the mate is unmapped" ? I thought, this corresponds to the case where both ends are mapped

1
Entering edit mode

Why did you skip n_mapped? Specifically, what is the minimum mapping Q of these mapped reads?

Highest regards.

0
Entering edit mode

This is very useful, thanks Pierre!!!!

0
Entering edit mode

What about duplicates? If a read aligns several places. Is this indicated in the 0 + 0 duplicates line?

0
Entering edit mode

Would you be able to match your descriptions of these variables with the actual samtools flagstat output shown in the original post? Despite your thorough writeup, I still cannot clearly figure out what any of the numbers are supposed to mean. The C code is not readable at all so the variable names displayed do not help much in actually interpreting the flagstat messages. Thanks.

2
Entering edit mode
8.5 years ago
rocky.singh ▴ 50

I'm currently analyzing some RNAseq data for rice samples, and did alignment with BWA, removed PCR duplicates and later call SNPs with SAMtools on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref rice genome annotation. When running samtools flagstat I got this output. Is this a good alignment.

23209893 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
23209893 + 0 mapped (100.00%:nan%)
23209893 + 0 paired in sequencing
22340050 + 0 properly paired (96.25%:nan%)
22636581 + 0 with itself and mate mapped
573312 + 0 singletons (2.47%:nan%)
20032 + 0 with mate mapped to a different chr
20032 + 0 with mate mapped to a different chr (mapQ>=5)

2
Entering edit mode

please, fix your style ( bold= SHOUT) and post this as a new question.

2
Entering edit mode
5.4 years ago
brent_wilson ▴ 130

Sometimes the cause of massive amounts of improper pairing is due to your reads being out of order in the reads files (assuming you have separate R1 and R2 files). This can easily happen if you are filtering out reads from your files and both reads of a pair are not eliminated (ex: R2 loses read #1,000 but R1 doesn't lose a read until read #9,000).

1
Entering edit mode
10.2 years ago
Marvin ▴ 850

"Properly paired" means both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size. The overwhelming majority of your paired-end reads should be "properly paired", your 0.34% look very wrong.

Moreover, since all you reads are "paired", but the number of "first mates" doesn't match the number of "second mates", I guess you're looking at a file where the unmapped reads have been left out. The numbers might make more sense if you could get the number of raw reads (including "unmapped") from somewhere.

0
Entering edit mode

I see, then maybe it's related to the QC processing I made, which somehow altered the number of reads left. Thanks Marvin!