Question: What Does Samtools Flagstat Results Mean?
13
gravatar for Sakti
6.8 years ago by
Sakti350
United States
Sakti350 wrote:

Hi,

Once again, I'm bothering the knowledgeable people from this site :) I'm currently analyzing some RNAseq data from mouse, and after running TopHat on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref mouse genome annotation.

After performing the samtools flagstat on the accepted hits bam file, I get the following:

20968800 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
20968800 + 0 mapped (100.00%:nan%)
20968800 + 0 paired in sequencing
11431237 + 0 read1
9537563 + 0 read2
71098 + 0 properly paired (0.34%:nan%)
6633306 + 0 with itself and mate mapped
14335494 + 0 singletons (68.37%:nan%)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)

I've been trying to look on how to interpret these numbers, and what does that properly paired (0.34%:nan%) means, but so far I haven't been lucky. I don't really know how to interpret the final flagstat output, and whether my alignment conditions worked at all.

I'd appreciate any piece of information on this issue, thanks!!!!

rna samtools mouse • 49k views
ADD COMMENTlink modified 24 months ago by brent_wilson70 • written 6.8 years ago by Sakti350

There are question marks about your stats for me. Your number of forward and reverse reads are very disparate and only a tiny fraction of your reads map properly in pairs.

ADD REPLYlink written 6.8 years ago by Travis2.8k
19
gravatar for Pierre Lindenbaum
6.8 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum109k wrote:

Here is the C code in samtools:

typedef struct {
    long long n_reads[2], n_mapped[2], n_pair_all[2], n_pair_map[2], n_pair_good[2];
    long long n_sgltn[2], n_read1[2], n_read2[2];
    long long n_dup[2];
    long long n_diffchr[2], n_diffhigh[2];
} bam_flagstat_t;

 (...)
    printf("%lld + %lld in total (QC-passed reads + QC-failed reads)\n", s->n_reads[0], s->n_reads[1]);
    printf("%lld + %lld duplicates\n", s->n_dup[0], s->n_dup[1]);
    printf("%lld + %lld mapped (%.2f%%:%.2f%%)\n", s->n_mapped[0], s->n_mapped[1], (float)s->n_mapped[0] / s->n_reads[0] * 100.0, (float)s->n_mapped[1] / s->n_reads[1] * 100.0);
    printf("%lld + %lld paired in sequencing\n", s->n_pair_all[0], s->n_pair_all[1]);
    printf("%lld + %lld read1\n", s->n_read1[0], s->n_read1[1]);
    printf("%lld + %lld read2\n", s->n_read2[0], s->n_read2[1]);
    printf("%lld + %lld properly paired (%.2f%%:%.2f%%)\n", s->n_pair_good[0], s->n_pair_good[1], (float)s->n_pair_good[0] / s->n_pair_all[0] * 100.0, (float)s->n_pair_good[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with itself and mate mapped\n", s->n_pair_map[0], s->n_pair_map[1]);
    printf("%lld + %lld singletons (%.2f%%:%.2f%%)\n", s->n_sgltn[0], s->n_sgltn[1], (float)s->n_sgltn[0] / s->n_pair_all[0] * 100.0, (float)s->n_sgltn[1] / s->n_pair_all[1] * 100.0);
    printf("%lld + %lld with mate mapped to a different chr\n", s->n_diffchr[0], s->n_diffchr[1]);
    printf("%lld + %lld with mate mapped to a different chr (mapQ>=5)\n", s->n_diffhigh[0], s->n_diffhigh[1]);
(...)
        ++(s)->n_reads[w];                                                \
        if ((c)->flag & BAM_FPAIRED) {                                    \
            ++(s)->n_pair_all[w];                                        \
            if ((c)->flag & BAM_FPROPER_PAIR) ++(s)->n_pair_good[w];    \
            if ((c)->flag & BAM_FREAD1) ++(s)->n_read1[w];                \
            if ((c)->flag & BAM_FREAD2) ++(s)->n_read2[w];                \
            if (((c)->flag & BAM_FMUNMAP) && !((c)->flag & BAM_FUNMAP)) ++(s)->n_sgltn[w];    \
            if (!((c)->flag & BAM_FUNMAP) && !((c)->flag & BAM_FMUNMAP)) { \
                ++(s)->n_pair_map[w];                                    \
                if ((c)->mtid != (c)->tid) {                            \
                    ++(s)->n_diffchr[w];                                \
                    if ((c)->qual >= 5) ++(s)->n_diffhigh[w];            \
                }
  • [2] is an array of two elements storing the number of reads and the number of 'QC-failed reads' .
  • 'NAN' means 'Not A Number' (e.g: div by 0 )
  • n_reads are the total number of reads
  • n_pair_all : the read is paired in sequencing, no matter whether it is mapped in a pair
  • n_pair_good : the read is mapped in a proper pair
  • n_read1 : count read1
  • n_read2 : count read2
  • n_sgltn : the read itself is unmapped the mate is mapped
  • n_pair_map: the read itself is mapped the mate is unmapped
  • n_diffchr: number of reads with a mate mapped on a different chromosome
  • n_diffhigh: number of reads with a mate on a different chromosome having a quality greater than 5

for more information, see the spec http://samtools.sourceforge.net/SAM1.pdf

ADD COMMENTlink modified 4.4 years ago • written 6.8 years ago by Pierre Lindenbaum109k
3

I'm pretty sure 'total' is the total number of alignments (lines in the sam file), not total reads.

ADD REPLYlink written 4.7 years ago by Ketil3.9k
1

Hi,

are you sure about "n_pair_map: the read itself is mapped the mate is unmapped" ? I thought, this corresponds to the case where both ends are mapped

ADD REPLYlink written 6.5 years ago by User 166620
1

Why did you skip n_mapped? Specifically, what is the minimum mapping Q of these mapped reads?

Highest regards.

ADD REPLYlink written 7 months ago by Freek10

This is very useful, thanks Pierre!!!!

ADD REPLYlink written 6.8 years ago by Sakti350

What about duplicates? If a read aligns several places. Is this indicated in the 0 + 0 duplicates line?

ADD REPLYlink written 13 months ago by jon.brate160
2
gravatar for rocky.singh
5.1 years ago by
rocky.singh50
rocky.singh50 wrote:

I'm currently analyzing some RNAseq data for rice samples, and did alignment with BWA, removed PCR duplicates and later call SNPs with SAMtools on my paired-end files, I wanted to look at any form of program that could let me know how many reads from my dataset aligned, and how many of them properly aligned to the ref rice genome annotation. When running samtools flagstat I got this output. Is this a good alignment.

23209893 + 0 in total (QC-passed reads + QC-failed reads)

0 + 0 duplicates

23209893 + 0 mapped (100.00%:nan%)

23209893 + 0 paired in sequencing

11592951 + 0 read1

11616942 + 0 read2

22340050 + 0 properly paired (96.25%:nan%)

22636581 + 0 with itself and mate mapped

573312 + 0 singletons (2.47%:nan%)

20032 + 0 with mate mapped to a different chr

20032 + 0 with mate mapped to a different chr (mapQ>=5)

ADD COMMENTlink modified 3.7 years ago by Leonor Palmeira3.6k • written 5.1 years ago by rocky.singh50
1

please, fix your style ( bold= SHOUT) and post this as a new question.

ADD REPLYlink written 5.1 years ago by Pierre Lindenbaum109k
1
gravatar for Marvin
6.8 years ago by
Marvin800
Marvin800 wrote:

"Properly paired" means both mates of a read pair map to the same chromosome, oriented towards each other, and with a sensible insert size. The overwhelming majority of your paired-end reads should be "properly paired", your 0.34% look very wrong.

Moreover, since all you reads are "paired", but the number of "first mates" doesn't match the number of "second mates", I guess you're looking at a file where the unmapped reads have been left out. The numbers might make more sense if you could get the number of raw reads (including "unmapped") from somewhere.

ADD COMMENTlink written 6.8 years ago by Marvin800

I see, then maybe it's related to the QC processing I made, which somehow altered the number of reads left. Thanks Marvin!

ADD REPLYlink written 6.8 years ago by Sakti350
0
gravatar for brent_wilson
24 months ago by
brent_wilson70
Cofactor Genomics (St. Louis, MO)
brent_wilson70 wrote:

Sometimes the cause of massive amounts of improper pairing is due to your reads being out of order in the reads files (assuming you have separate R1 and R2 files). This can easily happen if you are filtering out reads from your files and both reads of a pair are not eliminated (ex: R2 loses read #1,000 but R1 doesn't lose a read until read #9,000).

ADD COMMENTlink written 24 months ago by brent_wilson70
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 931 users visited in the last hour