Question

Interpreting Flagstat Output

2

Entering edit mode

11.8 years ago

roll ▴ 350

Hi,

This question has probably been asked many times but it is still not very clear to me from what I read online.

For example I have two different outputs from samtools flagstat. Comparing these two samples, which one is better or worse? or what makes it good or bad? what i should look for in the flagstat output?

First One

137147 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
137147 + 0 mapped (100.00%:nan%)
137147 + 0 paired in sequencing
66552 + 0 read1
70595 + 0 read2
44296 + 0 properly paired (32.30%:nan%)
57734 + 0 with itself and mate mapped
79413 + 0 singletons (57.90%:nan%)
14598 + 0 with mate mapped to a different chr
668 + 0 with mate mapped to a different chr (mapQ>=5)

Second one:

120672 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
120672 + 0 mapped (100.00%:nan%)
120672 + 0 paired in sequencing
59601 + 0 read1
61071 + 0 read2
38796 + 0 properly paired (32.15%:nan%)
47462 + 0 with itself and mate mapped
73210 + 0 singletons (60.67%:nan%)
10266 + 0 with mate mapped to a different chr
508 + 0 with mate mapped to a different chr (mapQ>=5)

Furthermore, tophat alignment summary.txt file made me even more confusing. It sounds like the quality of the mapping is very bad?

Left reads:
           Input:   3118050
          Mapped:     24264 ( 0.8% of input)
        of these:     12910 (53.2%) have multiple alignments (1 have >1000)
Right reads:
           Input:   3118050
          Mapped:     23492 ( 0.8% of input)
        of these:     13094 (55.7%) have multiple alignments (2 have >1000)
 0.8% overall read alignment rate.

Aligned pairs:      9112
 of these:      4163 (45.7%) have multiple alignments
      and:       606 ( 6.7%) are discordant alignments
 0.3% concordant pair alignment rate.

samtools qualitycontrol tophat2 alignment • 9.1k views

ADD COMMENT • link updated 11.8 years ago by swbarnes2 15k • written 11.8 years ago by roll ▴ 350

0

Entering edit mode

Those are both pretty bad, assuming you're not expecting a lot of rearrangements. Did you not rRNA deplete the samples? Also, did you trim these prior to alignment?

ADD REPLY • link 11.8 years ago by Devon Ryan 105k

0

Entering edit mode

Only trimming I did was to remove calls whose Phred score is lower than 20. I also allowed the aligner to find the multi-hits as i am mainly interested in finding repeats.

And how did you decide that this is bad? Is it because of the properly paired rate is too low?

What can i do to improve this?

ADD REPLY • link 11.8 years ago by roll ▴ 350

0

Entering edit mode

Since you mentioned elsewhere using tophat2, I assumed you were looking for differentially expressed genes (the alignment in that case should produce fewer multimappers). If you're looking for repeats, then those are probably much better results (the two files from which you pasted output from flagstat are largely equivalent).

ADD REPLY • link 11.8 years ago by Devon Ryan 105k

0

Entering edit mode

If by allowing the aligner to find multi-hits you mean by using the -g option, I am afraid that, at least with some TopHat2 versions, you are not going to change your "overall read alignment rate" in align_summary file. If you want to get more new alignments maybe you should try with the bowtie or fix_mapping_ordering parameters (minscore, read_edit_distance, and so on)

ADD REPLY • link 11.8 years ago by cpcantalapiedra ▴ 170

score 0 · Answer 1 · 2013-10-02

From the flagstat files I would say both mappings are quite similar, regarding the percentage of properly mapped pairs and singletons.

The align_summary file shows very low mapping rate. If its good or bad it depends on your organisms query and reference, nature of the experiment, ... but I would say, in general terms, that is a very poor mapping result.

score 0 · Answer 2 · 2013-10-03

Are you sure that you are using the right reference sequence? Or the right software? Only 1% of your reads aligned. Your library is fundamentally messed up, or one of your assumptions about how to handle this data is incorrect, you need to check all that.

Have you looked at the sequences of your unmapped reads, to see what they are?

You can use this bit of code to generate a list of each unmapped sequence in your indexed .bam, and how often it turned up. The list will be sorted so that the most abundant unmapped read is at the top. See if something like adapter is eating up all your reads. This takes some time to run, so be patient.

samtools view -f 4 mybam.bam | cut -f 10 | sort | uniq -c | sort -nr > unmapped_reads.txt