Interpreting Flagstat Output
2
2
Entering edit mode
8.2 years ago
roll ▴ 330

Hi,

This question has probably been asked many times but it is still not very clear to me from what I read online.

For example I have two different outputs from samtools flagstat. Comparing these two samples, which one is better or worse? or what makes it good or bad? what i should look for in the flagstat output?

First One

137147 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
137147 + 0 mapped (100.00%:nan%)
137147 + 0 paired in sequencing
44296 + 0 properly paired (32.30%:nan%)
57734 + 0 with itself and mate mapped
79413 + 0 singletons (57.90%:nan%)
14598 + 0 with mate mapped to a different chr
668 + 0 with mate mapped to a different chr (mapQ>=5)


Second one:

120672 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
120672 + 0 mapped (100.00%:nan%)
120672 + 0 paired in sequencing
38796 + 0 properly paired (32.15%:nan%)
47462 + 0 with itself and mate mapped
73210 + 0 singletons (60.67%:nan%)
10266 + 0 with mate mapped to a different chr
508 + 0 with mate mapped to a different chr (mapQ>=5)


Furthermore, tophat alignment summary.txt file made me even more confusing. It sounds like the quality of the mapping is very bad?

Left reads:
Input:   3118050
Mapped:     24264 ( 0.8% of input)
of these:     12910 (53.2%) have multiple alignments (1 have >1000)
Input:   3118050
Mapped:     23492 ( 0.8% of input)
of these:     13094 (55.7%) have multiple alignments (2 have >1000)

Aligned pairs:      9112
of these:      4163 (45.7%) have multiple alignments
and:       606 ( 6.7%) are discordant alignments
0.3% concordant pair alignment rate.

samtools qualitycontrol tophat2 alignment • 6.5k views
0
Entering edit mode

Those are both pretty bad, assuming you're not expecting a lot of rearrangements. Did you not rRNA deplete the samples? Also, did you trim these prior to alignment?

0
Entering edit mode

Only trimming I did was to remove calls whose Phred score is lower than 20. I also allowed the aligner to find the multi-hits as i am mainly interested in finding repeats.

And how did you decide that this is bad? Is it because of the properly paired rate is too low?

What can i do to improve this?

0
Entering edit mode

Since you mentioned elsewhere using tophat2, I assumed you were looking for differentially expressed genes (the alignment in that case should produce fewer multimappers). If you're looking for repeats, then those are probably much better results (the two files from which you pasted output from flagstat are largely equivalent).

0
Entering edit mode

If by allowing the aligner to find multi-hits you mean by using the -g option, I am afraid that, at least with some TopHat2 versions, you are not going to change your "overall read alignment rate" in align_summary file. If you want to get more new alignments maybe you should try with the bowtie or fix_mapping_ordering parameters (minscore, read_edit_distance, and so on)

0
Entering edit mode
8.2 years ago

From the flagstat files I would say both mappings are quite similar, regarding the percentage of properly mapped pairs and singletons.

The align_summary file shows very low mapping rate. If its good or bad it depends on your organisms query and reference, nature of the experiment, ... but I would say, in general terms, that is a very poor mapping result.

0
Entering edit mode
8.2 years ago

Are you sure that you are using the right reference sequence? Or the right software? Only 1% of your reads aligned. Your library is fundamentally messed up, or one of your assumptions about how to handle this data is incorrect, you need to check all that.

Have you looked at the sequences of your unmapped reads, to see what they are?

You can use this bit of code to generate a list of each unmapped sequence in your indexed .bam, and how often it turned up. The list will be sorted so that the most abundant unmapped read is at the top. See if something like adapter is eating up all your reads. This takes some time to run, so be patient.

samtools view -f 4 mybam.bam | cut -f 10 | sort | uniq -c | sort -nr > unmapped_reads.txt

0
Entering edit mode

I found adapters using fastqc and I removed adapters and the mapping improved it a lot. Now i am getting %49 mapped. but still is low i think. I ran your code it gives me similar results as in fastqc's overrepresented sequences but I don't know how to use this information. Any suggestions?