Question: Interpreting Flagstat Output
2
gravatar for roll
5.2 years ago by
roll240
United Kingdom
roll240 wrote:

Hi,

This question has probably been asked many times but it is still not very clear to me from what I read online.

For example I have two different outputs from samtools flagstat. Comparing these two samples, which one is better or worse? or what makes it good or bad? what i should look for in the flagstat output?

First One

137147 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
137147 + 0 mapped (100.00%:nan%)
137147 + 0 paired in sequencing
66552 + 0 read1
70595 + 0 read2
44296 + 0 properly paired (32.30%:nan%)
57734 + 0 with itself and mate mapped
79413 + 0 singletons (57.90%:nan%)
14598 + 0 with mate mapped to a different chr
668 + 0 with mate mapped to a different chr (mapQ>=5)

Second one:

120672 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
120672 + 0 mapped (100.00%:nan%)
120672 + 0 paired in sequencing
59601 + 0 read1
61071 + 0 read2
38796 + 0 properly paired (32.15%:nan%)
47462 + 0 with itself and mate mapped
73210 + 0 singletons (60.67%:nan%)
10266 + 0 with mate mapped to a different chr
508 + 0 with mate mapped to a different chr (mapQ>=5)

Furthermore, tophat alignment summary.txt file made me even more confusing. It sounds like the quality of the mapping is very bad?

Left reads:
           Input:   3118050
          Mapped:     24264 ( 0.8% of input)
        of these:     12910 (53.2%) have multiple alignments (1 have >1000)
Right reads:
           Input:   3118050
          Mapped:     23492 ( 0.8% of input)
        of these:     13094 (55.7%) have multiple alignments (2 have >1000)
 0.8% overall read alignment rate.

Aligned pairs:      9112
 of these:      4163 (45.7%) have multiple alignments
      and:       606 ( 6.7%) are discordant alignments
 0.3% concordant pair alignment rate.
ADD COMMENTlink modified 5.2 years ago by swbarnes24.6k • written 5.2 years ago by roll240

Those are both pretty bad, assuming you're not expecting a lot of rearrangements. Did you not rRNA deplete the samples? Also, did you trim these prior to alignment?

ADD REPLYlink written 5.2 years ago by Devon Ryan86k

Only trimming I did was to remove calls whose Phred score is lower than 20. I also allowed the aligner to find the multi-hits as i am mainly interested in finding repeats.

And how did you decide that this is bad? Is it because of the properly paired rate is too low?

What can i do to improve this?

ADD REPLYlink modified 5.2 years ago • written 5.2 years ago by roll240

Since you mentioned elsewhere using tophat2, I assumed you were looking for differentially expressed genes (the alignment in that case should produce fewer multimappers). If you're looking for repeats, then those are probably much better results (the two files from which you pasted output from flagstat are largely equivalent).

ADD REPLYlink written 5.2 years ago by Devon Ryan86k

If by allowing the aligner to find multi-hits you mean by using the -g option, I am afraid that, at least with some TopHat2 versions, you are not going to change your "overall read alignment rate" in align_summary file. If you want to get more new alignments maybe you should try with the bowtie or fix_mapping_ordering parameters (minscore, read_edit_distance, and so on)

ADD REPLYlink written 5.2 years ago by cpcantalapiedra140
0
gravatar for cpcantalapiedra
5.2 years ago by
Spain
cpcantalapiedra140 wrote:

From the flagstat files I would say both mappings are quite similar, regarding the percentage of properly mapped pairs and singletons.

The align_summary file shows very low mapping rate. If its good or bad it depends on your organisms query and reference, nature of the experiment, ... but I would say, in general terms, that is a very poor mapping result.

ADD COMMENTlink written 5.2 years ago by cpcantalapiedra140
0
gravatar for swbarnes2
5.2 years ago by
swbarnes24.6k
United States
swbarnes24.6k wrote:

Are you sure that you are using the right reference sequence? Or the right software? Only 1% of your reads aligned. Your library is fundamentally messed up, or one of your assumptions about how to handle this data is incorrect, you need to check all that.

Have you looked at the sequences of your unmapped reads, to see what they are?

You can use this bit of code to generate a list of each unmapped sequence in your indexed .bam, and how often it turned up. The list will be sorted so that the most abundant unmapped read is at the top. See if something like adapter is eating up all your reads. This takes some time to run, so be patient.

samtools view -f 4 mybam.bam | cut -f 10 | sort | uniq -c | sort -nr > unmapped_reads.txt
ADD COMMENTlink written 5.2 years ago by swbarnes24.6k

I found adapters using fastqc and I removed adapters and the mapping improved it a lot. Now i am getting %49 mapped. but still is low i think. I ran your code it gives me similar results as in fastqc's overrepresented sequences but I don't know how to use this information. Any suggestions?

ADD REPLYlink written 5.2 years ago by roll240
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1429 users visited in the last hour