"Secondary", "Supplementary", "Duplicates" and "Paired in sequencing" in samtools flagstat
2.1 years ago
pasha64t • 0

Could someone please explain what secondary, supplementary, duplicates and paired in sequencing mean in samtools flagstat output?

Example:

194492 + 0 in total (QC-passed reads + QC-failed reads)
80 + 0 secondary
0 + 0 supplementary
0 + 0 duplicates
193804 + 0 mapped (99.65% : N/A)
194412 + 0 paired in sequencing
97206 + 0 read1
97206 + 0 read2
190812 + 0 properly paired (98.15% : N/A)
193108 + 0 with itself and mate mapped
616 + 0 singletons (0.32% : N/A)
0 + 0 with mate mapped to a different chr
0 + 0 with mate mapped to a different chr (mapQ>=5)
samtools • 2.2k views

Note that samtools flagstat only reports which flags are set. If the software you used to align never applies the duplicate flag, then it will never be set, even if your sample has duplicates.


See also the flagstat man page, which describes each of these in terms of the FLAG bits that categorise it.
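
As a rough illustration of that mapping, here is a minimal Python sketch of which flagstat rows a single FLAG value would count towards. It reflects my reading of the man page and is not the samtools implementation; the function name `flagstat_rows` is made up for this example, and it ignores the QC-passed/QC-failed split and the "different chr" rows, which need more than the FLAG field.

```python
def flagstat_rows(flag):
    """Return the flagstat rows one SAM FLAG value would count towards.

    Hypothetical helper: a sketch of the bit tests, not samtools itself.
    """
    rows = []
    unmapped = bool(flag & 0x4)
    mate_unmapped = bool(flag & 0x8)

    if flag & 0x100:
        rows.append("secondary")
    elif flag & 0x800:
        rows.append("supplementary")
    elif flag & 0x1:
        # Pair-related rows are assumed to be counted on primary records only.
        rows.append("paired in sequencing")
        if flag & 0x40:
            rows.append("read1")
        if flag & 0x80:
            rows.append("read2")
        if (flag & 0x2) and not unmapped:
            rows.append("properly paired")
        if not unmapped and not mate_unmapped:
            rows.append("with itself and mate mapped")
        if not unmapped and mate_unmapped:
            rows.append("singletons")

    if not unmapped:
        rows.append("mapped")
    if flag & 0x400:
        rows.append("duplicates")
    return rows


print(flagstat_rows(99))   # paired, first in pair, proper pair, mate on reverse strand
print(flagstat_rows(73))   # paired, first in pair, mate unmapped
print(flagstat_rows(256))  # secondary alignment
```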

2.1 years ago

I will enumerate these mostly for future reference, since it is not so easy to find this information in a readable form.

As it turns out, it is also not so easy to write these out in simpler language. I might actually be wrong on some points, so I welcome corrections.

The fault lies with the flag concept itself, which is confusing, non-intuitive and needlessly complicated:

  1. paired means that at the time of the alignment both reads of the pair were present, and the aligner presumably assessed both when finding the most likely location (this is what makes mate rescue possible). The flag does not necessarily mean that the mate is also present in the BAM file, since the BAM file may have been post-processed.
  2. proper pair is a flag with no precise requirement. The aligner sets it at the discretion of its designer. Usually, it means that both reads align, the two reads have the expected relative orientation, and they lie within a certain distance of each other.
  3. duplicate means that the read or template has been identified as non-unique. It states that the alignment file contains at least one more read or template that was judged identical to it. Typically, separate software needs to be run to detect and mark duplicates, and that process may compare single reads or read pairs (templates). Duplication may be decided by sequence identity or by alignment identity.
  4. secondary alignments represent multiple alignments of a read. Usually, only those secondary alignments are reported that are not overlapping and do not cover the entire read. A read that fully matches with identical scores in multiple locations typically may not have all secondary alignments listed. Instead, the mapping quality will be zero and the alternative locations may be listed in an aligner-specific tag such as BWA's XA (the SA tag, by contrast, lists the other parts of a chimeric alignment). Secondary alignments usually represent partial alignments of a read in different locations of the genome. There is a lot of leeway in how aligners report alternative alignments.
  5. supplementary alignments are the additional parts of what are called "chimeric" alignments. These are alignments that together cover the entire read but do not follow one another consecutively in a linear fashion. Only a subset of aligners can detect chimeric alignments.
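
Each of the five items above corresponds to one bit of the FLAG column. A small Python sketch for turning a FLAG number into readable names follows. The bit values are those listed in the SAM specification; the short names and the `decode_flag` helper are my own labels, not anything samtools defines.

```python
# SAM FLAG bits as listed in the SAM specification.
# The short names are informal labels chosen for this example.
FLAG_BITS = {
    0x1: "paired",
    0x2: "proper_pair",
    0x4: "unmapped",
    0x8: "mate_unmapped",
    0x10: "reverse",
    0x20: "mate_reverse",
    0x40: "read1",
    0x80: "read2",
    0x100: "secondary",
    0x200: "qc_fail",
    0x400: "duplicate",
    0x800: "supplementary",
}


def decode_flag(flag):
    """Return the names of all bits that are set in a SAM FLAG value."""
    return [name for bit, name in FLAG_BITS.items() if flag & bit]


print(decode_flag(99))    # ['paired', 'proper_pair', 'mate_reverse', 'read1']
print(decode_flag(1171))  # ['paired', 'proper_pair', 'reverse', 'read2', 'duplicate']
print(decode_flag(2048))  # ['supplementary']
```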

Take for example the read AAATTTGGGCCC, which produces two alignments, at positions 10000 and 20000.

When the alignments are non-overlapping, one alignment will be marked as secondary:

   10000   AATT   primary      
   20000   GCCC   secondary

If the alignments are non-linear and the aligned regions can be joined to cover the entire read, then the second alignment would be reported as supplementary, like so:

   10000 GGGCCC  primary
   20000 AAATTT  supplementary

Annoyingly, there is no "primary" flag; I am just labelling it that way for clarity. An alignment is primary if it is neither secondary nor supplementary.
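
Since "primary" has to be derived, here is a minimal sketch of how one might tally the three kinds from SAM text. The records are invented for illustration (they are not from the question's data), and `alignment_kind` and `tally` are hypothetical helper names.

```python
SECONDARY = 0x100
SUPPLEMENTARY = 0x800


def alignment_kind(flag):
    """Classify a FLAG value; primary is simply 'neither of the other two'."""
    if flag & SECONDARY:
        return "secondary"
    if flag & SUPPLEMENTARY:
        return "supplementary"
    return "primary"


def tally(sam_lines):
    """Count primary/secondary/supplementary records in SAM-formatted lines."""
    counts = {"primary": 0, "secondary": 0, "supplementary": 0}
    for line in sam_lines:
        if line.startswith("@"):  # skip header lines
            continue
        flag = int(line.split("\t")[1])  # FLAG is the second column
        counts[alignment_kind(flag)] += 1
    return counts


# Made-up records, shown only to exercise the functions.
example = [
    "@HD\tVN:1.6",
    "readA\t0\tchr1\t10000\t60\t6S6M\t*\t0\t0\tAAATTTGGGCCC\t*",
    "readA\t2048\tchr1\t20000\t60\t6M6H\t*\t0\t0\tAAATTT\t*",
    "readB\t16\tchr1\t30000\t0\t12M\t*\t0\t0\tAAATTTGGGCCC\t*",
    "readB\t272\tchr1\t40000\t0\t12M\t*\t0\t0\t*\t*",
]
print(tally(example))  # {'primary': 2, 'secondary': 1, 'supplementary': 1}
```

The numbers in the question seem consistent with this definition: 194492 total minus 80 secondary minus 0 supplementary gives 194412, which is exactly the "paired in sequencing" count, as expected if the pair-related rows are counted on primary records only.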

Now what happens if we have this:

   10000 AAATTT     
   20000 GGGCCC  

The result may be that the second alignment is marked as secondary, or that only a single alignment is reported, as a spliced alignment:

    AAATTTNNN...NNNGGGCCC

The SAM specification can be read at:

https://samtools.github.io/hts-specs/SAMv1.pdf

SAM tags:

https://samtools.github.io/hts-specs/SAMtags.pdf
