Tutorial: Sam File Format - Lesser Known Tips And Tricks
6.3 years ago by
Ya 290 wrote:

First, the definition of the Sequence Alignment/Map (SAM) format. It is TAB-delimited. Apart from the header lines, which start with the ‘@’ symbol, each alignment line consists of:

Column Fields Description

  1. QNAME Query template/pair NAME
  2. FLAG bitwise FLAG
  3. RNAME Reference sequence NAME
  4. POS 1-based leftmost POSition/coordinate of clipped sequence
  5. MAPQ MAPping Quality (Phred-scaled)
  6. CIGAR extended CIGAR string
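  7. MRNM Mate Reference sequence NaMe (‘=’ if same as RNAME); called RNEXT in current versions of the specification
  8. MPOS 1-based Mate POSition; called PNEXT in current versions of the specification
  9. TLEN inferred Template LENgth (insert size)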
  10. SEQ query SEQuence on the same strand as the reference
  11. QUAL query QUALity (ASCII-33 gives the Phred base quality)
  12. OPT variable OPTional fields in the format TAG:VTYPE:VALUE
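
As a rough illustration of the column layout above, here is a minimal Python sketch that splits one alignment line into named fields. The names `SamRecord`, `parse_sam_line` and `phred_scores` are made up for this example and do not come from any library, and the example line is hypothetical. It assumes a well-formed, non-header line.

```python
from typing import List, NamedTuple


class SamRecord(NamedTuple):
    """One alignment line; field names follow the column table (hypothetical type)."""
    qname: str
    flag: int
    rname: str
    pos: int        # 1-based leftmost position
    mapq: int
    cigar: str
    mrnm: str       # mate reference name ('=' if same as rname)
    mpos: int       # 1-based mate position
    tlen: int       # inferred template length
    seq: str
    qual: str
    opt: List[str]  # raw TAG:VTYPE:VALUE strings, left unparsed here


def parse_sam_line(line: str) -> SamRecord:
    """Split a TAB-delimited alignment line into its 11 mandatory fields plus options."""
    if line.startswith("@"):
        raise ValueError("header line, not an alignment line")
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("expected at least 11 TAB-separated fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5],
                     f[6], int(f[7]), int(f[8]), f[9], f[10], f[11:])


def phred_scores(qual: str) -> List[int]:
    """Column 11: ASCII code minus 33 gives the Phred base quality."""
    return [ord(c) - 33 for c in qual]


# Hypothetical alignment line, for illustration only.
example = ("r001\t99\tref\t7\t30\t8M2I4M1D3M\t=\t37\t39\t"
           "TTAGATAAAGGATACTG\tIIIIIIIIIIIIIIIII\tNM:i:1")
record = parse_sam_line(example)
print(record.qname, record.flag, record.pos, record.cigar)
```

For real work a dedicated library is usually a better choice than hand-rolled parsing; the sketch is only meant to make the column numbering concrete.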

Let's use this thread to add information on the SAM format that may not always be obvious or well documented.

written 6.3 years ago by Istvan Albert ♦♦ 80k
6.3 years ago by
Istvan Albert ♦♦ 80k (University Park, USA) wrote:

Common challenges

  1. If the reverse strand flag is set in column 2 (FLAG) (bit value 16, i.e. 0x10) then the sequence reported in column 10 (SEQ) will be the reverse complement of the original read!
  2. The 5' end of a read on the forward strand corresponds to column 4 (POS). To find the 5' end of a read on the reverse strand, take column 4 (POS) and add the length of the alignment on the reference, as parsed from the CIGAR string, minus one (POS itself is the first aligned base). Writing custom code to do this properly and efficiently is a non-trivial task. One convenient approach is to convert to BED format via the bamtobed command of BedTools.
  3. The CIGAR string (column 6) and the edit strings in the optional fields of column 12 (OPT) may disagree. This depends on the aligner. The alignment process often consists of two steps: a fast heuristic and an optimal alignment. The two values may each correspond to one of these steps.
  4. Color space aligners will usually produce a letter space sequence in column 10 (SEQ). Reverting that to the original color space representation is also a challenging task (I know of no tools that can do that).
  5. Learn more about the value stored in column 5, mapping quality (MAPQ), in the thread "Why there are a lot of MQ0 reads in some particular regions?"
  6. The SAM format is 1-based (like the GFF format). The BAM format is 0-based (like the BED format). Usually, when viewing and processing BAM files, we produce SAM from the BAM and the conversion is automatic. But if you read a BAM file directly, you will need to account for the difference in coordinate systems.
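
To make points 2 and 6 concrete, here is a minimal Python sketch of the coordinate arithmetic. The function names are hypothetical, not part of samtools or BedTools. It assumes a mapped read with a valid CIGAR string and counts the operations M, D, N, = and X as consuming reference bases.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

# Operations that advance along the reference sequence.
REF_CONSUMING = set("MDN=X")

REVERSE_STRAND = 0x10  # FLAG bit 16: read aligned to the reverse strand


def reference_length(cigar: str) -> int:
    """Number of reference bases spanned by the alignment."""
    return sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in REF_CONSUMING)


def five_prime_end(pos: int, flag: int, cigar: str) -> int:
    """1-based reference coordinate of the read's 5' end (clipped bases ignored)."""
    if flag & REVERSE_STRAND:
        # POS is the leftmost aligned base, so the rightmost one is POS + length - 1.
        return pos + reference_length(cigar) - 1
    return pos


def to_zero_based_interval(pos: int, cigar: str):
    """Convert a 1-based SAM POS to a 0-based, half-open (start, end) pair as in BED."""
    start = pos - 1
    return start, start + reference_length(cigar)


# 8M2I4M1D3M spans 8 + 4 + 1 + 3 = 16 reference bases.
print(reference_length("8M2I4M1D3M"))           # 16
print(five_prime_end(7, 0, "8M2I4M1D3M"))       # 7
print(five_prime_end(7, 16, "8M2I4M1D3M"))      # 22
print(to_zero_based_interval(7, "8M2I4M1D3M"))  # (6, 22)
```

Unmapped reads and records with a CIGAR of `*` are not handled here, which is one reason converting with bamtobed is often the safer route.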

Mathematical Model

See the mathematical models used in Samtools: http://www.broadinstitute.org/gatk/media/docs/Samtools.pdf

6.3 years ago by
Ying W 3.9k (South San Francisco, CA) wrote:

This is a handy website that explains what the SAM flags mean (it converts numbers into flags and vice versa).
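
The same conversion can be sketched in a few lines of Python. The bit values are those defined for the FLAG column in the SAM specification; the short labels and the function names `decode_flag` and `encode_flag` are my own and only illustrative.

```python
# (bit value, label) pairs for the FLAG column; the labels are informal.
FLAG_BITS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse"),
    (0x20, "mate_reverse"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]


def decode_flag(flag: int) -> list:
    """Return the labels of all bits set in a numeric FLAG value."""
    return [name for bit, name in FLAG_BITS if flag & bit]


def encode_flag(names) -> int:
    """Combine labels back into a numeric FLAG value."""
    lookup = {name: bit for bit, name in FLAG_BITS}
    flag = 0
    for name in names:
        flag |= lookup[name]  # raises KeyError for an unknown label
    return flag


print(decode_flag(99))   # ['paired', 'proper_pair', 'mate_reverse', 'first_in_pair']
print(decode_flag(16))   # ['reverse']
print(encode_flag(["paired", "proper_pair", "reverse", "second_in_pair"]))  # 147
```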

An unofficial fork by the main developer has some changes that have not yet been integrated into samtools.
