In Sam Format, Clarify The Meaning Of The "0" Flag.
1
22
Entering edit mode
11.3 years ago
tflutre ▴ 560

Hello, I have 40,637 short sequences (probes) in a fastq file named "seq.fq".

• First, I mapped them against a reference genome (hg19) using BWA ("bwa aln ...").
• Then, I converted the alignments from suffix-array coordinates into chromosomal coordinates ("bwa samse ..."), and obtained the results into a SAM file named "seq_aln.sam".
• Finally, I counted the number of occurrences for each flag:

$grep -v "@" seq_aln.sam | awk -F"\t" 'BEGIN{print "flag\toccurrences"} {a[$2]++} END{for(i in a)print i"\t"a[i]}'

flag......occurrences

4.........3083

0.........19039

16.......18515

According to this page, the "4" flag means that the short sequence doesn't map onto the reference genome, and the "16" flag means that the short sequence does map on the reverse strand of the reference genome.

But, what does the "0" flag mean? According to this forum page, it means "the read is not paired and mapped, forward strand", which is unclear to me... Does it mean "it is not paired but it maps on forward strand"? Or "it is neither paired nor maps on forward strand"? Or "it is neither paired nor maps on any strand"?

At the end, does all this mean that I can work with only 18,515 short sequences out of 40,637?

sam • 31k views
62
Entering edit mode
11.3 years ago
Jts ★ 1.4k

When the flag field is 0, it means none of the bitwise flags specified in the SAM spec (on page 4) are set. That means that your reads with flag 0 are unpaired (because the first flag, 0x1, is not set), successfully mapped to the reference (because 0x4 is not set) and mapped to the forward strand (because 0x10 is not set).

Summarizing your data, the reads with flag 4 are unmapped, the reads with flag 0 are mapped to the forward strand and the reads with flag 16 are mapped to the reverse strand.

11
Entering edit mode

2
Entering edit mode

I agree with tflutre, this should be added to the SAM documentation. I work with influenza virus, a virus with an RNA genome in the negative-sense orientation. I expected my cDNA library would contain reads mapping to the reference sequence in both sense and antisense orientation, and I was getting flags of 0 and 16. The reads flagged as 0 certainly mapped places, but it would be nice to have some good documentation explaining this. Thanks!

1
Entering edit mode

@Jts, can you comment on Devon Ryan's comment in the answer to my question? I actually realize I agree with him on it that FLAG 0 only mean it's not reverse complemented, not necessarily mapped to forward/+ strand.

A flag field of 0 may or may not mean that a read is actually mapped to the + strand. All it means is that it wasn't reverse complemented. Whether that means it should be assigned to one strand or the other or none at all is dependent on other factors and isn't encoded in BAM files. As an example, most common single-end RNAseq experiments use a stranded protocol wherein alignments with a flag of 0 come from the "-" strand.

0
Entering edit mode

@Jts has not been seen on Biostars for 4+ years, just so you know ....

0
Entering edit mode

@ genomax2, thanks for the note...