SAM flag and select reads that map uniquely
1
2
Entering edit mode
6.2 years ago
epigene ▴ 550

As I understand it, FLAG 0 means a read is mapped to the forward strand and FLAG 16 for reverse strand.

If I have single-end reads and I want to get all the uniquely mapped reads. Could I just do something like this:

samtools view bamfile.bam | awk '$2 == 0 ||$2 == 16'


Does FLAG 0 and 16 or any other FLAG values imply uniqueness of mapping?

Can I use samtools only with -F or -f to get alignment records with FLAG of either 0 or 16?

samtools • 6.7k views
0
Entering edit mode

I realized that I couldn't get uniquely mapped reads from samtools view bamfile.bam | awk '$2 == 0 ||$2 == 16'. , what the command does is to get primary alignments, which contain uniquely mapped and primary alignments from multi-mapping reads.

5
Entering edit mode
6.2 years ago

There is no flag for "mapped to the forward strand". Rather, if bit 16 is set, then then the read is mapped as a reverse complement (thus, if it's unset, then it's not mapped as a reverse complement). Whether "mapped as a reverse complement" can be taken to mean "maps to the minus strand" will be specific to how the library was prepared.

BAM/SAM/CRAM files have no concept of uniqueness, in part because there are multiple ways of defining "unique". To make a long story short, forget the idea of "unique mapping" and adopt the idea of "likely correct mapping". You can then define a threshold for how likely the alignment should be and filter by MAPQ accordingly (note that there's not a great relationship between the MAPQ scores in BAM file and the mathematical definition of what that should represent).

Yes, you can use -F 16 or -f 16 to get alignments with bit 16 unset or set, respectively.

0
Entering edit mode

isn't 0 the flag for "mapped to the forward strand". I got it from this post: In Sam Format, Clarify The Meaning Of The "0" Flag.

About uniqueness, thanks for the clarification!

0
Entering edit mode

A flag field of 0 may or may not mean that a read is actually mapped to the + strand. All it means is that it wasn't reverse complemented. Whether that means it should be assigned to one strand or the other or none at all is dependent on other factors and isn't encoded in BAM files. As an example, most common single-end RNAseq experiments use a stranded protocol wherein alignments with a flag of 0 come from the "-" strand.

0
Entering edit mode

Hi Devon, I've spend a lot of time this week trying to REALLY understand this strand, flag, mapping uniqueness issue. I think I'm finally getting my head around it and I have to say I disagree with your first point here on flag 0 meaning. I will outline my thoughts below and please feel free to comment on them. Thanks so much for all the discussions so far.

I think a flag field of 0 does mean a read is mapped to the +/forward strand. And flag field indicates where the read itself maps to, between + and - strand of the reference genome.

For dUTP-based stranded RNA-seq experiments (Illumina TruSeq stranded), alignments with flag of 0 does mean the transcript/gene is on the -/reverse strand as you said, but at the same time, the read itself indeed maps to the +/forward strand, which is the antisense strand for a gene on the -/reverse strand. The "-" strand you mentioned is on the XS:A:- field (in Tophat or hisat2 output) and it's for which strand a transcript/gene (from which the read is derived) is on. It means the read is derived from a transcript/gene on the - strand (although it still maps to the + strand of the genome).

Lastly, flag value doesn't tell if a read maps uniquely or not. I need to look for NH:i: field to determine if a read is uniquely mapped.

samtools view bamfile.bam | awk '2==0' will output all the primary alignments mapped to the + strand samtools view bamfile.bam | awk '2==16' will output all the primary alignments mapped to the - strand

all the primary alignments contain uniquely mapped and primary alignments from multi-mapping reads.

uniquely mapped in the above sentence is determined by the aligner and I'm not sure what MAPQ value they have yet.

1
Entering edit mode

You need to differentiate between whether a read aligns to a particular strand from whether it came from a fragment arising from that strand (i.e., whether we map it to a particular strand). At the end of the day we really don't care what strand something aligns against, we only want to know where fragments are.

Regarding uniquely mapped reads, it's best practice to forget the term "unique" and instead filter by some meaningful MAPQ (~5 is usually reasonable). We really need to add a sticky FAQ at the top of the front page about "unique alignments", since they're brought up every month it seems.

0
Entering edit mode

I think I did here in the end of the 3rd paragraph:

It means the read is derived from a transcript/gene on the - strand (although it still maps to the + strand of the genome).

I agree that for this particular RNA-seq strand thing, we care about where the transcript is located.

0
Entering edit mode

Hi Devon, great discussion. I merged R1 and R2 and then aligned them to the reference. Now, I see that some of my reads have flag 16 and some have 0. Just trying to understand what is happening here in my case.