Question: Paired-End Sam Flag Mismatch During Filtering
1
gravatar for Andreas
6.8 years ago by
Andreas2.4k
Singapore
Andreas2.4k wrote:

Hi BioStar users,

while removing human contamination from paired-end metagenomic Illumina-shotgun data I ran into a strange problem. In a first step I mapped the paired-end data with BWA (sampe) against hg19. Then I only wanted to keep read that didn't map to hg19 and whose mates also didn't map, i.e. sam flags 0x4 and 0x8 both have to be set for the two reads in a pair to be considered clean (I'm not interested in single unmapped reads)

If I filter accordingly (no matter how) I end up with inconsistent mate-pair information / odd number of reads and here is why: there is at least one hit against the very end of chrM for a pair of reads, such that one read would span the end of chrM (also, both are strangely mapping to the same position). So far, so good: I've seen this before when mapping against circular or small chromosomes. BWA somehow deals with this and sets the alignment flag to "unmapped", but keeps the alignment coordinates in the SAM output.

Now, the first read gets flag 77 (read paired, read unmapped, mate unmapped, first in pair), the other flag 133 (read paired, read unmapped, second in pair), i.e. the mate unmapped flag is not set in the second read. The trouble starts when you only keep reads if they and their corresponding mate are unmapped, because all of a sudden you miss one read / have one too many.

Here the offending mate-pair (both unmapped, but one "mate unmapped" missing):

SRR341644.4631996    77    chrM    16500    37    75M    =    16500    0    CATCTGGTTCCTACTTCAGGGCCATAAAGCCTAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGGG    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCBCCCCCCCCCCCCCCCCCCCCCCCDBCCC    XT:A:U    NM:i:2    SM:i:37    AM:i:0    X0:i:1    X1:i:0    XM:i:2    XO:i:0    XG:i:0    MD:Z:73A0A0
SRR341644.4631996    133    chrM    16500    0    *    =    16500    0    TGATAGACCTGTGATCCATCGTGATGTCTTATTTAAGGGGAACGTGTGGGCTATTTAGGCTTTATGGCCCTGA    CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCDCCCCCCCCCBCCC>CCCCCCCCCCCCCC

Why is this flag mismatch happening in the first place? To be more precise: while un-mapping the previously mapped read pair, the flags in only one of them are set properly. Both should set have 'mate unmapped' set. Why does this happen? Is this a bug?

Andreas

next-gen samtools sam filtering • 5.3k views
ADD COMMENTlink modified 3.5 years ago by andrewmaltezthomas0 • written 6.8 years ago by Andreas2.4k
1
gravatar for swbarnes2
6.8 years ago by
swbarnes26.7k
United States
swbarnes26.7k wrote:

I assume you do not want the read that hangs off the end of chrM?

Rather than using flags, why don't you use awk to filter out reads that have entries in either the mapped chromosome column, or the mate mapped chromosome column?

ADD COMMENTlink written 6.8 years ago by swbarnes26.7k

That's what I do now. Technically this shouldn't be necessary though if the flags were correct, i.e. if they would indicate that both reads and their respective mates are unmapped.

ADD REPLYlink written 6.8 years ago by Andreas2.4k
1
gravatar for Nathan S. Watson-Haigh
6.8 years ago by
Adelaide
Nathan S. Watson-Haigh190 wrote:

Considering the circular nature of this reference DNA may be confusing the situation? It appears, to me at least, that the issue is in an incorrect flag for the first read.

I would say that the flag of the first alignment (77) is wrong and should be (73) i.e. read is mapped. This is clear from the rest of the alignment info i.e. all the read maps perfectly to the reference: cigar string of 75M and a mapping quality of 37.

For circular DNA I might expect reads that spanned across the "join" of the two ends of the linearised reference sequence to be potentially troublesome to get an alignment. Depending on the aligner and parameters used, you might get two partial alignments to opposite ends of the reference for one of the reads in a pair while its mate has a single alignment across its full length at some distance from the ends of the reference. The read that spans the join may not have both its alignment found/reported i.e. it may have a partial alignment to just one end of the reference or the aligner may not have found any alignment.

So my question would be, what fudging is BWA actually trying to do/achieve by setting that read's flag as unmapped when it clearly is mapped? This may also be a bug in BWA. Maybe this is the reason behind the existence of "samtools fixmate" and why I'd probably run it on the output of BWA.

ADD COMMENTlink written 6.8 years ago by Nathan S. Watson-Haigh190

See SeqAnswers for a thread, dated early 2010, that seems related: http://seqanswers.com/forums/showthread.php?t=5347

In particular, comment #10 provides an example that seems to show similar issues to yours with with the flags. A comment (#13) by Heng states: "Sometimes bwa flags read1 as unmapped, but for read2, the mate-unmapped flag is not set. This is a bug." And then later (comment #16): "It is the mate-unmap bit that can be mislabeled. The unmap bit is always set correctly."

i.e. you can guarantee that if the 0x4 bit is set it's correct, but the 0x8 bit may not be correct!

Check if you're running the latest version of BWA and see if that fixes the issue?

ADD REPLYlink modified 6.8 years ago • written 6.8 years ago by Nathan S. Watson-Haigh190

I was already using 0.6.2-r126 which is the most recent version.

Thanks for digging ever deeper. Nice work.

ADD REPLYlink written 6.8 years ago by Andreas2.4k
0
gravatar for Nathan S. Watson-Haigh
6.8 years ago by
Adelaide
Nathan S. Watson-Haigh190 wrote:

Have you tied using samtools fixmate?

An except from the samtools website (http://samtools.sourceforge.net/samtools.shtml).

fixmate     samtools fixmate <in.nameSrt.bam> <out.bam>
            Fill in mate coordinates, ISIZE and mate related flags from a name-sorted alignment.
ADD COMMENTlink written 6.8 years ago by Nathan S. Watson-Haigh190
1

That should certainly be one way to fix the information. Still, why does it happen in the first place?

ADD REPLYlink written 6.8 years ago by Andreas2.4k

I suspect it's a combination of the way BWA is calculating the flags and the interpretation of (partial)alignments at the ends of chrM due to its circular nature but linear representation.

ADD REPLYlink written 6.8 years ago by Nathan S. Watson-Haigh190

Yes, that explains the "un-mapping" of the "mapped" read and the update of its mapping info. But the mate-pair info in the mate is not getting updated properly. It's only updated in one of the two. Almost looks like a bug to me.

ADD REPLYlink written 6.8 years ago by Andreas2.4k

Can you supply an example of the alignment lines, as generated by BWA, of such a pair?

ADD REPLYlink written 6.8 years ago by Nathan S. Watson-Haigh190

Added (see above). These come straight from BWA and only occur once in the BAM file. Note that, both are unaligned (as per flag) but only one is marked with "mate unmapped"

ADD REPLYlink written 6.8 years ago by Andreas2.4k
0
gravatar for andrewmaltezthomas
3.5 years ago by
andrewmaltezthomas0 wrote:

While dealing with TCGA WGS bam files I've come across the same problem with paired-end sequencing data.

If I use "samtools view -f 13 in.bam > out.bam" and then use "bam2fastx" there is one pair where only one of the reads is caught by this flag, the other isnt. This causes an error. The flags:

(Flag 69) - Not caught - R1 - read paired, read unmapped, first in pair

(Flag 141) - Caught - R2 - read paired, read unmapped, mate unmapped, second in pair

Alignments:

HWI-ST1222:5:1103:3409:129634#0 69 MT 16522 0 * = 16522 0 AGAGATGTGTTTAAGTGCTGTGGCCAGAAGCGGGGGTAGGGGGGGGTTTGG abbeeeecgggggiifghhdhhiiihgffdgghiigX]bccc_VaWOV^a_ RG:Z:120805_SN1222_0143_BD172BACXX_5 SM:Z:TCGA-BR-4188-11A-01D-1128

HWI-ST1222:5:1103:3409:129634#0 141 MT 16522 37 51M = 16522 0 TAAAGCATAAATAGCCCACACGTTCCCCTTAAATAAGACATCACGATGGAT _bbeeeeegggggihihiiiiihdhiiiiiiiiiiihifiidefcghihh] XT:A:U NM:i:2 SM:i:37 AM:i:0 X0:i:1 X1:i:0 XM:i:2 XO:i:0 XG:i:0 MD:Z:6C41C2 RG:Z:120805_SN1222_0143_BD172BACXX_5 SM:Z:TCGA-BR-4188-11A-01D-1128

I'm trying "samtools fixmate", but as these bam files are very large I'm afraid this will not be feasible. I'll try the awk approach next.

ADD COMMENTlink written 3.5 years ago by andrewmaltezthomas0
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1948 users visited in the last hour