I have a set of Bowtie2 alignments to the transcriptome. I'm trying to make sense of the BAM file, but I'm finding some very strange results. For example there are a set of entries for a read (this is with paired-end input):
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC 81 ENSDART00000085701 2597 0 4M1I5M1D115M = 2610 0 GTATAAGAGACAGGCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGAA GGGGGGGGG@GEGGE@CGGGGGGGGGGGGGGGGGGFF>GGGGEGGGGGGGGECGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGFEGFEGEGGGGGGGGGGFGFGGGGGGGGGGGGGGFFBB><3 AS:i:-54XS:i:-75 XN:i:0 XM:i:8 XO:i:2 XG:i:2 NM:i:10 MD:Z:0A3T2C1^T0G1T38T61T10T0 YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC 337 ENSDART00000145155 133 0 7M4I101M5D13M ENSDART00000085701 2610 0 GTATAAGAGACAGGCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGAA GGGGGGGGG@GEGGE@CGGGGGGGGGGGGGGGGGGFF>GGGGEGGGGGGGGECGGGGGGGGGGEGGGGGGGGGGGGGGGGFGGGFEGFEGEGGGGGGGGGGFGFGGGGGGGGGGGGGGFFBB><3 AS:i:-75 XS:i:-75 XN:i:0 XM:i:8 XO:i:2 XG:i:9 NM:i:17 MD:Z:0A1C0C1T2T38T60^AAGAA5C6T0 YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC 161 ENSDART00000085701 2610 3 125M = 2597 0 GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@####### AS:i:-36 XS:i:-61 XN:i:0 XM:i:9 XO:i:0 XG:i:0 NM:i:9 MD:Z:38T61T11G0A2G0A0G0C3A1 YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC 417 ENSDART00000088520 2685 3 99M5D26M ENSDART00000085701 2597 0 GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@####### AS:i:-61 XS:i:-61 XN:i:0 XM:i:10 XO:i:1 XG:i:5 NM:i:15 MD:Z:38T3T56^AAGAA1G11G0A2G0A0G0C3A1 YT:Z:UP
D00408:164:1:1101:6195:60564#TCCTGAGC#TAGATCGC 417 ENSDART00000145155 142 3 99M5D26M ENSDART00000085701 2597 0 GCCAATCTCCCGTTCCACCACATCCCAAAGCTGCTCTAGTGGATTGAGCTCTGGTGACTGTGGAGGCCATTTGAGTACAGTGAACTCATCGTCATGTTCACCAGTCTGAGATCTTTTTTTTTTTT CCCBCGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGFGGGGGGGGGGGGGEFGGGGGGGGGGGGGGGGGGGGGGGGGGGFGGGGGGGGGGG@####### AS:i:-63 XS:i:-61 XN:i:0 XM:i:11 XO:i:1 XG:i:5 NM:i:16 MD:Z:38T60^AAGAA5C7G0A0C1G0A0G0C2C0A1 YT:Z:UP
In each of these records, we can see that the YT field has the value 'UP' --- from the Bowtie 2 manual:
Value of
UP
indicates the read was part of a pair but the pair failed to aligned either concordantly or discordantly.
However, when I filter unmapped reads from this file --- using samtools view -F 0x0004 --- and grep for this read, I get this exact set of records out! I'm at a loss for what is happening with these reads. Bowtie2 is saying that they failed to align either concordantly or discordantly. However, Samtools is still telling me that they're mapped (but not in a proper pair). Isn't this the definition of a discordant alignment (i.e. both ends of a paired-end read map, but not concordantly?). I'm not sure how I should interpret these records. Should I consider these reads "mapped" or not? If so, should I consider them mapped as orphans or in pairs? And if they're to be considered to be mapped in pairs, then what are the pairs --- certainly they're not the adjacent alignments and, in fact, there aren't even an even number of records so I'm not sure how they should be paired! Any help dealing with this confusion would be greatly appreciated!
Thank you for the explanation. However, do you have any idea why Bowtie 2 is annotating these with
YT:Z:UP
? I would consider the read of a pair that both map, but not concordantly, as a discordant alignment --- which Bowtie 2 says it marks withYT:Z:DP
. It's saying these reads aligned neither concordantly or discordantly, but it appears they did align discordantly, in exactly the way you explained. My followup question, is how would you determine the reads that represent a mapped pair here? I mean, I see how you did it by looking at the alignments and seeing what made sense, but is there any programmatic way to determine which of these mapping should be paired together?May be 'YT:Z:UP' is used when aligner is not sure if the paired reads are mapped as a proper pair or not. When it is sure that they are not mapped in proper pair it uses 'DP' and when mapped in proper pair it uses 'CP'. I am confused too as why bowtie2 didn't regard read pairs mapping on different chromosomes as 'DP'. Why it is still not sure and using 'UP'. See below for the code:
Go to this link: http://broadinstitute.github.io/picard/explain-flags.html
Type 81. Gives you read is paired, read is reverse strand and first in pair. Now unmark everything.
Now, to get the flag for the second read in pair, you will have to click read is paired, second in pair and mate reverse strand (you know first read is on reverse strand). You will get 161. This means both these reads are in pair.
337,417 will have all these features but it will also mark "not primary alignment" feature.
I see; but then, isn't every reverse strand "first-in-pair" read implicitly paired with every "second-in-pair", read whose mate is on the reverse strand? If one or both of the reads maps many places, this could lead to a large number of "implicit" alignments.
Maybe this is just the nature of the way these things are encoded, but it would seem more intuitive to just always place aligned pairs (even discordant) together, repeating the alignment for one of the mates if necessary. This existing encoding makes things particularly difficult because you have to essentially look ahead to the end of all of the records for this read, and then generate mapped pairs by implicitly pairing all "compatible" discordant alignments.
Also, I'm assuming that the UP tag placed by Bowtie 2 is a mistake, since these reads do, in fact, map discordantly (the UP should be DP).
There is no "second-in-pair" flag - it's "last-in-fragment". The SAM specification tries hard not to talk about pairs so the format would be future proof :)
I'd have to double check the code to be sure, but I believe bowtie2 considers these alignments "mixed" rather than discordant, meaning that it's aligning the reads as singletons...thus leading to the YT:Z:UP aux tag.
Processing mates like this as pairs is actually rather difficult, since not every alignment will have an obvious mate (e.g. the alignment to ENSDART00000088520:2685) and you can easily run into cases where there's more than one possible mate (e.g., read2 maps to a tandom repeat). Those end up not being that uncommon in practice. Note that the MAPQ on these is 0 to begin with, so none of these alignments are worth much...so don't stress over them :)
Yes, I totally forgot to say that none of these alignments will be considered by downstream tools including variant callers or RPKM generators so you should take it is easy if you have a concern that they may screw the results.
Actually, I'm working on my alignment-based transcript expression estimation tool (https://github.com/kingsfordgroup/sailfish/tree/feature/fastio), which was exactly what led to this question :). I was trying to expand the types of alignments to consider when estimating isoform abundance. For example, we always consider concordant alignments, but it may also be helpful to consider orphaned reads (e.g. RSEM doesn't do this but eXpress does). We can then go one step further and decide whether or not it is helpful to consider alignments of the type represented in my original post (e.g. by considering discordant alignments that happen to map to the same contig / transcript, or by considering the ends of a discordant mapping separately as orphaned reads).
Try STAR aligner and see how does it flag those sequences in the sam file. Does it too flag the primary alignment for those reads as discordant (81) or it flags them as concordant (83) ? You can then decide which aligner you want to use at the first level to align those reads. This is just a suggestion. You may have already put a lot of work to make your code work with Bowtie2 or Tophat's SAM file.
Actually, I usually use STAR and recommend it to people using my software. However, this BAM file was provided by a user who had used Bowtie 2 for alignment, and I noticed some reads like this that I'd not seen before in my STAR alignments. I'm trying to decide how to have my tool play as nicely as possible with the wide range of different aligners, so I want to make use of as little specialized info (e.g. tags) as possible.