Hi,
I have mapped genomic Illumina 2x125bp reads to a reference transcriptome with bowtie2 in --very-sensitive-local
mode. Mostly, the output makes sense, but I'm confused about why bowtie2 is calling some of my mapped read pairs discordant (based on the bit flag and the YT:Z:DP tag). A few examples from the sam output:
HWI-D00550:274:C77P6ANXX:5:1310:7220:87933 161 Contig138 157 42 45S80M = 281 294 TATATATACACACGCATGTAATATTTTTTAGGTCTAAAAATAAAGTGAGAGATGGTCCAAGAAAAGAAACTTAATGTAATGTTAACGTCTTCATTTATTAAAACTGGGACAATGACAGTTTACTT BBBC@GEGGGGEGGGGGGGGGD@:DFCDEG1BF1C1<EF>GCDGGGF10=C=BBDFF:0=E00@@FGBGBG@FCCGG@CC@@@BGG///8/;/BF?GGEGEEG/F/C/>/:>G/77/BF//B//9 AS:i:141 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:19G21T30A7 YS:i:245 YT:Z:DP RG:Z:11675X173
HWI-D00550:274:C77P6ANXX:5:1310:7220:87933 81 Contig138 281 42 125M = 157 -294 TCTGAACACAATATATTAAAACCAGTCTTAACATGAAAAATGCACTTGTGTTGTATCAAAATGTTTTCATACAGATGTGGCCGCAGGCAGATGTGCACAACAGCTGCCAAAACATGGATTTTTTT 9=5/GGCFCGGEGGGGGGGC:GEEEGCGF@GGGGGGCGGFEGGEGGGF@FCFFEGBEGGGDBDGECF1GGFEGGGGGGGGGGGGGGGFGGGGGGGFDGGGGGGGGGGGGGGGGGGGFGGEA03BA AS:i:245 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:121A3 YS:i:141 YT:Z:DP RG:Z:11675X173
HWI-D00550:274:C77P6ANXX:5:2306:1952:38805 161 Contig245 1181 36 46S45M = 1304 281 GTTGTATGTAGAAAAGAGAAACACACCTAATATGTTTGGCTATTACCTTTTTCTCCAGCTTCATGTCCAGCCTCAGATCGATGAAAATGGG BBBBBGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEDGGC0=EFB@GGGGCE>FDFEGG0:0BB>FGFGDFGG> AS:i:83 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:36AYS:i:204 YT:Z:DP RG:Z:11675X173
HWI-D00550:274:C77P6ANXX:5:2306:1952:38805 81 Contig245 1304 36 112M13S = 1181 -281 GGGATAGTACCTCATTATGTACACGTCATCTACAGGCGCCCAAGCCGTCTTCCCAAAGGGCTTGTGTCTGTCTGTGTCATCAATGACCTTCTTTTCTTTCTTTTCTTTCTTTTCTTTCTTTGCAT GGG==B@B6-A9/=CGFF/4.FEG=GGGGGGGGGGAGFGFGGGA>FGGGGGGGEGEDFCFGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGGEGGGGGGGGGGGGGGGGEEB>@=B AS:i:204 XN:i:0 XM:i:3 XO:i:0 XG:i:0 NM:i:3 MD:Z:24A13T30T42 YS:i:83 YT:Z:DP RG:Z:11675X173
For both of these pairs, the mapping quality for both mates was high (>36), the inferred fragment length is greater than the minimum of 250 I had specified in the -I parameter (I also tried mapping with -I 0 and still saw these pairs classified as discordant), and the read pair orientation indicated by the leftmost mapping positions of the mates mapping to the forward and reverse strands is similar to what I see for read pairs that bowtie2 classifies as concordantly mapped (so should be forward-reverse).
I would really appreciate it if someone can help point out what I'm missing here and why these reads should not be considered mapped as a proper pair.
Thanks very much!
I think it's because your actual mapped DNA region is less than 250 for both.
So it starts at 281, goes for 125bp, and the start of the mate is -294 backwards. This would put it at 112.
But the mate doesn't start at 112, its POS (of first mapped position) is 157. That's because there's 45bp of soft-clipped stuff before it. 112+45 = 157.
So how much is actually mapped?
(281+125M) - 157 = 249bp
The other pair is 235bp
Maybe that explains it? I don't really know much about Bowtie2 and why it does the things it does.
Thank you very much for pointing that out, John! That makes sense. I thought the softclipped parts would be included in the estimation of fragment sizes, but I can see the logic in not doing that. I really appreciate your help!
No worries, if it's any consolation, I spent about 20 minutes looking at read flags, Bowtie2 tags, phred scores - even drawing little pictures of read mapping orientation - before figuring out TLEN might not be the distance from the first mapped base to the last :P They don't make it easy for us eh!! :)
Can u try with
--no-discordant
without mentioning-l
and see what happens to the reads? Maybe you can subset few reads which includes the above mentioned reads. I could also see the reads are soft clipped, and I don't know if it effects the insert size calculation.Thanks for this suggestion. I thought that I had already done this, but I must have made a mistake because now when I tried mapping again without specifying the
-I
parameter (so using the default of 0), bowtie2 classifies these two read pairs as concordant. So the softclipped parts of the reads must have been what caused the confusion, as John suggested. Thanks!Glad to know. :-)