Paired-read alignment length from BAM/SAM file
0
0
Entering edit mode
6.9 years ago
Soumitra Pal ▴ 10

Hi, I have aligned a set of paired reads using BWA. I want to know the alignment length of each of the pairs. By length I mean the span of the reference genome where these two reads in the pair are mapped. I was thinking of using the following fields in the detailed sections of the bam file.

4. POS
6. CIGAR
9. TLEN  (ideally this should give this result (positive one from the tow values for the pair))

However, I am not sure how this TLEN was calculated by BWA. Here are six examples from the bam. For brevity I am reporting only the columns 1,3,4,6,9:

Pair-1

D00545:139:C9VKJANXX:1:1101:1032:54904  chr11   33443776    2S99M   294
D00545:139:C9VKJANXX:1:1101:1032:54904  chr11   33444001    32S69M  -294

Here 294 seems the correct length.

Pair-2

D00545:139:C9VKJANXX:1:1101:1032:57548  chr1    59493945    101M    -124
D00545:139:C9VKJANXX:1:1101:1032:57548  chr1    59493922    69M32S  124

Pair-3

D00545:139:C9VKJANXX:1:1101:1032:62552  chr10   22566910    101M    -207
D00545:139:C9VKJANXX:1:1101:1032:62552  chr10   22566804    69M32S  207

Pair-4

D00545:139:C9VKJANXX:1:1101:1032:63292  chr17   43439964    101M    185
D00545:139:C9VKJANXX:1:1101:1032:63292  chr17   43440080    32S69M  -185

For pairs 2,3,4 also seem to have correct lengths 124,207,185 respectively.

However, problem comes with the following pairs.

Pair-5

D00545:139:C9VKJANXX:1:1101:1032:57141  chr19   49390966    101M    -25
D00545:139:C9VKJANXX:1:1101:1032:57141  chr19   49391042    69M32S  25

Here the left one is mapped in the region (49390966,49391066) and the second one in the region (49391042,49391042+68=49391110). So the length should be = (49391110 - 49390966 + 1) = 145 which does not match with 25.

Another pair.

Pair-6

D00545:139:C9VKJANXX:1:1101:1032:73831  chr20   60365970    52M49S  -128
D00545:139:C9VKJANXX:1:1101:1032:73831  chr20   60365784    37S60M4S    128

Here the left one is mapped in the region (60365784, 60365784+60-1) and the second one in the region (60365970,60365970+52=60366021). So the length should be = (60366021 - 60365784 + 1) = 238 which does not match with 128.

Could anybody help? Is there an existing tool to get this length I am interested about?

Thank you so much!

Soumitra

EDIT: All the columns

D00545:139:C9VKJANXX:1:1101:1032:54904  99  chr11   33443776    60  2S99M   =   33444001    294 NTGGCACAGGATAGTTCATAAGCATCCCAAGGTGACTCCAAGGCACACATGTGATAANACAACCNNNNCNNTNTNNNTTCCTTCCCCAATACATTGTATTC   #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF#<FFFFF####<##<#<###<<FFFFFFFFFFFFFFFFFFFFFF   NM:i:11 MD:Z:55A6A0C0T0G1C0C1G1G0T0A24  AS:i:77 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:54904  147 chr11   33444001    60  32S69M  =   33443776    -294    NNNNANNNNNNNNNNNNCCNNCTANGNNTNNNTCAGAAGCNACTNCNNCNNNGAGNCTCTCNNNNGNTGAATGGACTCCCTCCCTTCCCTCCCNCCCNACC   ####/############<<##<</#<##<###BFFFFB<<#F<<#<##<###F<<#FFF<<####<#FFFFFFFFFBFFFFFFFFFFFFBF<B#B<<#BBB   NM:i:16 MD:Z:8T3T1C0T1C0T0T3A5C0T0T0A1C26T3C0T2 AS:i:37 XS:i:20
D00545:139:C9VKJANXX:1:1101:1032:57548  83  chr1    59493945    60  101M    =   59493922    -124    GTCTTATAGCGTGAGTAGAAATTGNNNGNGNNTNNNNAGGGTGNCAGAGTACAAAGTACAAGGTGCAATTGGAGAACAAAGGCCATGAGGGAGGAGGCAGN   FFB<FFFF/B<FFFFBFFF<F<<<###<#/##<####FF<F<<#FFFF<FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFB<B/B<<BB<<#   NM:i:13 MD:Z:24A0C0C1G1G0G1A0C0A0A6G53G2T0  AS:i:75 XS:i:47
D00545:139:C9VKJANXX:1:1101:1032:57548  163 chr1    59493922    60  69M32S  =   59493945    124 TGGNAAGNTGAGGTCTGAAGAAAGTCTTATAGCGNGNNNNGAAATNGACNNNGNNTNCAANGGGTGGCANNNTNNANAGTNNAANNNNNNNNNNNNGNNNN   BBB#<<F#BBFFFFFFFFFFFFFFFFFFFFFFFF#<####<<<FF#<<F###<##<#<<F#<<FFFFFF###<##<#<<<##7<############<####   NM:i:16 MD:Z:0G2A3G26T1A0G0T0A5T3C0G0G1G0G1A3A8 AS:i:38 XS:i:28
D00545:139:C9VKJANXX:1:1101:1032:62552  83  chr10   22566910    60  101M    =   22566804    -207    TAGTAAAATAACAGAGTTACTCAGNNNGNANNGNNNGGTTGTTNACGGTGGAGGGTGTCCAGGTTCTTGGCATCTTGAACAAAGAATTCGTCAAAACGCAN   FFFFFFFFFFFFFFFFFFFFFF<<###<#<##<###FFFFF<<#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBB<<#   NM:i:11 MD:Z:24G0T0G1G1T0A1C0A0T7A56C0  AS:i:80 XS:i:45
D00545:139:C9VKJANXX:1:1101:1032:62552  163 chr10   22566804    60  69M32S  =   22566910    207 AAANCTGNCTCATATGTATACAGACTGCAGCAGGNANNNNACGTCNGGCNNNGNNGNACANTACATGTGNNNCNNTNGTANNGCNNNNNNNNNNNNANNNN   BBB#<BF#B<FFFFFFFFFFFFFFFFFFFFFFFF#<####<<FFF#<<F###<##<#<<F#<<FFFFFF###<##<#<<<##<<############<####   NM:i:16 MD:Z:1C1C3G26A1G0G0C0A5T3T0A0C1A0T1C3A8 AS:i:37 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:63292  99  chr17   43439964    60  101M    =   43440080    185 NCTTGCTAATGTTTGGTATTCTTCAACAGAGCACCAAGGTTTTCTCCCCACGATCAANACCAAGANNNCNNANGNNNTTACCCAACACTAAGACTTCAGAT   #<<<B/FF/<FFBFFFFFFFFFFFFFF</FFFFB<<BFFBFFFFFFFFFFF//F/</#/</<FFF###/##/#/###/<<<F///F/FFB/BFFFFF/7/F   NM:i:12 MD:Z:0G56C7C0T0T1A0G1T1C0G0T0G23    AS:i:75 XS:i:0
D00545:139:C9VKJANXX:1:1101:1032:63292  147 chr17   43440080    60  32S69M  =   43439964    -185    NNNNTNNNNNNNNNNNNACNNGCANCNNGNNNGAGAGGGCNGATNTNNTNNNACANAGGGGNNNNGNTGACTGGGGGATCCTTATAAGTCCTTNCAGNGGG   ####/############<<##/<<#<##<###FFFBF<<<#</<#<##<###<</#BF/<<####<#//BFFFFBF<</BBFFFFFFB//FBB#F<<#BBB   NM:i:15 MD:Z:8A3C1T0C1C0G0C3C5A0G0T0G1T26G3A3   AS:i:39 XS:i:0


D00545:139:C9VKJANXX:1:1101:1032:57141  83  chr19   49390966    60  101M    =   49391042    -25 CACCACTCCCGGCTAATTTTTGTANNNTNANNGNNNNCGGGGTNTCACCATATTGGTCAGGCTGGTCTTGAACTCCGGACCTCAGGTGATCCACCCGCTTN   FFFFBFFFFFFFFFFFFFFFB<<<###7#<##<####FFFF<<#FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBB<<#   NM:i:12 MD:Z:24T0T0T1T1G0T1G0A0G0A6T56C0    AS:i:78 XS:i:54
D00545:139:C9VKJANXX:1:1101:1032:57141  163 chr19   49391042    42  69M32S  =   49390966    25  GGANCTCNGGTGATCCACCCGCTTCAGCCTCCCANANNNNTGGGANTATNNNCNNGNGCCNCTGTGCCCNNNTNNCNGGCNNAANNNNNNNNNNNNCNNNN   BBB#<FF#BBFFFFFFFFFFFFFFFFFFFF<FFF#<####<<<F<#<<<###<##<#<<<#<<FFFFFF###<##<#<<<##<<############7####   NM:i:15 MD:Z:3C3A26A1G0T0G0C5T3A0G0G1G0T1A3A8   AS:i:39 XS:i:33
D00545:139:C9VKJANXX:1:1101:1032:73831  97  chr20   60365970    51  52M49S  =   60365784    -128    NCACCCATTCGTTCATCTCTCCATTCATCCATTCATCCATCCACCCATCCATTCATCNCTCTATCNNNTNNTNCNNNCATCCTTTCATCTCTCCATCCATT   #<<BBFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFBFFFFFFFFFFFFFB#<<FFFFB###<##<#<###</<FBFFFFFFBFFFFFBFFF<FF   NM:i:1  MD:Z:0A51   AS:i:51 XS:i:36
D00545:139:C9VKJANXX:1:1101:1032:73831  145 chr20   60365784    34  37S60M4S    =   60365970    128 NNNNCNNNNNNNNNNNNCTNNCTANCNNTNNNTCCATTCANCCTNTCNTNNNTCCNTCCATNNNNCNATCCATCTATCCATTCATCTCTCTCTNCATNACA   ####<############<<##<</#<##<###BFFFFB<<#F<<#<<#<###<<<#<B<<<####<#FFFFFFFFFFFF<FFFFFF/FFBBBB#F<<#BBB   NM:i:13 MD:Z:3T3T2A1C0T0C3A5T0T0A0T1T26C3   AS:i:34 XS:i:0
bam sam paired-read alignment • 4.0k views
ADD COMMENT
2
Entering edit mode

From an existing sam/bam file, BBMap's reformat.sh can calculate the insert size distribution using the TLEN field:

reformat.sh in=mapped.sam ihist=ihist.txt

...but it assumes the aligner was correct in calculating TLEN. If you map the raw reads using BBMap, you will get what I consider to be a more trustworthy result, which compensates for things like overlapping reads, insert sizes shorter than read length, and indels:

bbmap.sh in1=r1.fq in2=r2.fq ref=ref.fa ihist=ihist.txt

You can also generate an alignment-free insert-size histogram with BBMerge if your reads mostly overlap:

bbmerge.sh in1=r1.fq in2=r2.fq ihist=ihist.txt

...or via extension, if they almost overlap, and you have sufficient coverage and memory:

bbmerge-auto.sh in1=r1.fq in2=r2.fq ihist=ihist.txt extend2=100 prefilter

Note that for human-sized genomes and WGS that last approach would require a substantial amount of memory, though for exomes it wouldn't.

ADD REPLY
0
Entering edit mode

Thanks Brian for your suggestion. I was not familiar with bbmap. Is there any paper that I could refer to?

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode

Looks like a bwa bug, what version are you using?

ADD REPLY
0
Entering edit mode

Program: bwa (alignment via Burrows-Wheeler transformation) Version: 0.7.12-r1039 Contact: Heng Li lh3@sanger.ac.uk

ADD REPLY
1
Entering edit mode

Try the most recent version (0.7.15) and see if that produces more reasonable results. Also, if that doesn't solve the problem then please post the entire alignment so we can see the flags.

ADD REPLY
0
Entering edit mode

Thank you, Devon Ryan, for pointing me to the latest version. I shall run that version. However, edited the question with all the columns in the SAM file, hoping that might help you find the issue.

ADD REPLY
0
Entering edit mode

Thanks for the update, it further suggests that this is a bwa bug.

ADD REPLY
0
Entering edit mode

Could you, Devon Ryan, please let me know what I need to do to file this bug?

ADD REPLY
0
Entering edit mode

Create a new issue on GitHub: https://github.com/lh3/bwa/issues

ADD REPLY
0
Entering edit mode

But it is a potential bug in BWA not samtools.

ADD REPLY
0
Entering edit mode

Sorry! Post above changed to reflect right place. You may need to create a GitHub account (free).

ADD REPLY
0
Entering edit mode

Tagging: lh3

ADD REPLY
0
Entering edit mode

I came to know about 'samtools stats' command, which among other things report insert size distribution. Does samtools use TLEN field? Or does it compute the insert length on its own?

ADD REPLY
0
Entering edit mode

Sorry for not trying this earlier. Apparently, 'samtools stats' just uses the TLEN field, as seen below:

$ samtools stats test.bam | grep ^IS | cut -f2,3 | awk '{if ($2) print}'
25  1
124 1
128 1
185 1
207 1
294 1
ADD REPLY

Login before adding your answer.

Traffic: 3803 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6