Question: Question About Bwa Pe Mapping And Calling "Properly Mapping Pairs" In Samtools
0
gravatar for neal.platt
6.0 years ago by
neal.platt220
United States
neal.platt220 wrote:

I am using BWA to map PE reads from 300-400bp fragments to a reference genome. After mapping I am using SAMtools to filter out "properly mapping pairs" using the -f 0x002 option.

Here is my BWA/SAMtools syntaxt

 $> bwa aln -t 12 -f <input index read 1> myoLuc2.bwa.bwtsw <read 1 fastq> 
 $> bwa aln -t 12 -f <input index read 2> myoLuc2.bwa.bwtsw <read 2 fastq> 

 $> bwa sampe -a 400 myoLuc2.bwa.bwtsw <input index read 1> <input index read 2> <read1 fastq> <read2 fastq> >mluc_sampe.sam

 $> samtools view -Sb -f 0x002 mluc_sampe.sam | bedtools bamtobed -i - | bedtools sort -i - >mluc_sampe_sorted.bed

Looking at the final BED file...in some cases a read pair maps as would be expected:

AAPE02060494    12296   12318   D3NH4HQ1:150:C1FTFACXX:7:2211:19990:32392/1     15      +
AAPE02060494    12619   12680   D3NH4HQ1:150:C1FTFACXX:7:2211:19990:32392/2     15      -

In this case the ...32392/1 maps ~300 bp downstream from 32392/2 and on the opposite strand.

However the VAST MAJORITY of the reads look like the following:

AAPE02062872    9054    9148    D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903/1      60      +
AAPE02062872    9054    9148    D3NH4HQ1:150:C1FTFACXX:7:1216:8413:88366/1      60      +
AAPE02062872    9054    9148    D3NH4HQ1:150:C1FTFACXX:7:1303:1882:25192/1      60      +
AAPE02062872    9054    9148    D3NH4HQ1:150:C1FTFACXX:7:2109:3484:63930/1      60      +
AAPE02062872    9310    9371    D3NH4HQ1:150:C1FTFACXX:7:1303:18099:51689/2     60      -
AAPE02062872    9311    9371    D3NH4HQ1:150:C1FTFACXX:7:1212:4340:70026/2      60      -
AAPE02062872    9312    9371    D3NH4HQ1:150:C1FTFACXX:7:1301:19221:37959/2     60      -
AAPE02062872    9312    9371    D3NH4HQ1:150:C1FTFACXX:7:2109:14642:87311/2     60      -

In the above scenario the first read ...54903/1 is not mapped with the 54903/2 read. These are the only 8 reads mapping to the AAPE02062872 contig. In fact the 54903/2 read is not mapped period. None of the 8 reads in the above example comprise pairs from the same flowcell location (fragment) though to the eye, they "look" like properly mapping pairs. Should these have been filtered out by SAMtools? Does SAMtools or BWA do something that I am not aware of to justify calling these "properly mapping?"

I checked to see if my fastq files were out of order causing BWA to combine reads that shouldn't be.

The ...54903 reads are both in the input fastq files and at the same line number within that file.

$> grep -n 54903 mluc-FLK_sorted.fq
   7861:@D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903 2:N:0:GCCTAA
$> grep -n 54903 mluc-VES_sorted.fq 
   7861:@D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903 1:N:0:GCCTAA

Am I missing something? Any suggestions or hints? Thanks for your help in advance.


I have included the SAM output below. It is pretty ugly as far as formatting...not sure how to make it prettier.

$> grep AAPE02062872 mluc_sampe.sam

@SQ     SN:AAPE02062872 LN:18857
D3NH4HQ1:150:C1FTFACXX:7:1212:7825:54903        99      AAPE02062872    9055    60      94M     =       9312    317     NTAAAAAAAATATATTTTTTAAAGATGGGGCCTAGGATGTACCAACTGATTATATGCTAATTATTGGTTGGAGGTGAGGGAATCTTAGAACTTG #2<CFHHIJJJJJJJJJJJJJJIJIJJJJJJJJJIJHIIGIGHHEEHFFFFFFFEEEEDEEEDEFDDDDDDCDB<A>BDBDDBDDCACCCACCA  XT:A:U  NM:i:73 SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:2XO:i:0   XG:i:0  MD:Z:0T0G0T0T0T0T0G0C0C0T1T0A0G1G0G0A0A0A1G0G0A0C0A0C0C0A0T0T0G0A1A0T0T4T0G0T0T0T1A0G0G1T0C1A1T0A1T0G1T2C0A1A0T1C0T0C0A0T0T1A0A0G1A0A0C0A0T0T0T0T0A0A0A1
D3NH4HQ1:150:C1FTFACXX:7:1212:4340:70026        147     AAPE02062872    9312    60      60M     =       9055    -317    ATAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT  C>DDDDDDCEDEEDDDDDDDDDCDBDDCACDDCCDDEEEEEEDFFFFFHHHHJJJIJJJJ     XT:A:U  NM:i:1  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:4C55
D3NH4HQ1:150:C1FTFACXX:7:1216:8413:88366        99      AAPE02062872    9055    60      94M     =       9313    317     TCAGGAAAAACCCCCACACAATTTTTAATAAACAGAATTCTGTTACATGACAGTACTGGTGTTACTGCTAAAGGCTTGCAACAATTATCTAATT 8BFHHHHJJJJJJIJJJJJJEIJJJJJJJGIIIJJIJJIJIJGIGGIIJIJIJGIIJJHEGHHHHHHFFFFCEACEEDDCDDDDDD<CCDC:A>  XT:A:U  NM:i:70 SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1XO:i:0   XG:i:0  MD:Z:1G0T0T0T0T0G0C0C0T0T0T0A0G0T0G0G1A2G0G0A0C0A0C0C0A0T0T0G0A1A0T0T1G0T0A0T0G1T0T0T0A1G0T0T0C0A1G1A0A4T0T0C0A0T0A0T0G0C0T0C0A3A1G0A2C0A0T1T3A0G0
D3NH4HQ1:150:C1FTFACXX:7:1301:19221:37959       147     AAPE02062872    9313    60      59M     =       9055    -317    TAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT   @@B@BCCDDDEEDDDDDDCACDDC>DDDCCCC@DDEEFCAECDBCDFFHHHGIIGJIIH      XT:A:U  NM:i:1  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3C55
D3NH4HQ1:150:C1FTFACXX:7:1303:1882:25192        99      AAPE02062872    9055    60      94M     =       9311    317     TTTCCCGGCTCCCTTACCAATAGAGTTAGTCATGTGATATGGTTCCAGCCAATGAGTTGTAAACACAGGTTTTGAATGAGACTTCCTAGAAGTT +=CFHHHJJJJJJJJJJJJJJJJJJCFHIHJJJJIJJIJIJJGGFGHIIJJGIJGIGJJGIGIJJJHHHAHFFFDEEE@C@CDDDDCDCDCDC#  XT:A:U  NM:i:65 SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:0XO:i:0   XG:i:0  MD:Z:1G1T0T0T1C2T0T0A0G1G0G0A2A0G2C0A0C0C0A1T0G0A0A0A0T0T1G1A0T0G1T0T0T0A0G0G0T0T0C0A3A0A1G0T0T0T0T2T0A1G0C1C1T3A0G0A0A0A1A1T0T0T1A0A0G0
D3NH4HQ1:150:C1FTFACXX:7:1303:18099:51689       147     AAPE02062872    9311    60      61M     =       9055    -317    GATAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT :DBA?D@DCDEDDEDCDDDDDDCC@DCCCCDDDEEEEEEEEDFFFFHHHFHIJJJIJJJJJ    XT:A:U  NM:i:1  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:5C55
D3NH4HQ1:150:C1FTFACXX:7:2109:3484:63930        99      AAPE02062872    9055    60      94M     =       9313    317     NGAGTCTGATGATATTAAGTACAAGTAGATATAATAACTAGAATTCCCTTTCATTGCTACTAGAAGTACAAATGTTACAACCGTTTCGAAAAAT #4CFFHHJJJJJJJJJJJJJIJJJJHIJJJJJJJJJJJJJJJJJHIJJJJJJIJJHJJJJJJIJIJIIJIGIJJGIJIHHHHFFFEDDDDDDD5  XT:A:U  NM:i:71 SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1XO:i:0   XG:i:0  MD:Z:0T1T0T1T0G0C0C1T0T0A0G1G0G1A0A1G0G1C0A0C0C2T0G2A0T0T0T0G0T0A0T0G2T0T0A0G0G1T0C0A0A1T0A1T0G0T0T0T0T0C0A0T0A0T0G0C1C0A1T0G2G0A0A0A0C0A0T0T0T0T3G0
D3NH4HQ1:150:C1FTFACXX:7:2109:14642:87311       147     AAPE02062872    9313    60      59M     =       9055    -317    TAAGCGATTGTATAGAAGCCTCAATTCACCAGACTCTATACTAAAAATGGGTGGATTTT   #DDDDDDDDDEEDDDDDDDDDDEEDDDDDDCC@DDEFEDCDDDEEEDDDFFFFFHHHHJ      XT:A:U  NM:i:1  SM:i:37 AM:i:37 X0:i:1  X1:i:0  XM:i:1  XO:i:0  XG:i:0  MD:Z:3C55
samtools bwa filtering • 2.6k views
ADD COMMENTlink modified 5.2 years ago by Biostar ♦♦ 20 • written 6.0 years ago by neal.platt220
1

Have you looked at the SAM file? It will be easier to diagnose this from there.

ADD REPLYlink written 6.0 years ago by Devon Ryan91k

I have updated with information from the SAM file. Thanks.

ADD REPLYlink written 6.0 years ago by neal.platt220
2
gravatar for swbarnes2
6.0 years ago by
swbarnes26.0k
United States
swbarnes26.0k wrote:

Well, first let's fix the mistake that is making it so ugly,.

MD:Z:0T1T0T1T0G0C0C1T0T0A0G1G0G1A0A1G0G1C0A0C0C2T0G2A0T0T0T0G0T0A0T0G2T0T0A0G0G1T0C0A0A1T0A1T0G0T0T0T0T0C0A0T0A0T0G0C1C0A1T0G2G0A0A0A0C0A0T0T0T0T3G0

That string is listing all the discrepancies between what the sai says the sequence of that read is, and what the fastq you gave it actually says. Possibly, you accidentally put the same fastq name twice in the sampe command, instead of two different ones. So redo that step, until that string look sensible, and maybe that will fix your other problem.

ADD COMMENTlink written 6.0 years ago by swbarnes26.0k

You are right I was using the wrong fastq file for the given .sai. I appreciate your help more than you know.

ADD REPLYlink written 6.0 years ago by neal.platt220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1630 users visited in the last hour