Question: Problem using PicardTools MarkDuplicates: Value Was Put Into Pairinfomap More Than Once
1
gravatar for anp375
3.5 years ago by
anp375160
anp375160 wrote:

High, I'm having a problem using PicardTools MarkDuplicates. I had two fastq files for forward and reverse reads. They were both from the same lane, etc. I aligned them with bwa mem using the -M option and got the bam file. I used PicardTools FixMateInformation, SortSam, and samtools index. When I use MarkDuplicates, I get this error:

Exception in thread "main" net.sf.picard.PicardException: Value was put into PairInfoMap more than once. 4: Sample_AD_096:HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 at net.sf.picard.sam.CoordinateSortedPairInfoMap.ensureSequenceLoaded(CoordinateSortedPairInfoMap.java:124) at net.sf.picard.sam.CoordinateSortedPairInfoMap.remove(CoordinateSortedPairInfoMap.java:78) at net.sf.picard.sam.DiskReadEndsMap.remove(DiskReadEndsMap.java:61) at net.sf.picard.sam.MarkDuplicates.buildSortedReadEndLists(MarkDuplicates.java:407) at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:150) at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:177) at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:134) [E::hts_open] fail to open file 'Sample_AD_096_b37_aln-pe.bam'

When I grepped the read name, I got 4 lines back: Primary forward read, primary reverse read, separated by a SAMFlag value of 80 or 16. Secondary forward read, secondary reverse read separated by a SAMFlag value of 80 or 16.

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 387 5 154316645 6 0 39M62H 2 233666062 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACC BBBFFFFFFFFFFIFFFBFFFFFIFBFFBFFFFFIFFFF SA:Z:2,233666062,+,35S66M,60,0;M D:Z:39 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:39 XS:i:0

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 371 5 154316648 6 0 36M65H 2 233666062 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACC 77FFBFFFFFFFFFFFFFFFFBBBBBFFFFFFB7BB SA:Z:2,233666062,-,32S69M,60,0;M D:Z:36 RG:Z:Sample_AD_096 NM:i:0 MQ:i:60 AS:i:36 XS:i:0

I tried isolating primary alignments only, with samtools view -b -F 0x4 -F 0x100 -F 0x800 both.bam > primary.bam, but got the same error even though a grep showed only the primary forward and reverse reads separated by a value of 80 or 16. I did remember to index this. I don't know how to solve this. I have two fastq files so I can't use MergeBamAlignment. Should I combine the fastq files into one? My grep output from the primaries only:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 CCCATCGTCGGAGCTGGTGGCACAGAGGTTGT ACTGACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTT BBBFFFFF FFFFFIFFFBFFFFFIFBFFBFFFFFIFFFFFII<ffffiiifffiiiiiiffffffbbbbffffbbbbbbfffffffff ffffff<bbfffb="" sa:z:5,154316645,+,39m62s,60,0;="" md:z:66="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:66="" xs:i:44<="" strong="">

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ATCGTCGGAGCTGGTGGCACAGAGGTTGTACT GACCCTGTCTCTACAAATAATAAAAAAGCCAGGCATGGTGGTATGCACCTGTAGTCCCAGCTACTTGGG 77FFBFFF FFFFFFFFFFFFFBBBBBFFFFFFB7BBFIIIFFFFFBFFFFFFFIIFBFFFFFFIIIFBFIF<f<fbfbbfffff<fbf ffffffffffbbb="" sa:z:5,154316648,-,36m65s,60,0;="" md:z:69="" rg:z:sample_ad_096="" n="" m:i:0="" mq:i:60="" as:i:69="" xs:i:46<="" strong="">

ADD COMMENTlink modified 3.5 years ago by Matt Shirley9.1k • written 3.5 years ago by anp375160
2
gravatar for Matt Shirley
3.5 years ago by
Matt Shirley9.1k
Cambridge, MA
Matt Shirley9.1k wrote:

It looks like you have actual duplicate reads based on the read name and not the flags. e.g.:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 115 2 233666062 6 0 32S69M 5 154316648 0 ...  

My guess is that you had some newish data from a HiSeq where the read pair identifiers come after the read names, separated by a space:

HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 1:x:xx:xxxxx 131 2 233666062 6 0 35S66M 5 154316645 0 ...  
HWI-ST1341:97:C7CKRACXX:8:2201:11012:61476 2:x:xx:xxxxx 115 2 233666062 6 0 32S69M 5 154316648 0 ...  

One of your pre-processing steps probably stripped everything in the read name after the space, and now Picard thinks these are duplicates.

I forgot what the exact format for the 1:x:xx:xxxxx portion is, but you can find the definition from illumina. It will be something like 1:Y:0:ATCGAT which is read number:passing filter:control flag:barcode.

ADD COMMENTlink written 3.5 years ago by Matt Shirley9.1k
1

Thank you. I think something weird happened with the assignments. The secondary reads seem to be assigned as mates to the primary reads.

ADD REPLYlink written 3.5 years ago by anp375160

Glad you figured it out.

ADD REPLYlink written 3.5 years ago by Matt Shirley9.1k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 889 users visited in the last hour