Hi,
I have used Picard MarkDuplicates tool to remove duplicates from my bam file. Understanding it correctly, if there are more than 1 reads starting at the same position, then after removing for duplicates, only 1 read will be present in the bam file generated from MARKDUPLICATES based on its quality(correct me if I am wrong here).
Analyzing paired end data I am showing you these 2 reads which I think should be considered as duplicates but MARKDUPLICATES does not consider it as duplicates
FRAGMENT 1
R1
READ1 99 chr2 88015806 255 17S133M = 88016003 340 GAAGTGCATTTTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGTGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTGAGACTAGTCTGGGTAACGCGGGGG DDDDDIIIIIIIIIIIIIIIIIIIIIHHI1FEEHE?CG1D@1<?<C1<1<1<C??H11<F@GEEHEH@C1<GHHIHCDH/<CHHI//0000DG0/DEC?/G</CGG/FHHHII?GHHHIH@.BG?A..9..B8B68..86CHG-,7EHHC MD:Z:133 PG:Z:MarkDuplicates NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:274
R2
READ1 147 chr2 88016003 255 143M7S = 88015806 -340 ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC HHFDGHHCHHGGIIHIHCHFIIIHIIIHIIHGHFIIGIIIIIIIIIIHGIIIHGFHHG<HCIHCFCF?FFFC1?IIIIIHFHHIHHCE@DIHFHEHGIHIHH?@G1FHHIHHIIHIHHIHGGHIIIIHHHGIIIIHHEHFF<CEHD@B@D MD:Z:143 PG:Z:MarkDuplicates NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:274
FRAGMENT 2
R1
READ2 99 chr2 88015806 255 15S68M41S = 88016003 340 GAAGTGCATTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGGGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTG DDDDDGHHHHIIIIIHDDDCDHHH///DE1F@<1DC1<D@F?DE1<@@1D1<<C@111<<1<DF<1<111<11<D<<<E/CEC/<??DD?E0E00<<//;FBG//AFHHI@/;:B/8B.78-BGMD:Z:68 PG:Z:MarkDuplicates NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:209
R2
READ2 147 chr2 88016003 255 143M7S = 88015806 -340 ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC DC</C0?<F?@1IHHEG<1EC?CG?HHG?ED1CC1C<HF?HGC?HCHGDF/C0CEF<11C1?C1F<<H@1HECDF1GGIHCHGECCHC1E?G@<C1@<@CEHGCEH@HHGGIHIIHHHEHHHHEHHEHHCFC@HHFF@HHHHEHHBDDAD MD:Z:143 PG:Z:MarkDuplicates NH:i:1 HI:i:1 jI:B:i,-1 NM:i:0 jM:B:c,-1 nM:i:0 AS:i:209
Now both the R1's have same start position and same sequence with READ2 being shorter than READ1. Similarly R2's also have same start position and same sequence. Why is this read not considered as PCR duplicates This is the command line used for finding duplicates
picard.jar MarkDuplicates INPUT= test.bam OUTPUT= test_dup_removed.bam METRICS_FILE= test_dup_removed_metrics REMOVE_DUPLICATES= true CREATE_INDEX= true
Thanks
it's not the same cigar string == not same sequence, and picard consider this, among other things (like the sample name, etc...)
https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates
Thanks Pierre. Can we provide a list of Barcodes and not 1 BARCODE sequence to BARCODE_TAG option? The library was indeed created with Unique Molecular Barcodes(96 of them)
that is another question.
Oh, have they changed their algorhithm or just the explanation? I have in mind, that picard MarkDuplicate doesn't compare any sequence, but look at the most 5' mapping position of the reads taking clipped bases into account.
According to the example above, both fragments have the same 5' mapping position in the sam field, but READ1 from Fragment1 have 17 soft clipped bases and Read1 from Fragment 2 only 15. So they have a different calculated mapping position.
fin swimmer
Hi finswimmer, I tried the same thing with samtools rmdup and it considers them as duplicates and retains the one with high mapping quality. Regarding the soft clip bases, I don't think soft clipped bases has to do with this thing because the position 88015806 in the above example is the position start of the alignment and not the soft clipped bases. Still puzzled why picard is giving different results than rmdup
You're right. The alignment starts with the first not soft clipped base. If I take the soft clipped bases into account for calculation where my fragment starts it is in your first fragment 17bases earlier and in your second fragment just 15. And therefore MarkDuplicates doesn't consider this as duplicates.
I've found the thread on seqanswers that I have in my mind: http://seqanswers.com/forums/showthread.php?t=5424 Have a look especially at post 7.
fin swimmer
Does picard takes soft clipped bases into account then?