Picard Mark Duplicates
1
0
Entering edit mode
6.5 years ago
Varun Gupta ★ 1.3k

Hi,

I have used Picard MarkDuplicates tool to remove duplicates from my bam file. Understanding it correctly, if there are more than 1 reads starting at the same position, then after removing for duplicates, only 1 read will be present in the bam file generated from MARKDUPLICATES based on its quality(correct me if I am wrong here).

Analyzing paired end data I am showing you these 2 reads which I think should be considered as duplicates but MARKDUPLICATES does not consider it as duplicates

FRAGMENT 1

R1
READ1   99  chr2    88015806    255 17S133M =   88016003    340 GAAGTGCATTTTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGTGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTGAGACTAGTCTGGGTAACGCGGGGG  DDDDDIIIIIIIIIIIIIIIIIIIIIHHI1FEEHE?CG1D@1<?<C1<1<1<C??H11<F@GEEHEH@C1<GHHIHCDH/<CHHI//0000DG0/DEC?/G</CGG/FHHHII?GHHHIH@.BG?A..9..B8B68..86CHG-,7EHHC  MD:Z:133    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:274

R2
READ1   147 chr2    88016003    255 143M7S  =   88015806    -340    ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC  HHFDGHHCHHGGIIHIHCHFIIIHIIIHIIHGHFIIGIIIIIIIIIIHGIIIHGFHHG<HCIHCFCF?FFFC1?IIIIIHFHHIHHCE@DIHFHEHGIHIHH?@G1FHHIHHIIHIHHIHGGHIIIIHHHGIIIIHHEHFF<CEHD@B@D  MD:Z:143    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:274

FRAGMENT 2

R1 
READ2   99  chr2    88015806    255 15S68M41S   =   88016003    340 GAAGTGCATTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGGGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTG    DDDDDGHHHHIIIIIHDDDCDHHH///DE1F@<1DC1<D@F?DE1<@@1D1<<C@111<<1<DF<1<111<11<D<<<E/CEC/<??DD?E0E00<<//;FBG//AFHHI@/;:B/8B.78-BGMD:Z:68 PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:209

R2
READ2   147 chr2    88016003    255 143M7S  =   88015806    -340    ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC  DC</C0?<F?@1IHHEG<1EC?CG?HHG?ED1CC1C<HF?HGC?HCHGDF/C0CEF<11C1?C1F<<H@1HECDF1GGIHCHGECCHC1E?G@<C1@<@CEHGCEH@HHGGIHIIHHHEHHHHEHHEHHCFC@HHFF@HHHHEHHBDDAD  MD:Z:143    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:209

Now both the R1's have same start position and same sequence with READ2 being shorter than READ1. Similarly R2's also have same start position and same sequence. Why is this read not considered as PCR duplicates This is the command line used for finding duplicates

picard.jar MarkDuplicates INPUT= test.bam OUTPUT= test_dup_removed.bam METRICS_FILE= test_dup_removed_metrics REMOVE_DUPLICATES= true CREATE_INDEX= true

Thanks

picard markduplicates • 8.1k views
ADD COMMENT
1
Entering edit mode

it's not the same cigar string == not same sequence, and picard consider this, among other things (like the sample name, etc...)

https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. An BARCODE_TAG option is available to facilitate duplicate marking using molecular barcodes. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).

ADD REPLY
0
Entering edit mode

Thanks Pierre. Can we provide a list of Barcodes and not 1 BARCODE sequence to BARCODE_TAG option? The library was indeed created with Unique Molecular Barcodes(96 of them)

ADD REPLY
1
Entering edit mode

that is another question.

ADD REPLY
0
Entering edit mode

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.

Oh, have they changed their algorhithm or just the explanation? I have in mind, that picard MarkDuplicate doesn't compare any sequence, but look at the most 5' mapping position of the reads taking clipped bases into account.

According to the example above, both fragments have the same 5' mapping position in the sam field, but READ1 from Fragment1 have 17 soft clipped bases and Read1 from Fragment 2 only 15. So they have a different calculated mapping position.

fin swimmer

ADD REPLY
0
Entering edit mode

Hi finswimmer, I tried the same thing with samtools rmdup and it considers them as duplicates and retains the one with high mapping quality. Regarding the soft clip bases, I don't think soft clipped bases has to do with this thing because the position 88015806 in the above example is the position start of the alignment and not the soft clipped bases. Still puzzled why picard is giving different results than rmdup

ADD REPLY
0
Entering edit mode

Regarding the soft clip bases, I don't think soft clipped bases has to do with this thing because the position 88015806 in the above example is the position start of the alignment and not the soft clipped bases.

You're right. The alignment starts with the first not soft clipped base. If I take the soft clipped bases into account for calculation where my fragment starts it is in your first fragment 17bases earlier and in your second fragment just 15. And therefore MarkDuplicates doesn't consider this as duplicates.

I've found the thread on seqanswers that I have in my mind: http://seqanswers.com/forums/showthread.php?t=5424 Have a look especially at post 7.

fin swimmer

ADD REPLY
0
Entering edit mode

Does picard takes soft clipped bases into account then?

ADD REPLY
0
Entering edit mode
6.4 years ago

Aren't you looking for picardtools MarkDuplicatesWithMateCigar ?

ADD COMMENT

Login before adding your answer.

Traffic: 1697 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6