Question: Picard Mark Duplicates
0
gravatar for Varun Gupta
7 months ago by
Varun Gupta1.1k
United States
Varun Gupta1.1k wrote:

Hi,

I have used Picard MarkDuplicates tool to remove duplicates from my bam file. Understanding it correctly, if there are more than 1 reads starting at the same position, then after removing for duplicates, only 1 read will be present in the bam file generated from MARKDUPLICATES based on its quality(correct me if I am wrong here).

Analyzing paired end data I am showing you these 2 reads which I think should be considered as duplicates but MARKDUPLICATES does not consider it as duplicates

FRAGMENT 1

R1
READ1   99  chr2    88015806    255 17S133M =   88016003    340 GAAGTGCATTTTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGTGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTGAGACTAGTCTGGGTAACGCGGGGG  DDDDDIIIIIIIIIIIIIIIIIIIIIHHI1FEEHE?CG1D@1<?<C1<1<1<C??H11<F@GEEHEH@C1<GHHIHCDH/<CHHI//0000DG0/DEC?/G</CGG/FHHHII?GHHHIH@.BG?A..9..B8B68..86CHG-,7EHHC  MD:Z:133    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:274

R2
READ1   147 chr2    88016003    255 143M7S  =   88015806    -340    ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC  HHFDGHHCHHGGIIHIHCHFIIIHIIIHIIHGHFIIGIIIIIIIIIIHGIIIHGFHHG<HCIHCFCF?FFFC1?IIIIIHFHHIHHCE@DIHFHEHGIHIHH?@G1FHHIHHIIHIHHIHGGHIIIIHHHGIIIIHHEHFF<CEHD@B@D  MD:Z:143    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:274

FRAGMENT 2

R1 
READ2   99  chr2    88015806    255 15S68M41S   =   88016003    340 GAAGTGCATTTTTTTTTTTTTTTTTGGTAGAGTGCCAGGTGCAGTGGCTCATACCTGTAATCCCAGCACTTTGGGGTTGGGGGGGAGGATCGCTTGATCCCAAAGCCCAAATCCCAGGAGTCTG    DDDDDGHHHHIIIIIHDDDCDHHH///DE1F@<1DC1<D@F?DE1<@@1D1<<C@111<<1<DF<1<111<11<D<<<E/CEC/<??DD?E0E00<<//;FBG//AFHHI@/;:B/8B.78-BGMD:Z:68 PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:209

R2
READ2   147 chr2    88016003    255 143M7S  =   88015806    -340    ACTAGCCCGTTGTTATGGCACGTGTCTGCTGAGAGGCTGATGTGGATGTTCGAACCCAGGAGTCCAAAGATGCAGTGAGTAAGCCAGGATCGCACTATCTCAAAAGACTGAGTCTCGCCATATTTGCTCAGGCTGATCTTGAAATACGCC  DC</C0?<F?@1IHHEG<1EC?CG?HHG?ED1CC1C<HF?HGC?HCHGDF/C0CEF<11C1?C1F<<H@1HECDF1GGIHCHGECCHC1E?G@<C1@<@CEHGCEH@HHGGIHIIHHHEHHHHEHHEHHCFC@HHFF@HHHHEHHBDDAD  MD:Z:143    PG:Z:MarkDuplicates NH:i:1  HI:i:1  jI:B:i,-1   NM:i:0  jM:B:c,-1   nM:i:0  AS:i:209

Now both the R1's have same start position and same sequence with READ2 being shorter than READ1. Similarly R2's also have same start position and same sequence. Why is this read not considered as PCR duplicates This is the command line used for finding duplicates

picard.jar MarkDuplicates INPUT= test.bam OUTPUT= test_dup_removed.bam METRICS_FILE= test_dup_removed_metrics REMOVE_DUPLICATES= true CREATE_INDEX= true

Thanks

markduplicates picard • 895 views
ADD COMMENTlink modified 5 months ago by marcela.uliano50 • written 7 months ago by Varun Gupta1.1k
1

it's not the same cigar string == not same sequence, and picard consider this, among other things (like the sample name, etc...)

https://broadinstitute.github.io/picard/command-line-overview.html#MarkDuplicates

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file. An BARCODE_TAG option is available to facilitate duplicate marking using molecular barcodes. After duplicate reads are collected, the tool differentiates the primary and duplicate reads using an algorithm that ranks reads by the sums of their base-quality scores (default method).

ADD REPLYlink written 7 months ago by Pierre Lindenbaum114k

Thanks Pierre. Can we provide a list of Barcodes and not 1 BARCODE sequence to BARCODE_TAG option? The library was indeed created with Unique Molecular Barcodes(96 of them)

ADD REPLYlink written 7 months ago by Varun Gupta1.1k
1

that is another question.

ADD REPLYlink written 7 months ago by Pierre Lindenbaum114k

The MarkDuplicates tool works by comparing sequences in the 5 prime positions of both reads and read-pairs in a SAM/BAM file.

Oh, have they changed their algorhithm or just the explanation? I have in mind, that picard MarkDuplicate doesn't compare any sequence, but look at the most 5' mapping position of the reads taking clipped bases into account.

According to the example above, both fragments have the same 5' mapping position in the sam field, but READ1 from Fragment1 have 17 soft clipped bases and Read1 from Fragment 2 only 15. So they have a different calculated mapping position.

fin swimmer

ADD REPLYlink written 7 months ago by finswimmer6.9k

Hi finswimmer, I tried the same thing with samtools rmdup and it considers them as duplicates and retains the one with high mapping quality. Regarding the soft clip bases, I don't think soft clipped bases has to do with this thing because the position 88015806 in the above example is the position start of the alignment and not the soft clipped bases. Still puzzled why picard is giving different results than rmdup

ADD REPLYlink written 7 months ago by Varun Gupta1.1k

Regarding the soft clip bases, I don't think soft clipped bases has to do with this thing because the position 88015806 in the above example is the position start of the alignment and not the soft clipped bases.

You're right. The alignment starts with the first not soft clipped base. If I take the soft clipped bases into account for calculation where my fragment starts it is in your first fragment 17bases earlier and in your second fragment just 15. And therefore MarkDuplicates doesn't consider this as duplicates.

I've found the thread on seqanswers that I have in my mind: http://seqanswers.com/forums/showthread.php?t=5424 Have a look especially at post 7.

fin swimmer

ADD REPLYlink modified 7 months ago • written 7 months ago by finswimmer6.9k

Does picard takes soft clipped bases into account then?

ADD REPLYlink written 7 months ago by Varun Gupta1.1k
0
gravatar for marcela.uliano
5 months ago by
European Union
marcela.uliano50 wrote:

Aren't you looking for picardtools MarkDuplicatesWithMateCigar ?

ADD COMMENTlink written 5 months ago by marcela.uliano50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 962 users visited in the last hour