Question

Marking Duplicates With Molecular Tag (UMI)

10.9 years ago

brentp 24k

I have some reads that contain a molecular index so I can know whether they are PCR duplicates. I am going to use the 0x400 flag as specified in the SAM spec to mark them as optical/PCR duplicates.

Should I mark all of the reads in a group (having the same POS and molecular tag) with that flag or should I leave the one with the highest quality (or by whatever metric) unmarked?

I will be sending the result to the GATK SNP-calling pipeline.

markduplicates index • 5.5k views

updated 21 months ago by Ram 43k • written 10.9 years ago by brentp 24k

I'm not sure I understand: are your SAM records already marked with the flag 0x4? Or are you looking for a method to set the flag according to the chrom/pos/your-index?

10.9 years ago by Pierre Lindenbaum 161k

No, they are mapped reads. I want to set the flag 0x400 (1024) to show that they are PCR duplicates.
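
As a minimal sketch of what setting that flag means at the bit level: the SAM FLAG field is a bitwise integer, and 0x400 (decimal 1024) is the duplicate bit. The class and method names below are hypothetical, chosen for illustration only; a real pipeline would normally go through a SAM/BAM library rather than raw integers.

```java
/** Illustration only: helpers for the duplicate bit of a SAM FLAG value. */
final class DuplicateFlag {
    /** 0x400 (decimal 1024): "PCR or optical duplicate" in the SAM spec. */
    static final int DUPLICATE = 0x400;

    /** Returns the flag with the duplicate bit set; all other bits are untouched. */
    static int markDuplicate(int flag) {
        return flag | DUPLICATE;
    }

    /** Returns the flag with the duplicate bit cleared; all other bits are untouched. */
    static int clearDuplicate(int flag) {
        return flag & ~DUPLICATE;
    }

    /** True if the duplicate bit is set. */
    static boolean isDuplicate(int flag) {
        return (flag & DUPLICATE) != 0;
    }
}
```

For example, a mapped, properly paired read with FLAG 99 becomes 1123 (99 + 1024) once marked, and clearing the bit restores 99.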

10.9 years ago by brentp 24k

Hi Brent,

Did you ever get this working?

Is the code available for download somewhere? I need it but don't want to re-invent the wheel.

Thank you

updated 2.4 years ago by Ram 43k • written 9.8 years ago by Michele Busby ★ 2.2k

Hi Brent,

I know this is old, but in case anyone else needs this, I have code here that adds the UMI to the BAM file, in the RX and QX fields, by reading it from the original FASTQ: https://github.com/mbusby/AddUMIsToBam

I hear from the Picard team that MarkDuplicates will handle this in its duplicate marking, though I have not tested it myself yet.

updated 21 months ago by Ram 43k • written 8.2 years ago by Michele Busby ★ 2.2k

score 1 · Answer 1 · 2013-05-23

10.9 years ago

Pierre Lindenbaum 161k

You have to leave the one read with the highest quality unmarked and flag the others with the duplicate flag.

Here is the code from Picard MarkDuplicates, where every read is flagged except the best one (https://github.com/nh13/picard/blob/master/src/java/net/sf/picard/sam/MarkDuplicates.java):

for (final ReadEnds end : list) {
    if (end != best) {
        // flag both mates of every non-best pair as duplicates
        addIndexAsDuplicate(end.read1IndexInFile);
        addIndexAsDuplicate(end.read2IndexInFile);
    }
}
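
Applied to the UMI case in the question, the same "flag everything but the best" rule could be sketched as follows. This is a simplified illustration, not Picard's implementation: the `Read` class, its fields, and `markDuplicates` are hypothetical, the grouping key is only chromosome, position and UMI (strand and mate position are ignored), and the score is assumed to have been computed beforehand.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class UmiDuplicateMarker {
    /** SAM FLAG bit for "PCR or optical duplicate". */
    static final int DUPLICATE = 0x400;

    /** Hypothetical minimal read record; not a real library type. */
    static final class Read {
        final String name;
        final String chrom;
        final int pos;
        final String umi;
        final int score; // assumed precomputed, e.g. from base qualities
        int flag;

        Read(String name, String chrom, int pos, String umi, int score, int flag) {
            this.name = name;
            this.chrom = chrom;
            this.pos = pos;
            this.umi = umi;
            this.score = score;
            this.flag = flag;
        }
    }

    /** Sets 0x400 on every read in a (chrom, pos, UMI) group except the best-scoring one. */
    static void markDuplicates(List<Read> reads) {
        // Group reads that share chromosome, position and UMI.
        Map<String, List<Read>> groups = new HashMap<>();
        for (Read r : reads) {
            String key = r.chrom + ":" + r.pos + ":" + r.umi;
            groups.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
        }
        for (List<Read> group : groups.values()) {
            // Pick the highest score; on a tie the first read seen is kept.
            Read best = null;
            for (Read r : group) {
                if (best == null || r.score > best.score) {
                    best = r;
                }
            }
            // The best read stays unmarked; all others get the duplicate bit.
            for (Read r : group) {
                if (r == best) {
                    r.flag &= ~DUPLICATE;
                } else {
                    r.flag |= DUPLICATE;
                }
            }
        }
    }
}
```

A read that is alone in its group is its own "best" and is left unmarked, so reads at the same position with different UMIs are not treated as duplicates of each other.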

10.9 years ago by Pierre Lindenbaum 161k

Yes, if I understand correctly, that seems to be the logic here: https://picard.svn.sourceforge.net/svnroot/picard/trunk/src/java/net/sf/picard/sam/MarkDuplicates.java

However, I don't see them sorting by quality.

10.9 years ago by brentp 24k

In the source, the quality is stored in the member "score":

private short getScore(final SAMRecord rec) {
    short score = 0;
    // sum only the base qualities that are at least 15
    for (final byte b : rec.getBaseQualities()) {
        if (b >= 15) score += b;
    }
    return score;
}
(...)
pairedEnds.score += getScore(rec);

This score is then used to pick the best pair:

for (final ReadEnds end : list) {
    if (end.score > maxScore || best == null) {
        maxScore = end.score;
        best = end;
    }
}
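
To make the scoring concrete, here is a small standalone sketch of the same rule (sum of base qualities that are at least 15). The class and method names are hypothetical, and it accumulates into an `int` rather than the `short` used in the snippet above, which is a choice made for this example only.

```java
/** Illustration only: sum of the base qualities that reach a minimum threshold. */
final class QualityScore {
    /** Threshold taken from the snippet above: qualities below 15 are ignored. */
    static final int MIN_QUALITY = 15;

    static int score(byte[] baseQualities) {
        int score = 0;
        for (byte q : baseQualities) {
            if (q >= MIN_QUALITY) {
                score += q;
            }
        }
        return score;
    }
}
```

For a read with base qualities 30, 10, 20 and 15, the 10 is skipped and the score is 30 + 20 + 15 = 65. Of two reads in the same duplicate group, the one with the larger sum would be kept unmarked.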