Question: Remove optical duplicates from bam file
7
gravatar for Anna S
4.2 years ago by
Anna S500
PSU
Anna S500 wrote:

Hello,

After searching in this user group and analyzing the output of picard MarkDuplicates, it seems that picard does not output which of the reads it determined to be optical duplicates.  That is, picard gives the number of optical and total duplicates in their metric file, but it only marks in its output bam file which reads are duplicates (optical or otherwise), and not which of those are optical duplicates.

I was requested to find a way to remove the optical duplicates and the only way I can think of doing this is to change the picard source code itself to output such a flag into the bam file (which could then be used to easily filter on).  Before embarking on this task however, I'd like to know if someone has already done this and would this person be so kind as to share the code to spare me the trouble?

Thanks a lot!!!

Anna

sequencing • 5.5k views
ADD COMMENTlink modified 4.1 years ago • written 4.2 years ago by Anna S500

Unfortunately, I expect you'll need to tweak the source code. Since the BAM spec doesn't even distinguish between optical and PCR duplicates I doubt anyone considered someone wanting to do this. Out of curiousity, what sort of situation caused you to need this?
 

ADD REPLYlink written 4.2 years ago by Devon Ryan86k

I'm not sure, actually, because I don't have access to the science problem in this instance!!  I work with someone who does sequencing for a fee, and one of her clients from France would like to remove the optical duplicates from the sequences she generated because they're excessive (without the optical duplicates the ENCODE metric 'pcr bottlenecking' would be in the desired 'mild bottlenecking' range but with the optical duplicates they're in the undesirable 'moderate bottlenecking' range).  I'm just helping her out.
 

ADD REPLYlink written 4.2 years ago by Anna S500
5

You'll will have to sort the read by names, create a group with the reads in the same group/lane/tile having the same position on the genome and the same cigar string, extract the X/Y positions of the reads on the flowcell( http://en.wikipedia.org/wiki/FASTQ_format#Illumina_sequence_identifiers ), retrieve those too close together and keep the one with the highest quality.

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Pierre Lindenbaum114k

Thanks, Pierre!  That's probably a better solution than to customize MarkDuplicates! 

ADD REPLYlink written 4.2 years ago by Anna S500

But MarkDuplicates invokes the code doing that job: see line 50 https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java . May be you'll 'just' have to subclass that class to do the job. You're problem is to disable the code removing the PCR dups.

ADD REPLYlink written 4.2 years ago by Pierre Lindenbaum114k

You've been so helpful, Pierre, one last question... would the distance be euclidean: sqrt((x2-x1)**2 + (y2-y1)**2) <= 100 (OPTICAL_DUPLICATE_PIXEL_DISTANCE) ?

That would be a simple script.  My customer would be very happy!!  Thanks again!

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Anna S500
1

see https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/AbstractDuplicateFindingAlgorithm.java#L176

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Pierre Lindenbaum114k
1

Pierre, I'm CONVINCED that the MarkDuplicates code is wrong based on your description!!

When I wrote a script based on your description, I got only ~1k optical duplicates as opposed to MarkDuplicates ~3 million.  I then wrote a very simple script that simply counts how many duplicates share the same tile, and that was < 4k.  The overall number of duplicates matched (~7 million).  I'm convinced my script is right, it's such a simple one.  I'm willing to share my code, is there a way to attach a file here?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Anna S500

http://gist.github.com ?

ADD REPLYlink written 4.2 years ago by Pierre Lindenbaum114k
3

I'm attaching a remove optical duplicates script, and a count tile duplicates script.

Both take as input a sam file sorted on chr and startPos. They also assume that when the sequence name is parsed by ":" then the tile is the 5th field, x the 6th and y the 7th (e.g. HWI-ST1318:119:H89A3ADXX:1:2209:1705:6933, where tile is '2209', x is '1705' and y is'6933').  Finally, they assume that the file is for a single lane, as I was working with such files.

The remove optical duplicates code is in

The count tile duplicates code is in

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Anna S500
9
gravatar for Anna S
4.1 years ago by
Anna S500
PSU
Anna S500 wrote:

I contacted the Broad Institute, and Nils Homer (samtools team) informed me that they already knew about one bug relating to this.  I tested the fix, and this bug affected not only the number of optical duplicates but also the estimated library size statistic.  To get the bug fix you need version >= Picard 1.120.

Unfortunately I still think there are additional bug(s), and I'm working with Nils on this.  I'll let you know what else I find out!!

ADD COMMENTlink written 4.1 years ago by Anna S500
4

I have found another bug with MarkDuplicates!!  I suspect there's more too but I've had a gizillion deadlines in the past several days and so I haven't looked into it yet; perhaps next week sometime.

In particular, I ran bwa and then picard MarkDuplicates v1.122 (which incorporates the fix above) on lane 1 and lane 2 files separately, and I then did the same for the merged fastq files.  I got the picard MarkDuplicates results below.  Note that the overall % duplication is higher in the merged file than in the individual lane ones (to be expected), but that the OPTICAL duplicates is also higher, that is, it is more than the sum of the OPTICAL duplicates of lanes 1 and 2 (NOT to be expected).

In other words, it is expected that the overall duplication is higher than the sum of the ones in lanes 1 and 2 due to interaction between the reads of lanes 1 and 2 (for ex., read1 from lane1 could have no duplicates in lane1, and read2 from lane2 could have no duplicate in lane2 but in the merged file they could be duplicates of each other).  However, by definition, there should be no interaction in the OPTICAL duplicates between lanes 1 and 2, and therefore the results below suggest that the MarkDuplicates optical duplication algorithm is not checking for the lane.

Sample                 LIBRARY               UNPAIRED_READS_EXAMINED READ_PAIRS_EXAMINED             UNMAPPED_READS                UNPAIRED_READ_DUPLICATES READ_PAIR_DUPLICATES             READ_PAIR_OPTICAL_DUPLICATES                PERCENT_DUPLICATION               ESTIMATED_LIBRARY_SIZE

A_merged          Unknown Library             497938  21072701             939669  326848  9821728                3875890                0.46831                 18726206

A_L1                      Unknown Library             247592  10518632             458565  141958  3109570                1074927                0.298856                18640119

A_L2                      Unknown Library             250347  10554068             481105  147968  3130326                1080576                0.30005                 18605087

 

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Anna S500
2

I have heard back from Nils Homer regarding the lane issue above, and his answer is that MarkDuplicates requires AddOrReplaceReadGroups (or equivalent) to work properly since it extracts the tile,x and y from the name but not the lane.  Here is his full answer:

"Thanks Anna for the example set.  I have observed a few things regarding this issue

The first is that we do not extract the lane # from the read name, only tile, x-coordinate, and y-coordinate.  You can see this in the code here if you are interested: https://github.com/broadinstitute/picard/blob/master/src/java/picard/sam/markduplicates/util/OpticalDuplicateFinder.java#L84-L104

Secondly, we also do not retrieve either the barcode information or library identifier in the read name, since they themselves are not embedded in the read name.  Both barcode and library identifier are also important to condition upon when searching for optical duplicates, or duplicates in general.  

This brings us to where *do* we expect to retrieve this information?  We use the read group header lines to capture lane, barcode, library, flowcell (for Illumina) and other information for specific sets or groups of reads.  If this information is given, which I recommend that as a best practice it should, MarkDuplicates will behave as you expect.  I believe it is much more robust to annotate these metadata in the header rather than rely on parsing read names wholly, since read name structures do change, albeit infrequently.

I would recommend adding read groups to your SAM header within your pipeline.  We use FastqToSam or IlluminaBasecallsToSam to set the read group appropriately depending on our inputs.  In Picard, we also have tools like AddOrReplaceReadGroups that can help you add read groups prior to marking duplicates.

Nils"

ADD REPLYlink written 4.1 years ago by Anna S500
4

Thanks Anna for the great detective work and sharing back what you have learned.

You are like The Little Engine That Could - keep on pushing the issue until it is clarified! 

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Istvan Albert ♦♦ 78k
1

Istvan, I wish I were even more so because I'm convinced I have found yet another bug in their optical duplicates where their number is too high (higher even than the duplicates with the same tile!), but it's awkward for me to report yet a third problem to Dr. Homer.  I think at this point I'd probably have to look at their code and pinpoint the problem, sigh.

ADD REPLYlink written 4.1 years ago by Anna S500

I am sure they want to know as well - and that they would not want to knowingly run code that is not right.

ADD REPLYlink written 4.1 years ago by Istvan Albert ♦♦ 78k

I have just counted a small example BY HAND, and of course it matched my script (!, haha).  When I count the number of duplicates in the same tile for a small example, I get a total of 12.  However, picard MarkDuplicates finds 25 optical duplicate pairs, that is 50.   (the total number of duplicates from picard is correct).

Ok guys, I have just sent NIls Homer this...  I'll let you know what he says.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Anna S500
1

Thanks for all your work, Anna! Sorry to bug you but is there any news on this? Can we now trust MarkDuplicates in, for example, version 2.01?

ADD REPLYlink written 2.7 years ago by terdon410

Yes - i'm just about to fire off MarkDuplicates but if it's wrong, then I might have a problem :P
Obviously we want to mark optical duplicates and PCR duplicates as different somehow (new tag?), or maybe repurpose the Vendor Failed QC flag, since that never seems to get set anyway. 

ADD REPLYlink written 2.7 years ago by John12k

Hi Anna,

Would you mind telling me where I can get the latest version of picard (may be the v1.122 you have mentioned)? The latest version of it on http://sourceforge.net/projects/picard/ was v1.119.

Thank you!

ADD REPLYlink written 3.6 years ago by Emma10

You can get newer versions here.
 

ADD REPLYlink written 3.6 years ago by Devon Ryan86k

@Devon Ryan, thanks for the link ;)

ADD REPLYlink written 3.6 years ago by Emma10

The picard fix decreases the number of optical duplicates by 27-36%.  As I said, I still think the number they report is too high though.

ADD REPLYlink modified 4.1 years ago • written 4.1 years ago by Anna S500
0
gravatar for Pierre Lindenbaum
4.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum114k wrote:

http://picard.sourceforge.net/command-line-overview.shtml#MarkDuplicates

REMOVE_DUPLICATES=Boolean	If true do not write duplicates to the output file instead of writing them with appropriate flags set. Default value: false.
ADD COMMENTlink written 4.2 years ago by Pierre Lindenbaum114k

I think what you propose removes all duplicates, not just the optical duplicates?  MarkDuplicates does work to determine which of the duplicates are optical but this has very limited value as it does not output into its bam file an optical flag.  My customer would like to remove only the optical duplicates but leave the 'pcr' ones for analysis.  Thanks if you have any tips for this!!

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by Anna S500

OK just optical dups. sorry, I didn't understood your question at first sight.

ADD REPLYlink written 4.2 years ago by Pierre Lindenbaum114k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1124 users visited in the last hour