Question: picard tools HsMetrics not using duplicate reads to calculate coverage
1
gravatar for Floydian_slip
4.2 years ago by
Floydian_slip130
United States
Floydian_slip130 wrote:

Hi, I am using Picard CollectHsMetrics tool to obtain a bunch of statistics about the regions in my targeted panel such as mean coverage and mean coverage for each interval. Since the panel is small, there is a huge number of duplicate reads and CollectHsMetrics only takes into account the unique reads and so the reported mean coverage is smaller than it actually is. Is there a switch that I can turn on/off to take into account all reads (incl duplicate reads) and not just unique reads.

Here is how I run it:

java -jar picard.jar CollectHsMetrics I=sample.bam O=sample.realigned.metrics TI=intervals.bed PER_TARGET_COVERAGE=sample.target_region.coverage BI=intervals.bed REFERENCE_SEQUENCE=hg19.fa

BTW, the input file (sample.bam) has gone through the step of duplicates marking. Even if I run CollectHsMetrics before marking duplicates, the results are the same.

Any help will be appreciated. Thanks!

hsmetrics picard duplicates • 2.7k views
ADD COMMENTlink modified 2.2 years ago by Biostar ♦♦ 20 • written 4.2 years ago by Floydian_slip130

Usually, if there are N duplicated reads, N-1 of them are marked as duplicates. Here's a little example of a before/after picard MarkDuplicates:

enter image description here

I've made A1 and A2 duplicates, and A3 and A4 unique - and Picard correctly marks only A1 as duplicated.

So to answer your question, coverage - to me anyway - is the amount of bases of the genome covered by sequenced reads. To others, it's the amount of the genome covered by fragments/templates that enter the sequencer. For others, it's the amount of genome covered by signal - which obviously depends on your assay.

In all of these situations however, duplicates should not effect the coverage calculated. Do you mean depth?

ADD REPLYlink modified 4.2 years ago • written 4.2 years ago by John12k

Yes, sorry, I meant depth. HsMetrics is reporting a mean depth of 2X in regions where I can see from the bam file in IGV that there are 30 reads mapping in that location. Please advise.

ADD REPLYlink written 4.2 years ago by Floydian_slip130
0
gravatar for Pierre Lindenbaum
4.2 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum129k wrote:

since the panel is small, there is a huge number of duplicate reads and CollectHsMetrics only takes into account the unique reads and so the reported mean coverage is smaller than it actually is.

yes, https://github.com/broadinstitute/picard/blob/master/src/java/picard/analysis/directed/TargetMetricsCollector.java#L432

 if (!record.getDuplicateReadFlag()) { // ignore duplicates for unique reads/bases

Is there a switch that I can turn on/off to take into account all reads (incl duplicate reads) and not just unique reads.

as far as I can see in the code: no

Solution; reset the duplicate flag with RevertSam : A: Remove flags of MarkDuplicates (picard)

ADD COMMENTlink written 4.2 years ago by Pierre Lindenbaum129k

Thanks, Pierre. I can try that. However, as I mentioned above that running HsMetrics on the sorted bam file which has not undergone duplicate marking gives the same results: Lower depth than actual due to duplicate reads. So, apparently, HsMetrics does not take duplicate reads into account even at that stage. Any thoughts?

ADD REPLYlink written 4.2 years ago by Floydian_slip130
1

it could be the mapq==0 ? some other flags (fails quality , supplementary, etc... )

ADD REPLYlink written 4.2 years ago by Pierre Lindenbaum129k

Pierre, sorry, I did not follow you; you suggest mapq==0 as a switch in HsMetrics? BTW, are there any other tools out there that can do the same: provide depth information on the intervals (supplied in a bed format) and some other post-alignment quality metrics? Thanks!

ADD REPLYlink written 4.2 years ago by Floydian_slip130
1

if your reads have a mapq==0 they will be ignored in HsMetrics

are there any other tools out there that can do the same: provide depth information on the intervals

GATK :Depth of Coverage.

ADD REPLYlink written 4.2 years ago by Pierre Lindenbaum129k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1493 users visited in the last hour