picard tools HsMetrics not using duplicate reads to calculate coverage
Entering edit mode
5.1 years ago
Floydian_slip ▴ 150

Hi, I am using the Picard CollectHsMetrics tool to obtain statistics about the regions in my targeted panel, such as mean coverage overall and per interval. Since the panel is small, there is a huge number of duplicate reads, and CollectHsMetrics only takes unique reads into account, so the reported mean coverage is smaller than it actually is. Is there a switch that I can turn on/off to take into account all reads (including duplicate reads) and not just unique reads?

Here is how I run it:

java -jar picard.jar CollectHsMetrics I=sample.bam O=sample.realigned.metrics TI=intervals.bed PER_TARGET_COVERAGE=sample.target_region.coverage BI=intervals.bed REFERENCE_SEQUENCE=hg19.fa


BTW, the input file (sample.bam) has already gone through duplicate marking. Even if I run CollectHsMetrics before marking duplicates, the results are the same.

Any help will be appreciated. Thanks!

picard duplicates HsMetrics
Entering edit mode

Usually, if there are N duplicated reads, N-1 of them are marked as duplicates. Here's a little example of a before/after picard MarkDuplicates:

I've made A1 and A2 duplicates, and A3 and A4 unique - and Picard correctly marks only A1 as duplicated.
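That before/after can be sketched in plain Python (toy read names and FLAG values, not actual MarkDuplicates output): the SAM duplicate bit is 0x400, and after marking, only one read of the A1/A2 duplicate pair carries it.

```python
# Toy illustration of duplicate marking: of two duplicate reads A1/A2,
# only one carries the duplicate bit (0x400) afterwards; A3 and A4 are
# unique and stay unflagged. FLAG values here are made up.
DUP_FLAG = 0x400  # SAM "PCR or optical duplicate" bit

# (read name, FLAG) pairs as they might look after MarkDuplicates
reads = [("A1", 1024), ("A2", 0), ("A3", 0), ("A4", 0)]

marked = [name for name, flag in reads if flag & DUP_FLAG]
print(marked)  # ['A1'] -> only one of the two duplicates is flagged
```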

So to answer your question: coverage, to me anyway, is the number of bases of the genome covered by sequenced reads. To others, it's the amount of the genome covered by fragments/templates that enter the sequencer. For others still, it's the amount of genome covered by signal, which obviously depends on your assay.

In all of these situations, however, duplicates should not affect the coverage calculated. Do you mean depth?
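The breadth-vs-depth distinction can be illustrated with hypothetical read intervals over a single 100 bp target: stacked duplicates do not widen the covered fraction, but they do inflate the per-base pile-up.

```python
# Hypothetical reads over a 100 bp target: (start, end) half-open intervals.
reads = [(0, 50), (0, 50), (0, 50)]  # three reads stacked on the same spot

target_len = 100
per_base = [0] * target_len
for start, end in reads:
    for pos in range(start, end):
        per_base[pos] += 1

breadth = sum(1 for d in per_base if d > 0) / target_len  # fraction covered
mean_depth = sum(per_base) / target_len                   # average pile-up

print(breadth)     # 0.5 -> duplicates do not widen coverage
print(mean_depth)  # 1.5 -> but they do inflate depth
```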

Entering edit mode

Yes, sorry, I meant depth. HsMetrics is reporting a mean depth of 2X in regions where I can see from the BAM file in IGV that there are 30 reads mapping at that location. Please advise.

Entering edit mode
5.1 years ago

since the panel is small, there is a huge number of duplicate reads and CollectHsMetrics only takes into account the unique reads and so the reported mean coverage is smaller than it actually is.

 if (!record.getDuplicateReadFlag()) { // ignore duplicates for unique reads/bases


Is there a switch that I can turn on/off to take into account all reads (incl duplicate reads) and not just unique reads.

as far as I can see in the code: no

Solution: reset the duplicate flag with RevertSam: A: Remove flags of MarkDuplicates (picard)
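Bit-wise, resetting the duplicate flag amounts to clearing the 0x400 bit in each read's FLAG field (which is, among other things, what RevertSam does). A minimal sketch with made-up FLAG values:

```python
# Clearing the SAM duplicate bit (0x400) from a list of FLAG values.
# The FLAG values below are made up for illustration.
DUP_FLAG = 0x400

flags = [1024, 0, 1187, 99]        # some with the duplicate bit set
cleared = [f & ~DUP_FLAG for f in flags]
print(cleared)  # [0, 0, 163, 99] -> duplicate bit removed, other bits kept
```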

Entering edit mode

Thanks, Pierre. I can try that. However, as I mentioned above, running HsMetrics on the sorted BAM file that has not undergone duplicate marking gives the same results: lower depth than actual, due to duplicate reads. So, apparently, HsMetrics does not take duplicate reads into account even at that stage. Any thoughts?

Entering edit mode

It could be mapq==0? Or some other flags (fails quality, supplementary, etc.)
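The kinds of filters mentioned here can be sketched as one predicate (the flag constants are from the SAM spec; the function itself is illustrative, not Picard's actual API):

```python
# Hedged sketch of the filtering applied before a read counts toward depth:
# drop unmapped, QC-fail, duplicate, secondary, and supplementary reads,
# and ignore reads with mapping quality 0.
FLAG_UNMAPPED   = 0x4
FLAG_FAIL_QC    = 0x200
FLAG_DUPLICATE  = 0x400
FLAG_SECONDARY  = 0x100
FLAG_SUPPLEMENT = 0x800
EXCLUDED = (FLAG_UNMAPPED | FLAG_FAIL_QC | FLAG_DUPLICATE |
            FLAG_SECONDARY | FLAG_SUPPLEMENT)

def counts_toward_depth(flag, mapq):
    """True if a read with this FLAG/MAPQ would plausibly be counted."""
    if flag & EXCLUDED:
        return False
    return mapq > 0  # mapq==0 reads are ignored

print(counts_toward_depth(0, 60))     # True:  clean, confidently mapped read
print(counts_toward_depth(1024, 60))  # False: marked duplicate
print(counts_toward_depth(0, 0))      # False: mapq==0
```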

Entering edit mode

Pierre, sorry, I did not follow you; are you suggesting mapq==0 as a switch in HsMetrics? BTW, are there any other tools out there that can do the same, i.e. provide depth information on intervals (supplied in BED format) along with other post-alignment quality metrics? Thanks!

Entering edit mode

If your reads have mapq==0, they will be ignored by HsMetrics.

are there any other tools out there that can do the same: provide depth information on the intervals

GATK: DepthOfCoverage.