Big differences with CollectHsMetrics
6.5 years ago

Hello,

I tested two analysis pipelines for targeted paired-end DNA sequencing:

  1. mapping/alignment with bwa mem, followed by Picard's MarkDuplicates
  2. merging of overlapping read pairs with bbmerge, then mapping/alignment with bwa mem, followed by Picard's MarkDuplicates

I then ran CollectHsMetrics. This is the result:

METRIC                        A            B
GENOME_SIZE                   3095693983   3095693983
BAIT_TERRITORY                6850692      6850692
TARGET_TERRITORY              6850692      6850692
BAIT_DESIGN_EFFICIENCY        1            1
TOTAL_READS                   21580058     12559951
PF_READS                      21580058     12559951
PF_UNIQUE_READS               18330522     6959842
PCT_PF_READS                  1            1
PCT_PF_UQ_READS               0.849419     0.55413
PF_UQ_READS_ALIGNED           18306903     6933545
PCT_PF_UQ_READS_ALIGNED       0.998711     0.996222
PF_BASES_ALIGNED              3075391546   2140104653
PF_UQ_BASES_ALIGNED           2599640506   1150765565
ON_BAIT_BASES                 1447696719   960680816
NEAR_BAIT_BASES               709339799    542955905
OFF_BAIT_BASES                918355028    636467932
ON_TARGET_BASES               889936855    369181943
PCT_SELECTED_BASES            0.701386     0.7026
PCT_OFF_BAIT                  0.298614     0.2974
ON_BAIT_VS_SELECTED           0.671151     0.638905
MEAN_BAIT_COVERAGE            211.321239   140.231208
MEAN_TARGET_COVERAGE          129.904666   53.88973
MEDIAN_TARGET_COVERAGE        123          53
PCT_USABLE_BASES_ON_BAIT      0.46798      0.445238
PCT_USABLE_BASES_ON_TARGET    0.287679     0.171102
FOLD_ENRICHMENT               212.716292   202.846579
ZERO_CVG_TARGETS_PCT          0.010396     0.01021
PCT_EXC_DUPE                  0.154696     0.462285
PCT_EXC_MAPQ                  0.029064     0.020848
PCT_EXC_BASEQ                 0.005036     0.003892
PCT_EXC_OVERLAP               0.25095      0.011729
PCT_EXC_OFF_TARGET            0.656279     0.506522
FOLD_80_BASE_PENALTY          1.732062     1.456479
PCT_TARGET_BASES_1X           0.990236     0.990324
PCT_TARGET_BASES_2X           0.989601     0.989492
PCT_TARGET_BASES_10X          0.98554      0.982661
PCT_TARGET_BASES_20X          0.977191     0.958973
PCT_TARGET_BASES_30X          0.963095     0.89456
PCT_TARGET_BASES_40X          0.941365     0.764513
PCT_TARGET_BASES_50X          0.911599     0.573995
PCT_TARGET_BASES_100X         0.652761     0.020449
HS_LIBRARY_SIZE               18811463     7065860
HS_PENALTY_10X                5.176015     4.791253
HS_PENALTY_20X                5.299345     5.088152
HS_PENALTY_30X                5.426413     5.438156
HS_PENALTY_40X                5.557217     5.862988
HS_PENALTY_50X                5.707456     6.395959
HS_PENALTY_100X               6.649249     -1
AT_DROPOUT                    3.923906     2.330678
GC_DROPOUT                    2.925639     2.474334
HET_SNP_SENSITIVITY           0.98916      0.988827
HET_SNP_Q                     20           20

So far I cannot explain where the big differences in the coverage data come from. I thought that CollectHsMetrics counts the overlapping part of a read pair only once?
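For reference, "counting the overlap only once" means that a base covered by both mates of the same pair contributes 1 to coverage, not 2. A minimal sketch of that idea with made-up read coordinates (not Picard's actual implementation):

```python
# Sketch: per-base coverage where overlapping mates of the same pair are
# counted once. Reads are (pair_id, start, end) on one contig with
# half-open intervals. Hypothetical data, not Picard's implementation.

def pair_aware_coverage(reads, target_len):
    per_pair = {}  # pair_id -> set of positions covered by either mate
    for pair_id, start, end in reads:
        per_pair.setdefault(pair_id, set()).update(range(start, end))
    coverage = [0] * target_len
    for positions in per_pair.values():
        for pos in positions:
            if 0 <= pos < target_len:
                coverage[pos] += 1  # each pair contributes at most 1
    return coverage

# Two mates of pair "p1" overlap on positions 40-49; those bases still
# count once, so the coverage for this single pair never exceeds 1.
reads = [("p1", 0, 50), ("p1", 40, 90)]
cov = pair_aware_coverage(reads, 100)
print(max(cov))  # 1  (overlap not double-counted)
print(sum(cov))  # 90 (union of both mates, not 50 + 50)
```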

Any ideas?

fin swimmer


What did you do with the bbmerge output? It can output both merged and unmerged reads. Did you map both the merged and the unmerged reads (and then combine the BAM files), or did you map only the merged reads with bwa mem?


Sorry, I forgot to mention it. I mapped both the merged and the unmerged reads and combined the BAM files afterwards.

fin swimmer

6.5 years ago

I finally solved it.

The problem was how I used MarkDuplicates and how MarkDuplicates works.

I ran it on the combined BAM file, which contains both the single-end merged reads and the read pairs that could not be merged. MarkDuplicates only considers the 5' mapping position. The mixture of single-end and paired-end reads seems to lead to many more reads being marked as duplicates.

So my next try was to run MarkDuplicates on the BAM files before combining them. That was better, but the mean coverage was still far below that of the pipeline without prior read merging. The problem here was that the merged overlapping read pairs include a high number of short reads (50-60 bases) that share the 5' mapping position but not the 3' end. As MarkDuplicates again only looks at the 5' position, a lot of reads were tagged as duplicates which aren't.
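The over-marking can be illustrated with a toy deduplication that keys single-end reads either on the 5' position alone or on both ends (made-up reads; MarkDuplicates' real duplicate criteria are more involved than this):

```python
# Sketch: mark duplicates among single-end reads. Keying only on the 5'
# position collapses distinct merged reads that start at the same place
# but end differently; keying on both ends keeps them apart.
# Toy reads as (chrom, start, end); not MarkDuplicates' actual logic.

def count_duplicates(reads, key):
    seen, dups = set(), 0
    for read in reads:
        k = key(read)
        if k in seen:
            dups += 1  # a read with this key was already kept
        else:
            seen.add(k)
    return dups

# Three merged reads sharing chr1:100 as 5' end but with different 3' ends.
reads = [("chr1", 100, 150), ("chr1", 100, 160), ("chr1", 100, 155)]

print(count_duplicates(reads, key=lambda r: (r[0], r[1])))        # 2 (5' only)
print(count_duplicates(reads, key=lambda r: (r[0], r[1], r[2])))  # 0 (both ends)
```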

So finally I switched to Clumpify, using its dedupe function based on sequence similarity, before doing anything else with my FASTQ files. The results look good.
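The idea behind sequence-based deduplication can be sketched as follows: two reads are treated as duplicates when their sequences match up to a small number of mismatches. This is a toy model only, not BBTools' clumpify.sh algorithm:

```python
# Sketch: duplicate removal on raw sequences, allowing a few mismatches,
# in the spirit of clumpify.sh dedupe. Toy O(n^2) logic for illustration;
# the real tool clusters reads by k-mer to scale to full FASTQ files.

def mismatches(a, b):
    # Positional mismatches plus any length difference.
    return sum(x != y for x, y in zip(a, b)) + abs(len(a) - len(b))

def dedupe(seqs, max_mismatches=1):
    kept = []
    for s in seqs:
        if not any(mismatches(k, s) <= max_mismatches for k in kept):
            kept.append(s)
    return kept

# Second read differs from the first by one base -> counted as a duplicate.
seqs = ["ACGTACGT", "ACGTACGA", "TTTTCCCC"]
print(len(dedupe(seqs)))  # 2
```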

fin swimmer

6.5 years ago

It seems that I have found at least some of the reasons for the big differences.

First, the Picard tools version I used was 2.9, while the current version is 2.14. The earlier version had a bug in the handling of overlapping read pairs. With the new version, the differences in coverage are slightly smaller.

The second reason is that I used the default settings, which filter out reads with a mapping quality lower than 20. With the pipeline without merging, I have quite a lot of regions with coverage around 700 where most reads have a mapping quality between 20 and 25. In the pipeline with merging, these quality values fall below 20, and I end up with coverage around 50 in those regions. So the whole mean-coverage calculation is biased. But this does not clearly explain why the median coverage is only half as high in the second pipeline.
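The effect of the MAPQ cutoff on the mean can be sketched numerically. The coverage and MAPQ values below are invented in the spirit of the numbers above, just to show how a few high-coverage regions falling under the cutoff drag down the mean:

```python
# Sketch: mean coverage over regions when reads below a MAPQ cutoff are
# dropped. In pipeline A, a high-coverage region carries MAPQ 20-25 reads
# (kept); in pipeline B, merging pushes the same reads below 20 (dropped).
# Illustrative numbers only, not real data from the thread.

def mean_coverage(regions, mapq_cutoff=20):
    # regions: list of (coverage, dominant_mapq) per region
    kept = [cov if mapq >= mapq_cutoff else 0 for cov, mapq in regions]
    return sum(kept) / len(kept)

pipeline_a = [(700, 22), (100, 60), (100, 60)]  # MAPQ 22 survives the cutoff
pipeline_b = [(700, 18), (100, 60), (100, 60)]  # merging drops MAPQ below 20

print(mean_coverage(pipeline_a))  # 300.0
print(mean_coverage(pipeline_b))  # ~66.7 -- the same data, far lower mean
```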

fin swimmer

