Question: markDuplicates: Histogram not produced
1
gravatar for dariober
4.4 years ago by
dariober9.9k
Glasgow - UK
dariober9.9k wrote:

Hi All- Picard/markDuplicates sometimes produces the histogram of sequencing vs gain and sometimes doesn't. Does anybody know how/why markDuplicates decides this?

(I haven't checked this carefully and I haven't looked at the code... apologies if I missed it in the docs.)

Edit: Java command:

java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=input.bam O=/dev/null M=md.txt AS=true VALIDATION_STRINGENCY=SILENT

java -jar ~/bin/MarkDuplicates.jar --version
1.115(30b1e546cc4dd80c918e151dbfe46b061e63f315_1402927010)

Edit2: Examples of a metrics file with histogram (I stripped some rows) and one without

cat grm030_pb_BS.mm9_pb11.md.txt
## htsjdk.samtools.metrics.StringHeader
# picard.sam.MarkDuplicates INPUT=[/lustre/sblab/berald01/repository/bwameth/grm030_pb_BS.mm9_pb11.bam] OUTPUT=/dev/null METRICS_FILE=grm030_pb_BS.mm9_pb11.md.txt ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Aug 01 11:23:16 BST 2014

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY    UNPAIRED_READS_EXAMINED    READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
    252563    3095840    1110601    38000    182731    156031    0.062608    160862710

## HISTOGRAM    java.lang.Double
BIN    VALUE
1.0    1.052566
2.0    2.085069
3.0    3.097891
4.0    4.091408
...
98.0    46.844894
99.0    47.00454
100.0    47.161142

This file has no histogram:

cat grm035_REBUiLT_AD06.clean.pb11.md.txt
## htsjdk.samtools.metrics.StringHeader
# picard.sam.MarkDuplicates INPUT=[/lustre/sblab/berald01/repository/bwameth/grm035_REBUiLT_AD06.clean.pb11.bam] OUTPUT=/dev/null METRICS_FILE=grm035_REBUiLT_AD06.clean.pb11.md.txt ASSUME_SORTED=true VALIDATION_STRINGENCY=SILENT    PROGRAM_RECORD_ID=MarkDuplicates PROGRAM_GROUP_NAME=MarkDuplicates REMOVE_DUPLICATES=false MAX_SEQUENCES_FOR_DISK_READ_ENDS_MAP=50000 MAX_FILE_HANDLES_FOR_READ_ENDS_MAP=8000 SORTING_COLLECTION_SIZE_RATIO=0.25 READ_NAME_REGEX=[a-zA-Z0-9]+:[0-9]:([0-9]+):([0-9]+):([0-9]+).* OPTICAL_DUPLICATE_PIXEL_DISTANCE=100 VERBOSITY=INFO QUIET=false COMPRESSION_LEVEL=5 MAX_RECORDS_IN_RAM=500000 CREATE_INDEX=false CREATE_MD5_FILE=false
## htsjdk.samtools.metrics.StringHeader
# Started on: Fri Aug 01 11:23:15 BST 2014

## METRICS CLASS    picard.sam.DuplicationMetrics
LIBRARY    UNPAIRED_READS_EXAMINED    READ_PAIRS_EXAMINED    UNMAPPED_READS    UNPAIRED_READ_DUPLICATES    READ_PAIR_DUPLICATES    READ_PAIR_OPTICAL_DUPLICATES    PERCENT_DUPLICATION    ESTIMATED_LIBRARY_SIZE
    46913    7889832    0    18667    74972    49840    0.010654    1220238098
Unknown Library    135291    21433156    0    96225    323809    143791    0.017298    1251759451

Thanks!

markduplicates picard • 2.4k views
ADD COMMENTlink modified 4.4 years ago • written 4.4 years ago by dariober9.9k

what's your java command line please ?

ADD REPLYlink written 4.4 years ago by Pierre Lindenbaum117k

Thanks for replying Pierre. Java command included, I thought there is nothing special there...

ADD REPLYlink written 4.4 years ago by dariober9.9k

and md.txt is empty or not produced ?

ADD REPLYlink written 4.4 years ago by Pierre Lindenbaum117k

I put a couple of example files, they probably look messy. I kind of suspect that if there are multiple read groups in the input bam file the is histogram is not produced, could it be the case?

ADD REPLYlink written 4.4 years ago by dariober9.9k
3
gravatar for Dan D
4.4 years ago by
Dan D6.7k
Tennessee
Dan D6.7k wrote:

In your most recent comment, you said:

I kind of suspect that if there are multiple read groups in the input bam file the is histogram is not produced, could it be the case?

I think you're on the right track. In the code for markduplicates/util/AbstractMarkDuplicatesCommandLineProgram.java:

172  if (metricsByLibrary.size() == 1) {

173       file.setHistogram(metricsByLibrary.values().iterator().next().calculateRoiHistogram());

174  }
ADD COMMENTlink written 4.4 years ago by Dan D6.7k
1
gravatar for dariober
4.4 years ago by
dariober9.9k
Glasgow - UK
dariober9.9k wrote:

Thanks Deedee for digging the code!

I tested on one of my files and it seems to be the case that markDuplicates doesn't produce the histogram with multiple read groups. Here's the code I've used if anybody cares:

Bam file with read groups:

bam=mybam.bam
java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=$bam O=/dev/null M=test_RG.md.txt AS=true VALIDATION_STRINGENCY=SILENT

grep 'HISTOGRAM' test_RG.md.txt # Nothing grepped

Now strip the read group info from the bam file and re-run markDuplicates:

samtools view -@4 -h $bam | grep -v -P '^@RG\t' | sed -r 's/RG:.+?\t//' | samtools view -@4 -Sb - > test.$bam 

java -jar -Xmx2g ~/bin/MarkDuplicates.jar I=test.$bam O=/dev/null M=test.md.txt AS=true VALIDATION_STRINGENCY=SILENT

grep -A10 'HISTOGRAM' test.md.txt
## HISTOGRAM    java.lang.Double
BIN    VALUE
1.0    1.003927
2.0    1.97376
3.0    2.910658
4.0    3.815738
5.0    4.690082
6.0    5.534732
7.0    6.350698
8.0    7.138954
9.0    7.90044

 

 

ADD COMMENTlink written 4.4 years ago by dariober9.9k

Nicely done. That could be useful in the future. One could also use the AddOrReplaceReadGroups Picard tool.

ADD REPLYlink modified 4.4 years ago • written 4.4 years ago by Dan D6.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1897 users visited in the last hour