Question: picard markduplicate output smaller file
22 months ago
Peter Chung
Hong Kong
Peter Chung wrote:

I am new to WGS analysis. First, I combined all my BAM files into one (157 GB) and then added read groups to it, which brought it to 159 GB. Then I ran the Picard MarkDuplicates step with the following command:

java -Xmx8g${TMPFILE} -jar $PICARD MarkDuplicates \
INPUT=${FILE}.addRG.bam \
OUTPUT=${FILE}.addRG.mkdup.bam \

It returns no error, but the output file is only 18 GB and no metrics file is generated. I don't know what happened. Any advice? Thanks.

This is the output from Picard MarkDuplicates, but I see no error in it.

[Fri Jan 18 08:53:43 UTC 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 111.00 minutes.
To get help, see
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in SAMFileWriterImpl.addAlignment for file:///data/data/Samples/CHS/SRS006915/SRS006915.addRG.mkdup.bam. Sort order is coordinate. Offending records are at [*:0] and [chrM:1]
    at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:213)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(                         :200)
    at picard.sam.markduplicates.MarkDuplicates.doWork(                         06)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
    at picard.cmdline.PicardCommandLine.instanceMain(                         98)
    at picard.cmdline.PicardCommandLine.main(
Hello Peter Chung,

The message you are showing is an error; you can tell by the word Exception. A quick web search suggests that the sort order given in the header differs from the actual order of the alignments. People who use Picard's ReorderSam seem to run into this problem.

So the questions are:

  • How did you combine your bam files?
  • How did you sort them?

fin swimmer

Oh, thanks. First I used bwa to align the reads and then samtools sort to sort each BAM file. Afterwards, I combined all the BAM files into one with samtools merge. After that, I used samtools addreplacerg to add the read group.

bwa and samtools sort

for f in $(ls -l *.bam | awk '$5 < 90000000000 {print $9}' | awk -F"_" '{print $1}'); do
    bwa mem -M -t 8 $REF ${f}_1.fastq.gz ${f}_2.fastq.gz | samtools sort > ${f}_sorted.bam;
done

samtools merge

FNAME=(`pwd | awk -F"/" '{print $6}'`)
LIST=$(for file in *.bam; do echo -n "$file "; done)
samtools merge -nthreads=8 ${FNAME}.bam $LIST

samtools addreplacerg

samtools addreplacerg -r "ID:${name}" \
-r 'LB:lib1' \
-r 'PL:illumina' \
-r 'PU:unit1' \
-r "SM:${GP}.${name}" \
-o ${name}.addRG.bam ${name}.bam

any advice? thanks.
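
One thing that may be worth checking in the merge step above: samtools merge documents -@ INT (long form --threads INT) for the thread count, and -n for inputs that are sorted by read name. My reading, which is an assumption not verified against the samtools version used here, is that -nthreads=8 is taken as -n followed by -t hreads=8, so the inputs would be merged as if they were name-sorted. That could explain an output that is no longer in coordinate order. A minimal sketch of a helper that prints a merge command with the documented thread option; merge_cmd and the file names are hypothetical:

```shell
# Print (but do not run) a samtools merge command that uses the documented
# thread option (-@) instead of "-nthreads=8".
# merge_cmd is a hypothetical helper, not part of samtools.
merge_cmd() {
    # $1 = thread count, $2 = output BAM, remaining arguments = input BAMs
    threads=$1
    out=$2
    shift 2
    printf 'samtools merge -@ %s %s' "$threads" "$out"
    for bam in "$@"; do
        printf ' %s' "$bam"
    done
    printf '\n'
}

# Usage (file names are placeholders; file names containing spaces are not handled):
#   merge_cmd 8 merged.bam sample1_sorted.bam sample2_sorted.bam
```

Printing the command first makes it easy to confirm that the output file is not also matched by a *.bam glob used for the inputs.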

Hmm, I cannot see anything crucial. Which versions of samtools and Picard are you using? Maybe we can also see something in the header of the input file for MarkDuplicates (samtools view -H input.bam).
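
A minimal sketch of that header check, assuming samtools is on the PATH and input.bam stands for the MarkDuplicates input. The helper name declared_sort_order is made up for illustration; it only reads the SO tag on the @HD line, so it reports what the header claims, not how the records are actually ordered:

```shell
# Print the sort order (SO tag) declared on the @HD line of a SAM header
# read from stdin. Prints nothing if there is no @HD line or no SO tag.
# declared_sort_order is a hypothetical helper name.
declared_sort_order() {
    awk -F'\t' '$1 == "@HD" {
        for (i = 2; i <= NF; i++) {
            if ($i ~ /^SO:/) { print substr($i, 4); exit }
        }
    }'
}

# Usage (assumes samtools is installed and input.bam exists):
#   samtools view -H input.bam | declared_sort_order
#   samtools --version | head -n 1
```

If this prints coordinate while MarkDuplicates still reports records out of order, the header and the actual record order disagree, and re-sorting the merged file with samtools sort would be one way to bring them back in line.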

BTW: You can already define the read group with bwa. Then no extra step with samtools addreplacerg is necessary.
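
A rough sketch of what that could look like, using the -R option of bwa mem, which takes a complete @RG header line and converts the two-character sequence \t into a tab. The helper make_rg is hypothetical; the LB, PL and PU values are copied from the addreplacerg call earlier in the thread:

```shell
# Build the @RG line for `bwa mem -R`. The literal backslash-t sequences are
# intentional: bwa converts them to real tabs.
# make_rg is a hypothetical helper name.
make_rg() {
    # $1 = read group ID, $2 = sample name
    printf '@RG\\tID:%s\\tSM:%s\\tLB:lib1\\tPL:illumina\\tPU:unit1' "$1" "$2"
}

# Usage (not run here; assumes $REF, $GP, $name and the FASTQ files exist):
#   bwa mem -M -t 8 -R "$(make_rg "$name" "$GP.$name")" "$REF" \
#       "${name}_1.fastq.gz" "${name}_2.fastq.gz" \
#       | samtools sort -o "${name}_sorted.bam" -
```

With the read group set at alignment time, every record already carries its RG tag when the per-sample files are merged.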

Hi, have you tried increasing the heap size? Also, check that the TMP_DIR location has more than 159 GB of free space.
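
A small sketch of that free-space check, assuming a POSIX df. The helper name has_free_gib, the 16g heap and the paths in the usage comment are illustrative values, not settings taken from the original command:

```shell
# Succeed if the filesystem holding directory $1 has at least $2 GiB available.
# has_free_gib is a hypothetical helper name.
has_free_gib() {
    # df -Pk prints one header line, then one line per filesystem with the
    # available space in KiB in column 4.
    avail_kib=$(df -Pk "$1" | awk 'NR == 2 { print $4 }')
    [ "$avail_kib" -ge $(( $2 * 1024 * 1024 )) ]
}

# Usage (illustrative values; in.bam, out.bam and $TMPDIR are placeholders):
#   has_free_gib "$TMPDIR" 160 && \
#       java -Xmx16g -jar "$PICARD" MarkDuplicates \
#           INPUT=in.bam OUTPUT=out.bam \
#           METRICS_FILE=out.metrics.txt TMP_DIR="$TMPDIR"
```

Note that MarkDuplicates expects a METRICS_FILE argument, which the command shown in the question does not appear to include.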

