Question

picard markduplicate output smaller file

0

Entering edit mode

5.3 years ago

Peter Chung ▴ 200

I am new in WGS analysis. First, I combine all my bam files into one and it's 157GB and then add read group on it to 159GB. Then I do the picard markduplicate step by using the following code:

java -Xmx8g -Djava.io.tmpdir=${TMPFILE} -jar $PICARD MarkDuplicates \
INPUT=${FILE}.addRG.bam \
OUTPUT=${FILE}.addRG.mkdup.bam \
ASSUME_SORTED=true \
TMP_DIR=${TMPFILE} \
REMOVE_DUPLICATES=true \
READ_NAME_REGEX=null \
METRICS_FILE=${FILE}_Metrics.txt

It returns no error but the output file is 18Gb and there is not metrics file generated. I don't know what happened, any advice? Thanks.

The result outputs from picard MarkDuplicates, but there is no error inside.

[Fri Jan 18 08:53:43 UTC 2019] picard.sam.markduplicates.MarkDuplicates done. Elapsed time: 111.00 minutes.
Runtime.totalMemory()=7318011904
To get help, see http://broadinstitute.github.io/picard/index.html#GettingHelp
Exception in thread "main" java.lang.IllegalArgumentException: Alignments added out of order in 
SAMFileWriterImpl.addAlignment for file:///data/data/Samples/CHS                         
/SRS006915/SRS006915.addRG.mkdup.bam. Sort order is coordinate. Offending records are at [*:0] 
and [chrM:1]
    at htsjdk.samtools.SAMFileWriterImpl.assertPresorted(SAMFileWriterImpl.java:213)
    at htsjdk.samtools.SAMFileWriterImpl.addAlignment(SAMFileWriterImpl.java:200)
    at picard.sam.markduplicates.MarkDuplicates.doWork(MarkDuplicates.java:406)
    at picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:282)
    at picard.cmdline.PicardCommandLine.instanceMain(PicardCommandLine.java:98)
    at picard.cmdline.PicardCommandLine.main(PicardCommandLine.java:108)

picard markduplicates gatk • 2.6k views

ADD COMMENT • link updated 11 months ago by Ram 43k • written 5.3 years ago by Peter Chung ▴ 200

0

Entering edit mode

Hello Peter Chung ,

the message you are showing is an error. You can see it by the word Exception. A quick web search suggest, that the sorting order given in the header is different to the alignment. People who uses Picard's ReorderSam seem to have this problem.

So the questions are:

How did you combine your bam files?
How did you sort them?

fin swimmer

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

Entering edit mode

oh thanks. First I used bwa to align them and then use samtools sort to sort each bam files. Afterwards, I combined all the bam files into one bam file by samtools merge. After that, I used samtools addreplacerg to add readgroup.

bwa and samtools sort

REF="/data/data/reference/refs/ucsc.hg19.fasta.gz"
for f in $(ls -l *.bam | awk '$5 < 90000000000 {print $9}' | awk -F"_" '{print $1}'); do
    bwa mem -M -t 8 $REF ${f}_1.fastq.gz ${f}_2.fastq.gz | samtools sort > ${f}_sorted.bam;
done

samtools merge

FNAME=(`pwd | awk -F"/" '{print $6}'`)
LIST=$(for file in *.bam; do echo -n "$file "; done)
samtools merge -nthreads=8 ${FNAME}.bam $LIST

samtools addreplacerg

samtools addreplacerg -r 'ID:${name}' \
-r 'LB:lib1' \
-r 'PL:illumina' \
-r 'PU:unit1' \
-r 'SM:${GP}.${name}' \
-o ${name}.addRG.bam ${name}.bam

any advice? thanks.

ADD REPLY • link updated 5.3 years ago by finswimmer 16k • written 5.3 years ago by Peter Chung ▴ 200

0

Entering edit mode

Hmm, I cannot see any crucial thing. What version of samtools and picard are you using? Also maybe we can see something in the header of the input file for MarkDuplicate (samtools view -H input.bam).

BTW: You can define the ReadGroup already with bwa. Then no extra step with samtools addreplacerg is neccessary.

ADD REPLY • link 5.3 years ago by finswimmer 16k

0

Entering edit mode

Hi, have you tried increasing the heap size? also, check the TMP_DIR location has more than 159 GB free space.

ADD REPLY • link 5.3 years ago by Arup Ghosh 3.2k