Question: Split the BAM file by chromosome to speed up Picard MarkDuplicates
23 months ago by
lghust2011 wrote:

I use Picard to mark duplicates and found that Picard does not support multiple threads, so it is very slow. To speed it up, I want to split the BAM file by chromosome and then run Picard on each file. The problem is that, as far as I know, Picard has an advantage over samtools rmdup because it can mark cross-chromosome duplicates. So if I split the BAM file by chromosome, how much will this affect the result? Here is my reasoning:

The two reads of a pair must come from the same DNA fragment, so they normally map to the same chromosome. Sometimes, however, the two reads map to different chromosomes, perhaps because of a structural variation or a repeat such as a microsatellite. If I only want to call SNVs and indels, may I ignore the cross-chromosome duplicates? Please let me know if there is anything wrong with my reasoning. Any reply will be much appreciated!
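
One way to gauge how much the split could matter is to measure how many pairs actually have mates on different chromosomes. The following is only a rough sketch, not part of Picard or samtools: the function name `cross_chromosome_fraction` is made up for illustration, and it assumes the input is plain SAM text (for example, the output of `samtools view`), using the RNAME, RNEXT and FLAG columns as defined in the SAM specification.

```python
def cross_chromosome_fraction(sam_lines):
    """Count pairs whose mates map to different chromosomes.

    Hypothetical helper for illustration. Each pair is counted once,
    through its first-in-pair primary alignment.
    Returns (cross_pairs, mapped_pairs, fraction).
    """
    cross = 0
    total = 0
    for line in sam_lines:
        if line.startswith("@"):          # skip SAM header lines
            continue
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])             # column 2: FLAG
        rname = fields[2]                 # column 3: RNAME
        rnext = fields[6]                 # column 7: RNEXT ("=" means same as RNAME)
        # Keep only paired (0x1), first-in-pair (0x40) records.
        if not (flag & 0x1) or not (flag & 0x40):
            continue
        # Skip unmapped read (0x4), unmapped mate (0x8),
        # secondary (0x100) and supplementary (0x800) alignments.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        total += 1
        if rnext != "=" and rnext != rname:
            cross += 1
    fraction = cross / total if total else 0.0
    return cross, total, fraction
```

If the fraction is very small, the pairs affected by a per-chromosome split are correspondingly few; whether that is acceptable for SNV and indel calling remains a judgment call.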

modified 23 months ago by Pierre Lindenbaum • written 23 months ago by lghust2011

Alternatively, if the effect is significant, how can I compensate for it?

written 23 months ago by lghust2011

You can alternatively use Clumpify, which does duplicate-marking or duplicate-removal prior to mapping and is extremely fast.

written 23 months ago by Brian Bushnell
23 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

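You could split your BAM into both-mapped-chr1.bam, both-mapped-chr2.bam, both-mapped-chr3.bam, (..), and 'others.bam', so that each pair with both mates mapped to the same chromosome goes into that chromosome's file and everything else goes into 'others.bam'.

However, I don't know if creating those new BAMs will reduce the computing time.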
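
The routing rule behind that split can be sketched roughly as follows. This is an illustrative assumption about how the files would be filled, not an existing tool: `bucket_for` is a hypothetical helper that takes the FLAG, RNAME and RNEXT fields of one SAM record and returns the name of the output file. In this sketch, unpaired reads also go to 'others.bam'.

```python
def bucket_for(flag, rname, rnext):
    """Pick the output file for one SAM record (hypothetical helper).

    A record goes to its chromosome's file only when it is paired and
    both mates are mapped to the same chromosome; anything else
    (cross-chromosome pairs, unmapped reads or mates, unpaired reads)
    goes to 'others.bam'.
    """
    paired = bool(flag & 0x1)          # read is part of a pair
    unmapped = bool(flag & 0x4)        # this read is unmapped
    mate_unmapped = bool(flag & 0x8)   # its mate is unmapped
    same_chrom = rnext in ("=", rname) # "=" means mate is on RNAME
    if paired and not unmapped and not mate_unmapped and same_chrom:
        return f"both-mapped-{rname}.bam"
    return "others.bam"
```

Because both mates of a cross-chromosome pair land in 'others.bam', running MarkDuplicates on that file separately should still see those pairs together, which is presumably the point of keeping an 'others' bucket.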
