Question: Split the BAM file by chromosome to speed up Picard MarkDuplicates
2.7 years ago by
lghust2011 wrote:

I use Picard to mark duplicates, but it does not support multiple threads and is very slow. To speed it up, I want to split the BAM file by chromosome and then run Picard on each file. The problem is that, as far as I know, Picard has an advantage over samtools rmdup because it can mark cross-chromosome duplicates. So if I split the BAM file by chromosome, how much will that affect the result? Here is my reasoning:

The two reads of a pair must come from the same DNA fragment, so they normally map to the same chromosome. Sometimes, however, the two reads map to different chromosomes, perhaps because of a structural variation or a repeat such as a microsatellite. If I only want to call SNVs and indels, can I ignore the cross-chromosome duplicates? Please let me know if there is anything wrong with my reasoning. Any reply will be much appreciated!
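One way to gauge how much this matters for a given dataset is to count how many mapped pairs have mates on different chromosomes. The following is only a rough sketch: it reads SAM text records (for example the output of `samtools view`), and the function name `cross_chromosome_fraction` is made up for illustration. It relies on the standard SAM FLAG bits and on RNEXT being `=` when the mate is on the same reference.

```python
def cross_chromosome_fraction(sam_lines):
    """Fraction of fully mapped pairs whose mates map to different references.

    Hypothetical helper: expects tab-separated SAM text records.
    """
    both_mapped = 0
    cross = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag = int(fields[1])
        rname, rnext = fields[2], fields[6]
        # Count each pair once: only paired (0x1) first-in-pair (0x40) records.
        if not flag & 0x1 or not flag & 0x40:
            continue
        # Skip unmapped read (0x4), unmapped mate (0x8),
        # secondary (0x100) and supplementary (0x800) records.
        if flag & (0x4 | 0x8 | 0x100 | 0x800):
            continue
        both_mapped += 1
        # RNEXT is "=" when the mate maps to the same reference as the read.
        if rnext != "=" and rnext != rname:
            cross += 1
    return cross / both_mapped if both_mapped else 0.0
```

If the fraction is small, the pairs that a per-chromosome split would handle differently are correspondingly few, although this number alone does not say how many of them are actual duplicates.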


Alternatively, if the effect is significant, how can I compensate for it?

written 2.7 years ago by lghust2011

You can alternatively use Clumpify, which does duplicate-marking or duplicate-removal prior to mapping and is extremely fast.

written 2.7 years ago by Brian Bushnell
2.7 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

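You could split your BAM into both-mapped-chr1.bam, both-mapped-chr2.bam, both-mapped-chr3.bam, (..), and 'others.bam', so that each per-chromosome file holds only the pairs with both mates mapped to that chromosome and everything else goes to 'others.bam'.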

However, I don't know whether creating those new BAMs will reduce the computing time.
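The routing rule behind such a split could look roughly like the sketch below. It is an illustration, not a tested pipeline: it works on SAM text records rather than on BAM files directly, and the names `bucket_for` and `split_by_bucket` are hypothetical. Here single-end reads and pairs with an unmapped mate go to 'others.bam'. Because each record is routed on its own, a secondary or supplementary alignment may end up in a different bucket than its primary record, so that case would need extra care in real use.

```python
from collections import defaultdict


def bucket_for(sam_line):
    """Return the output file name for one SAM record (hypothetical helper)."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname, rnext = fields[2], fields[6]
    paired = bool(flag & 0x1)
    # 0x4: this read is unmapped, 0x8: its mate is unmapped.
    any_unmapped = bool(flag & (0x4 | 0x8))
    # RNEXT is "=" when the mate maps to the same reference as the read.
    same_ref = rnext == "=" or rnext == rname
    if paired and not any_unmapped and same_ref and rname != "*":
        return f"both-mapped-{rname}.bam"
    return "others.bam"


def split_by_bucket(sam_lines):
    """Group SAM records by target file name, skipping header lines.

    In a real split, the header would have to be written to every output file.
    """
    buckets = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        buckets[bucket_for(line)].append(line)
    return dict(buckets)
```

Each per-chromosome bucket could then be processed by a separate Picard job, with 'others.bam' handled as one more job. Whether that is faster overall than a single run would have to be measured.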
