Question

How Should I Split My BAMs By Chromosome? For Which Operations?

1

11.4 years ago

Pierre Lindenbaum 161k

One can split BAM files by region (see "Split Bam Files By Region For Parallel Variant Calling") to speed up the processing of the BAMs.

But it cannot be that simple: if the two reads of a pair have been mapped to two distinct chromosomes, I'm afraid some operations could lose information about the pair. So I suppose I should create one extra BAM file to hold those pairs.
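A minimal sketch of that routing idea, assuming the alignments are available as SAM text lines (for example from `samtools view -h`). It relies only on the standard SAM columns: FLAG bit 0x1 marks a paired read, RNAME is column 3, and RNEXT is column 7, where `=` means "same reference as RNAME". The function names and the bucket names `interchromosomal` and `unmapped` are hypothetical, chosen for this illustration.

```python
from collections import defaultdict

# Hypothetical bucket names, used only in this illustration.
INTERCHROMOSOMAL = "interchromosomal"
UNMAPPED = "unmapped"

FLAG_PAIRED = 0x1  # SAM flag bit: the read is part of a pair


def bucket_for_record(sam_line):
    """Return the bucket (chromosome name or special bucket) for one SAM record."""
    fields = sam_line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]  # reference the read is placed on, "*" if none
    rnext = fields[6]  # reference of the mate, "=" means same as RNAME

    if rname == "*":
        return UNMAPPED
    # A paired read whose mate sits on another reference goes to the extra file.
    if flag & FLAG_PAIRED and rnext not in ("=", "*", rname):
        return INTERCHROMOSOMAL
    return rname


def split_by_chromosome(sam_lines):
    """Group SAM records by bucket; header lines (starting with '@') are skipped."""
    buckets = defaultdict(list)
    for line in sam_lines:
        if not line.strip() or line.startswith("@"):
            continue
        buckets[bucket_for_record(line)].append(line)
    return dict(buckets)


records = [
    "r1\t99\tchr1\t100\t60\t50M\t=\t300\t250\t*\t*",
    "r2\t65\tchr1\t500\t60\t50M\tchr2\t900\t0\t*\t*",
    "r2\t129\tchr2\t900\t60\t50M\tchr1\t500\t0\t*\t*",
    "r3\t4\t*\t0\t0\t*\t*\t0\t0\t*\t*",
]
for name, lines in sorted(split_by_chromosome(records).items()):
    print(name, len(lines))
```

Both reads of an inter-chromosomal pair land in the same extra bucket, so tools that need the mate can still find it there. In a real pipeline the header would have to be copied into every output file; it is skipped here to keep the sketch short.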

Which of the following operations can safely be run on one chromosome at a time?

MarkDuplicates
GATK: Indel Realignment
GATK recalibration
ValidateSamFile
FixMateInformation

Do you have any experience with splitting BAMs? Is it worth it?

bam chromosome next-gen gatk picard • 4.1k views

ADD COMMENT • link updated 11.4 years ago by Stefano Berri 4.4k • written 11.4 years ago by Pierre Lindenbaum 161k

2

I look at it this way: if the two reads of a pair are mapped to different chromosomes, then that pair would not be useful for many of the analyses anyhow. Behavior like that is likely due to errors, to some sort of structural variation in the genome, or to a combination of both. If the study is not designed to interpret structural variants, then losing some of those pairs may not matter.

ADD REPLY • link 11.4 years ago by Istvan Albert 100k

0

Can you clarify why you want to split the bams? Is I/O your limiting factor when running parallel analysis?

ADD REPLY • link 11.4 years ago by Chris Miller 22k

0

No, I'm just thinking about how I may improve the speed of analysis for our new cluster and if it has already been done by someone.

ADD REPLY • link 11.4 years ago by Pierre Lindenbaum 161k

score 2 · Answer 1 · 2012-11-16

I think indel realignment is relatively "local", so it should be fine to run it chromosome by chromosome. Recalibration needs a lot of reads to estimate the error rate, but if you have high-coverage data for a whole chromosome, that should be plenty. However, the file holding all the reads whose mates map to different chromosomes might not be very representative and/or might be enriched for "mismatches". Most of those pairs are chimeric artifacts.
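To illustrate why the number of reads matters for recalibration, here is a rough sketch that turns mismatch counts into an empirical Phred score, Q = -10 · log10(error rate), and reports the standard error of the estimated rate. This is not GATK's actual implementation; the function names and the counts are made up for the example.

```python
import math


def empirical_phred(mismatches, observations):
    """Phred-scaled quality from an observed mismatch rate: Q = -10 * log10(p)."""
    if observations <= 0 or mismatches <= 0:
        raise ValueError("need positive counts to estimate a Phred score")
    return -10.0 * math.log10(mismatches / observations)


def error_rate_standard_error(mismatches, observations):
    """Binomial standard error of the estimated mismatch rate."""
    p = mismatches / observations
    return math.sqrt(p * (1.0 - p) / observations)


# Same underlying rate (1 mismatch per 1000 bases), very different amounts of data.
for mismatches, observations in [(10, 10_000), (1_000, 1_000_000)]:
    q = empirical_phred(mismatches, observations)
    se = error_rate_standard_error(mismatches, observations)
    print(f"n={observations}: Q={q:.1f}, standard error of rate={se:.2e}")
```

With 100 times more observed bases the standard error shrinks by a factor of 10, which is why a whole high-coverage chromosome should give a stable estimate while a small file of inter-chromosomal pairs may not.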