Question: How Should I Split My BAMs By Chromosome? For Which Operations?
6.4 years ago
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum wrote:

One can split BAM files by region (see "Split Bam Files By Region For Parallel Variant Calling") to speed up the processing of the BAMs.

But it cannot be that simple: if the two reads of a pair have been mapped to two distinct chromosomes, I'm afraid some operations could lose information about the pair. So I suppose I should create one extra BAM file to hold those pairs.
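The routing idea above (one output per chromosome, plus one extra file for pairs that span two chromosomes) can be sketched as follows. This is only a minimal illustration working on SAM text lines with the standard library; the function names and the bucket labels "interchromosomal" and "unmapped" are assumptions made for the example, not part of any tool. It relies only on the SAM column layout (FLAG in column 2, RNAME in column 3, RNEXT in column 7) and on the standard flag bits 0x1 (paired) and 0x8 (mate unmapped).

```python
# Sketch only: route SAM text records to per-chromosome buckets.
# The bucket names "interchromosomal" and "unmapped" are illustrative choices.

PAIRED = 0x1         # read is part of a pair
MATE_UNMAPPED = 0x8  # mate is unmapped


def bucket_for_sam_line(line):
    """Return the name of the bucket one SAM alignment line belongs to."""
    fields = line.rstrip("\n").split("\t")
    flag = int(fields[1])
    rname = fields[2]   # chromosome of this read ("*" if none)
    rnext = fields[6]   # chromosome of the mate ("=" means same as rname)
    if rname == "*":
        return "unmapped"
    mate_is_mapped = (flag & PAIRED) and not (flag & MATE_UNMAPPED)
    if mate_is_mapped and rnext not in ("=", "*", rname):
        # Both mates are placed, but on different chromosomes.
        return "interchromosomal"
    return rname


def split_sam_lines(lines):
    """Group alignment lines by bucket, skipping '@' header lines."""
    buckets = {}
    for line in lines:
        if line.startswith("@"):
            continue
        buckets.setdefault(bucket_for_sam_line(line), []).append(line)
    return buckets
```

Because both mates of a cross-chromosome pair carry a differing RNEXT, both land in the same extra bucket, so the pair stays together. On real data one would more likely do this with samtools or a BAM library rather than on text lines.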

Which of the following operations can safely be run on one chromosome at a time?

  • MarkDuplicates
  • GATK: Indel Realignment
  • GATK recalibration
  • ValidateSamFile
  • FixMateInformation

Do you have any experience with splitting BAMs? Is it worth it?

gatk next-gen picard bam chromosome
Pierre Lindenbaum

I look at it this way: if the reads of a pair are mapped to different chromosomes, then that pair would not be useful for many of the analyses anyhow. Behavior like that is likely due to errors, to some sort of structural variation in the genome, or to a combination of both. If the study is not designed to interpret structural variants, then losing some of these pairs may not matter.
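To judge how much would actually be lost, one could first measure what fraction of properly placed paired reads have their mate on another chromosome. The sketch below is an assumption-laden illustration on SAM text lines (the function name is hypothetical); it counts only primary records where both mates are mapped, using the standard SAM flag bits.

```python
# Sketch only: estimate the share of paired reads whose mate maps
# to a different chromosome. The function name is illustrative.

def interchromosomal_fraction(lines):
    """Fraction of primary, fully mapped paired reads with mate elsewhere."""
    paired = 0
    split = 0
    for line in lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        if not flag & 0x1:
            continue  # not a paired read
        if flag & (0x4 | 0x8):
            continue  # read or mate unmapped
        if flag & (0x100 | 0x800):
            continue  # secondary or supplementary alignment
        paired += 1
        if rnext not in ("=", rname):
            split += 1
    return split / paired if paired else 0.0
```

The function accepts any iterable of SAM lines, for example the text output of `samtools view` read from standard input.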

Istvan Albert

Can you clarify why you want to split the bams? Is I/O your limiting factor when running parallel analysis?

Chris Miller

No, I'm just thinking about how I may improve the speed of analysis for our new cluster and if it has already been done by someone.

Pierre Lindenbaum
6.4 years ago
Stefano Berri
Cambridge, UK
Stefano Berri wrote:

I think indel realignment is relatively "local", so it should be fine to run chromosome by chromosome. Recalibration needs a lot of reads to estimate the error rate, but if you have high-coverage data for a whole chromosome, that should be plenty. However, the file holding all the reads whose mates map to a different chromosome might not be representative and/or might be enriched for "mismatches". Most of those pairs are chimeric artifacts.

Stefano Berri
