Question: How Should I Split My Bams By Chromosomes ? For Which Operations ?
6.4 years ago, Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote:

One can split BAM files by region (see "Split Bam Files By Region For Parallel Variant Calling") to speed up the processing of the BAMs.

But it cannot be that simple: if the two reads of a pair have been mapped to two distinct chromosomes, I'm afraid some operations could lose information about the pair. So I suppose I should create one extra BAM file to hold those pairs.
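To make the "one extra BAM" idea concrete, here is a minimal sketch of the routing rule, working on plain SAM text lines rather than on BAM. It relies only on the SAM columns RNAME (column 3) and RNEXT (column 7, where `=` means "same reference as this read"). The function names and the bucket names `interchromosomal` and `unmapped` are hypothetical, chosen for illustration; this is not how any particular tool does it.

```python
from collections import defaultdict


def bucket_for(sam_line):
    """Return the name of the output bucket for one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    rname = fields[2]  # column 3: reference (chromosome) of this read
    rnext = fields[6]  # column 7: reference of the mate; "=" means same as RNAME
    if rname == "*":
        return "unmapped"  # hypothetical bucket for reads with no placement
    if rnext not in ("=", "*", rname):
        return "interchromosomal"  # hypothetical bucket: mate is on another chromosome
    return rname


def split_by_chromosome(sam_lines):
    """Group alignment lines by bucket, skipping '@' header lines."""
    buckets = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        buckets[bucket_for(line)].append(line)
    return dict(buckets)
```

Because both reads of an inter-chromosomal pair carry an RNEXT that differs from their own RNAME, both mates end up in the same extra bucket, so the pair stays together. A real pipeline would also have to copy the header into every output file and decide what to do with secondary and supplementary alignments, which this sketch ignores.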

Which of the following operations can safely be run on one chromosome at a time:

  • MarkDuplicates
  • GATK: Indel Realignment
  • GATK recalibration
  • ValidateSamFile
  • FixMateInformation

Do you have any experience with splitting the BAMs? Is it worth it?

Tags: gatk next-gen picard bam chromosome

I look at it this way: if the two reads of a pair are mapped to different chromosomes, then that pair would not be useful for many of the analyses anyhow. Behavior like that is likely due to errors, to some sort of structural variation in the genome, or to a combination of both. If the study is not designed to interpret structural variants, then losing some of those pairs may not matter.

Comment written 6.4 years ago by Istvan Albert
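One way to judge whether losing those pairs matters is to measure how common they are before splitting. The sketch below counts, among paired reads where both mates are mapped, the fraction whose mate sits on a different chromosome. It reads SAM text lines and uses the standard FLAG bits 0x1 (paired), 0x4 (read unmapped) and 0x8 (mate unmapped); the function name is hypothetical and no threshold for "too many" is implied.

```python
def interchromosomal_fraction(sam_lines):
    """Fraction of mapped, paired reads whose mate maps to another chromosome."""
    paired = 0
    split = 0
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip header lines
        fields = line.rstrip("\n").split("\t")
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        # Keep only paired reads (0x1) where the read (0x4) and its mate (0x8) are mapped.
        if not flag & 0x1 or flag & 0x4 or flag & 0x8:
            continue
        paired += 1
        # RNEXT "=" means the mate is on the same reference as this read.
        if rnext not in ("=", "*", rname):
            split += 1
    return split / paired if paired else 0.0
```

The lines could come, for example, from the text output of `samtools view -h in.bam`. Secondary and supplementary alignments are counted like any other record here, which may slightly skew the figure.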

Can you clarify why you want to split the bams? Is I/O your limiting factor when running parallel analysis?

Comment written 6.4 years ago by Chris Miller

No, I'm just thinking about how I may improve the speed of analysis for our new cluster and if it has already been done by someone.

Comment written 6.4 years ago by Pierre Lindenbaum
6.4 years ago, Stefano Berri (Cambridge, UK) wrote:

I think indel realignment is relatively "local", so it should be fine to run it chromosome by chromosome. Recalibration needs a lot of reads to estimate the error rate, but if you have high-coverage data for a whole chromosome, that should be plenty. However, the file holding all the pairs whose reads map to different chromosomes might not be representative and/or might be enriched in "mismatches". Most of those pairs are chimeric artifacts.
