How Should I Split My BAMs By Chromosome? For Which Operations?
11.4 years ago

One can split BAM files by region (see "Split Bam Files By Region For Parallel Variant Calling") to speed up the processing of the BAMs.

But it cannot be that simple: if the two reads of a pair have been mapped to two distinct chromosomes, I'm afraid some operations could lose information about the pair. So I suppose I should create one extra BAM file to hold those pairs.
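A minimal sketch of the "one extra file" idea above, assuming the alignments are available as plain-text SAM records (for example from `samtools view -h`). The function names `bucket_for` and `split_by_chromosome` and the bucket labels `interchrom` and `unmapped` are hypothetical, chosen only for illustration. The sketch relies on the standard SAM columns RNAME (column 3) and RNEXT (column 7): a paired read whose RNEXT names another chromosome goes to the extra bucket, so both mates of such a pair end up together.

```python
from collections import defaultdict

PAIRED = 0x1  # SAM flag bit: the read is part of a pair


def bucket_for(sam_line):
    """Return the output bucket for one SAM alignment line."""
    fields = sam_line.rstrip("\n").split("\t")
    flag, rname, rnext = int(fields[1]), fields[2], fields[6]
    if rname == "*":
        # No reference assigned at all: keep these reads apart.
        return "unmapped"
    if flag & PAIRED and rnext not in ("=", "*", rname):
        # The mate sits on another chromosome: route to the extra bucket.
        return "interchrom"
    # Single-end reads and pairs on the same chromosome stay together.
    return rname


def split_by_chromosome(sam_lines):
    """Group alignment lines by bucket, skipping '@' header lines."""
    buckets = defaultdict(list)
    for line in sam_lines:
        if line.startswith("@"):
            continue
        buckets[bucket_for(line)].append(line)
    return dict(buckets)
```

In practice each bucket would be written back out with the original header so that downstream tools still see a valid file; that step is left out here.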

Which of the following operations can safely be run on a single chromosome at a time:

  • MarkDuplicates
  • GATK: Indel Realignment
  • GATK recalibration
  • ValidateSamFile
  • FixMateInformation

Do you have any experience with splitting BAMs? Is it worth it?

bam chromosome next-gen gatk picard • 4.1k views

I look at it this way: if the two reads of a pair are mapped to different chromosomes, then that pair would not be useful for many of the analyses anyhow. Behavior like that is likely due to errors, some sort of structural variation in the genome, or a combination of both. But if the study is not designed to interpret structural variants, then losing some of them may not be relevant.


Can you clarify why you want to split the bams? Is I/O your limiting factor when running parallel analysis?


No, I'm just thinking about how I may improve the speed of analysis for our new cluster and if it has already been done by someone.

11.4 years ago

I think indel realignment is relatively "local", so it should be fine chromosome by chromosome. Recalibration needs a lot of reads to estimate the error rate, but if you have high-coverage data for the whole chromosome, that should be plenty. However, the file with all the read pairs mapping to different chromosomes might not be very representative and/or might be enriched for "mismatches". Most of those pairs are chimeric artifacts.
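One rough way to check whether the inter-chromosomal pairs really are enriched for mismatches is to compare the mean edit distance of the two classes of reads. The sketch below is only an illustration under stated assumptions: it works on plain-text SAM records, it assumes the aligner wrote the optional `NM:i:` tag (edit distance to the reference), and the names `edit_distance`, `mean_nm_by_class`, `same_chrom` and `interchrom` are hypothetical.

```python
from collections import defaultdict


def edit_distance(fields):
    """Return the NM tag value of a split SAM record, or None if absent."""
    for tag in fields[11:]:  # optional tags start at column 12
        if tag.startswith("NM:i:"):
            return int(tag[5:])
    return None


def mean_nm_by_class(sam_lines):
    """Mean NM for reads whose mate is on the same vs. another chromosome."""
    totals = defaultdict(lambda: [0, 0])  # class -> [sum of NM, read count]
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        flag, rname, rnext = int(fields[1]), fields[2], fields[6]
        nm = edit_distance(fields)
        if rname == "*" or nm is None:
            continue  # unplaced read, or no NM tag to count
        split_pair = bool(flag & 0x1) and rnext not in ("=", "*", rname)
        key = "interchrom" if split_pair else "same_chrom"
        totals[key][0] += nm
        totals[key][1] += 1
    return {key: total / count for key, (total, count) in totals.items()}
```

A clearly higher mean for `interchrom` would support keeping those reads out of the recalibration input, but that threshold is a judgment call rather than something this sketch decides.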
