Question: Split Bam Files By Region For Parallel Variant Calling
3
gravatar for William
8.6 years ago by
William4.7k
Europe
William4.7k wrote:

To speed up variant calling (SNP / Indel) on a set of large BAM files I want to do the variant calling in paralell on a Sun Grid Engine cluster.

I already managed to split the BAMs by reference (chromosome) using https://github.com/pezmaster31/bamtools/ . Afterward I use vcf tools to concatenate the chromosome VCF files to one sample VCF file.

These bamfiles are still relatively big so I would like to split the BAMS further on large N polymer sequences in the reference. I already have a file with the location of these large N polymer sequences in our reference.

Is there a tool that already has the functionality to split BAM files by N polymer sequences?

Or what would be the easiest way to use this region information to split BAM files?

bam • 5.5k views
ADD COMMENTlink modified 8.6 years ago by Vivek2.5k • written 8.6 years ago by William4.7k

Is I/O your limiting factor here? if not, just place a single bam on a network-accessible disk, and use samtools to extract the reads you need. (samtools view file.bam 2:1234-5678 | variantcaller)

ADD REPLYlink modified 8.6 years ago • written 8.6 years ago by Chris Miller21k

I can't pipe the samtools output, I need to pass a bam file path to an 3rd party variant caller.

ADD REPLYlink modified 8.6 years ago • written 8.6 years ago by William4.7k

Not a direct answer but GATK 2 has support for reduced bams now so they might reduce your bam sizes for multi sample calling.

but of course if your issue is getting more chunks to parallelise then I think the samtools -L bedfile seems like a good idea

ADD REPLYlink written 8.6 years ago by Kevin640
3
gravatar for Vivek
8.6 years ago by
Vivek2.5k
Denmark
Vivek2.5k wrote:

You don't necessarily have to split the bam file to run your jobs in parallel on the grid. Variant callers like GATK unified genotyper can operate on a single chromosome of a BAM if you just specify the region like "-L chr:1-1000.."

ADD COMMENTlink written 8.6 years ago by Vivek2.5k

I thought the iontorrent modified GATK variant caller didn't support this but seems to do work with bed files. So I can use the whole bams together with bed files.

ADD REPLYlink written 8.6 years ago by William4.7k
2
gravatar for Damian Kao
8.6 years ago by
Damian Kao15k
USA
Damian Kao15k wrote:

You can use samtools: http://samtools.sourceforge.net/

In samtools view, there is a -L parameter that'll allow you to extract reads that overlap with a specified BED file:

-L FILE  output alignments overlapping the input BED FILE

Convert your N-polymer coordinates into a bed file and extract reads that only overlap those coordinates with the -L parameter. You'll have to make a bed file that only has regions without the N-polymer coordinates to extract everything else. I don't think there is an option to exclude regions in samtools.

ADD COMMENTlink modified 8.6 years ago • written 8.6 years ago by Damian Kao15k

Hi Damian,

Your post is almost 3 years old but I am currently thinking about this problem.

If you split your chromosome into 1Kb for example, what about the reads that start in the 0-1Kb chunk and finish in the 1Kb-2Kb chunks? There are in both chunks right? Isn't that a problem? Doesn't it induce errors?

ADD REPLYlink written 5.8 years ago by mathieu.bahin60

I am not sure how samtools handles the -L parameter or if it even still implements it. If samtools will grab any reads that intersects (totally or partially) the region defined the bed file, then it should be fine. The only problem is that you might get duplicated calls among consecutive region's vcfs which you can check for and take out.

ADD REPLYlink written 5.8 years ago by Damian Kao15k

Thanks for the answer.

Yes, then I want to merge my vcfs.

I was just worried that 2 vcfs could disagree about an event (like 0-1Kb chunk vcf says that position 1008 is a sequencing error and 1Kb-2Kb chunk vcf says that it is a SNP).

ADD REPLYlink written 5.8 years ago by mathieu.bahin60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1022 users visited in the last hour
_