I am a researcher in the Marth Lab at Boston College, which has developed Mosaik and GigaBayes. Over the past six months, I have completely rewritten GigaBayes to enable its efficient use on the >1k sample datasets which are presently being generated by the 1000 Genomes Project. I suggest you look into using it if you are interested in small indel calling.
The new program is called FreeBayes. In addition to a host of improvements in interface, reliability, and algorithmic flexibility, FreeBayes should provide several orders of magnitude improvement in runtime performance over GigaBayes and BamBayes. It uses the same basic population-based Bayesian framework as its predecessors to segregate true variant events from sequencing and alignment artifacts. We have provided it under a liberal open source (MIT) license. We haven't yet submitted a publication describing the work, but we would love to provide the system to users for testing.
In its simplest operation, FreeBayes uses BAM alignment file(s) and the corresponding FASTA reference sequence to generate a VCF report describing variant individuals and sites:
% freebayes -f h.sapiens.fasta -v variants.vcf NA20504.bam
(This would analyze all positions in
h.sapiens.fasta for which
NA20504.bam has coverage.)
FreeBayes now detects insertion, deletion, MNP (multi-base mismatches), and "complex" allelic variants by default.
The core insight of the algorithm is that a neutral model of the likely distribution of alleles in a population can be used to improve our detection efficiency of true events within a population of individuals. (FreeBayes models this distribution using the Ewens Sampling Formula.) Thus, whenever possible, multiple samples from the same or closely related species should be used. Where such information is not available, the reference may be counted as an additional sample using the
To my knowledge, FreeBayes is significantly different than other variant detection systems in common use in that it is not limited to the analysis of haploid or diploid individuals. The assumed ploidy or copy number of the samples is not fixed, and can be set to any number (via the
--ploidy flag). This enables the extension of the algorithm to variant calling in species with more than 2 copies of each locus. Additionally, the results of pooled sequencing experiments may be analyzed by setting ploidy equal to the number of alleles per site in the pooled population. I plan to enable sequence and region-specific configuration of copy number in the very near future, as this is directly applicable to variant calling in the 1000 Genomes.
Another useful feature is that FreeBayes can read BAM on its standard input. This allows the application of custom input filters or base-quality adjustment methods to alignments without requiring that the alignment files be rewritten. For instance, this command will apply samtools BAQ adjustment to aln.bam and then call SNPs, writing VCF output to standard output:
% samtools fillmd -br aln.bam | freebayes -f reference.fasta
As a core component in the NCBI variant calling pipeline, FreeBayes is currently under very active development. Please contact me with any questions, feature requests, or bug reports. My email is listed on the FreeBayes github page.
I'd also love to collaborate with anyone working on interesting variant detection problems!