Question: What Methods Do You Use For In/Del/Snp Calling?
7.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum (99k) wrote:

After you've mapped your short reads to a reference sequence, what is your favorite method/workflow to detect (new) SNPs, insertions, and deletions? Have you compared various strategies?

On my side, I have played with MAQ/SAM/BWA, but I wonder if there is a widely adopted (robust) workflow to find new SNPs.

Thanks, Pierre

short aligner snp sequencing • 22k views
modified 4.0 years ago by Biostar ♦♦ 20 • written 7.5 years ago by Pierre Lindenbaum (99k)
7.0 years ago by
Somerville, MA
Erik Garrison (2.0k) wrote:

I am a researcher in the Marth Lab at Boston College, which has developed Mosaik and GigaBayes. Over the past six months, I have completely rewritten GigaBayes to enable its efficient use on the >1k sample datasets which are presently being generated by the 1000 Genomes Project. I suggest you look into using it if you are interested in small indel calling.

The new program is called FreeBayes. In addition to a host of improvements in interface, reliability, and algorithmic flexibility, FreeBayes should provide several orders of magnitude improvement in runtime performance over GigaBayes and BamBayes. It uses the same basic population-based Bayesian framework as its predecessors to segregate true variant events from sequencing and alignment artifacts. We have provided it under a liberal open source (MIT) license. We haven't yet submitted a publication describing the work, but we would love to provide the system to users for testing.

In its simplest operation, FreeBayes uses BAM alignment file(s) and the corresponding FASTA reference sequence to generate a VCF report describing variant individuals and sites:

% freebayes -f h.sapiens.fasta -v variants.vcf NA20504.bam

(This would analyze all positions in h.sapiens.fasta for which NA20504.bam has coverage.)

FreeBayes now detects insertion, deletion, MNP (multi-base mismatches), and "complex" allelic variants by default.

The core insight of the algorithm is that a neutral model of the likely distribution of alleles in a population can be used to improve our detection efficiency of true events within a population of individuals. (FreeBayes models this distribution using the Ewens Sampling Formula.) Thus, whenever possible, multiple samples from the same or closely related species should be used. Where such information is not available, the reference may be counted as an additional sample using the --use-reference-allele flag.
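To make the Ewens Sampling Formula concrete, here is a minimal Python sketch that evaluates it for a given partition of allele counts. It is only an illustration of the formula itself, not FreeBayes code; the function name `ewens_probability` and the value of `theta` are assumptions chosen for the example.

```python
from collections import Counter
from math import exp, lgamma, log


def ewens_probability(allele_counts, theta):
    """Probability of an allele-count partition under the Ewens Sampling Formula.

    allele_counts: copies observed of each distinct allele, e.g. [198, 2]
    theta: scaled mutation rate (an assumed value, not a FreeBayes default)
    """
    n = sum(allele_counts)
    # a[j] = number of distinct alleles that were observed exactly j times
    a = Counter(allele_counts)
    # log of n! / (theta * (theta + 1) * ... * (theta + n - 1)),
    # computed with lgamma so that large n does not overflow
    log_p = lgamma(n + 1) + lgamma(theta) - lgamma(theta + n)
    for j, a_j in a.items():
        # log of theta^a_j / (j^a_j * a_j!)
        log_p += a_j * log(theta) - a_j * log(j) - lgamma(a_j + 1)
    return exp(log_p)


if __name__ == "__main__":
    theta = 0.001  # hypothetical value for illustration
    # 200 sampled chromosomes: all reference vs. two copies of a second allele
    print(ewens_probability([200], theta))
    print(ewens_probability([198, 2], theta))
```

With a small `theta`, the monomorphic partition receives most of the prior mass, which is the sense in which a neutral model penalizes spurious variant alleles.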

To my knowledge, FreeBayes differs significantly from other variant detection systems in common use in that it is not limited to the analysis of haploid or diploid individuals. The assumed ploidy or copy number of the samples is not fixed, and can be set to any number (via the --ploidy flag). This enables the extension of the algorithm to variant calling in species with more than 2 copies of each locus. Additionally, the results of pooled sequencing experiments may be analyzed by setting ploidy equal to the number of alleles per site in the pooled population. I plan to enable sequence- and region-specific configuration of copy number in the very near future, as this is directly applicable to variant calling in the 1000 Genomes.
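As a rough illustration of what a ploidy setting implies, the following Python sketch enumerates the unordered genotypes that are possible at a site for a given ploidy, and computes the ploidy one would use for a pool. The helper names are hypothetical and the code says nothing about how FreeBayes enumerates genotypes internally.

```python
from itertools import combinations_with_replacement
from math import comb


def possible_genotypes(alleles, ploidy):
    """All unordered genotypes of the given ploidy over the given alleles."""
    return list(combinations_with_replacement(alleles, ploidy))


def pooled_ploidy(individuals, ploidy_per_individual):
    """Alleles per site in a pool: the value described above for pooled data."""
    return individuals * ploidy_per_individual


if __name__ == "__main__":
    # Diploid, two alleles: AA, AG, GG
    print(possible_genotypes(["A", "G"], 2))
    # Tetraploid, two alleles: five possible dosage classes
    print(len(possible_genotypes(["A", "G"], 4)))
    # A pool of 10 diploid individuals carries 20 alleles per site
    ploidy = pooled_ploidy(10, 2)
    # The count matches the closed form C(ploidy + k - 1, ploidy) for k alleles
    print(len(possible_genotypes(["A", "G"], ploidy)), comb(ploidy + 2 - 1, ploidy))
```

The number of genotypes grows quickly with ploidy and allele count, which is worth keeping in mind when setting a large ploidy for pooled samples.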

Another useful feature is that FreeBayes can read BAM on its standard input. This allows the application of custom input filters or base-quality adjustment methods to alignments without requiring that the alignment files be rewritten. For instance, this command will apply samtools BAQ adjustment to aln.bam and then call SNPs, writing VCF output to standard output:

% samtools fillmd -br aln.bam reference.fasta | freebayes -f reference.fasta --stdin

As a core component in the NCBI variant calling pipeline, FreeBayes is currently under very active development. Please contact me with any questions, feature requests, or bug reports. My email is listed on the FreeBayes github page.

I'd also love to collaborate with anyone working on interesting variant detection problems!

modified 6.2 years ago • written 7.0 years ago by Erik Garrison (2.0k)

The prior is a composition of both the ESF, which estimates the probability of a given collection of alleles from a population under neutral selection, and the probability of sampling a given set of genotypes in a diploid population. The second term strongly differentiates between case (a) and (b) which you describe.
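The effect of the second term can be sketched for the two scenarios raised in this thread (99 reference samples plus 1 homozygous non-reference sample, versus 98 reference samples plus 2 heterozygotes). Both have the same allele counts (198 reference, 2 non-reference), so an allele-level term alone cannot separate them. The Python below assumes the pooled alleles are dealt at random to labelled samples; it illustrates the idea and is not taken from the FreeBayes source.

```python
from collections import Counter
from math import factorial


def multinomial(counts):
    """Multinomial coefficient (sum of counts)! / (c1! * c2! * ...)."""
    total = factorial(sum(counts))
    for c in counts:
        total //= factorial(c)
    return total


def genotype_sampling_probability(genotypes):
    """P(this set of labelled genotypes | pooled allele counts).

    Assumes alleles are assigned to samples uniformly at random
    (an assumption of this sketch).
    """
    allele_totals = Counter()
    ways = 1
    for genotype in genotypes:
        allele_totals.update(genotype)
        # Number of allele orderings that give this unordered genotype
        ways *= multinomial(list(Counter(genotype).values()))
    return ways / multinomial(list(allele_totals.values()))


if __name__ == "__main__":
    case_a = [("R", "R")] * 99 + [("A", "A")]      # one homozygous non-reference
    case_b = [("R", "R")] * 98 + [("R", "A")] * 2  # two heterozygotes
    print(genotype_sampling_probability(case_a))
    print(genotype_sampling_probability(case_b))
```

Under this assumption, a specific pair of heterozygotes is more probable than a specific homozygous non-reference sample even though the allele counts are identical.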

written 7.0 years ago by Erik Garrison (2.0k)

Let me be more specific. Suppose we have the following two scenarios: a) 99 samples are identical to the reference and 1 sample is homozygous non-reference; b) 98 samples are identical to the reference and 2 samples are heterozygous. The likelihoods of the two cases are different. How do you distinguish these two cases with the Ewens sampling formula? Thanks.

written 7.0 years ago by lh3 (30k)

Does FreeBayes treat the samples as if they came from pooled sequencing?

written 7.0 years ago by lh3 (30k)

No! It only treats samples as a pool / the same if they are tagged in the BAM file as the same sample. It outputs a column of VCF format (a sample record) for each sample specified in the header of the BAM file, unless a specific limiting list of samples is provided in a file (--samples).
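Since sample identity comes from the BAM header, it can help to check which read groups share a sample name before calling. The sketch below parses header text (for example, the output of `samtools view -H`) and groups read-group IDs by their SM tag. The function name and the example header are made up for illustration.

```python
def samples_from_header(header_text):
    """Map each sample name (SM tag) to the read-group IDs (ID tag) that carry it."""
    samples = {}
    for line in header_text.splitlines():
        if not line.startswith("@RG"):
            continue
        # @RG fields are tab-separated TAG:VALUE pairs
        fields = dict(
            field.split(":", 1) for field in line.split("\t")[1:] if ":" in field
        )
        if "SM" in fields:
            samples.setdefault(fields["SM"], []).append(fields.get("ID", ""))
    return samples


if __name__ == "__main__":
    # Hypothetical header: two lanes of one sample plus a second sample
    header = (
        "@HD\tVN:1.0\tSO:coordinate\n"
        "@RG\tID:lane1\tSM:NA20504\tPL:ILLUMINA\n"
        "@RG\tID:lane2\tSM:NA20504\tPL:ILLUMINA\n"
        "@RG\tID:lane3\tSM:NA12878\tPL:ILLUMINA\n"
    )
    for sample, read_groups in samples_from_header(header).items():
        print(sample, read_groups)
```

Read groups that share an SM value would be treated as one sample, so in this example one would expect two sample columns in the VCF.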

written 7.0 years ago by Erik Garrison (2.0k)

I think I mainly want to ask how you make use of diploid information when you apply the Ewens sampling formula. The sampling itself does not take ploidy into account.

written 7.0 years ago by lh3 (30k)

I see. Thanks for the explanation.

written 7.0 years ago by lh3 (30k)

You're welcome :)

written 7.0 years ago by Erik Garrison (2.0k)

I'm looking into variant callers for viral population studies and was wondering whether FreeBayes is suitable for this purpose. It seems that the user must specify a ploidy value, but this is obviously unknown in a viral population. Are there suggested values to start with? Does changing the ploidy value affect the statistics that are calculated? Any insight would be appreciated.

written 5.0 years ago by jgbaum140

Yes, it is suitable for viral population studies. I've made some changes recently which are designed to enable frequency-based pooled sequencing. See for an overview. Please email the freebayes list or query in biostars if you have questions, as this is not the best context to answer your questions.

written 4.7 years ago by Erik Garrison (2.0k)

Hi Erik. I have a variant calling application which I'm thinking Freebayes will be perfect for but just looking for some advice on best usage.

The experiment is as follows: RNA-Seq and de novo transcriptome assembly of a fungus in its dikaryon stage. In the dikaryon stage you essentially have two genetically distinct individuals which are both haploid, i.e. two genetically distinct haploid nuclei in the same mycelium. We have two fungal samples and 2 alignments (aligned against the transcriptome assembly, which was built after combining all reads from the two samples). So in those 2 samples we have sequenced 4 haploid genomes.

I'm thinking FreeBayes should be the way forward for calling variants in these samples because (i) you can use the reference as an extra sample, and we only have the 2 samples, and (ii) you can set ploidy, and there is this 'pooled sequencing' setting which I don't properly understand but which I think may be what we need here.

Is there a combination of the ploidy and pooled sequencing settings which would cover our situation - like ploidy set to 2 and set the --pooled-discrete flag, while providing both samples as input and setting -Z to use the reference as a third sample?

written 4.1 years ago by gareth.linsmith0

Hi Gareth, please write me an email if you are still in need of advice. I'm sorry I missed this. As mentioned before, the best place for advice is the freebayes list!

written 3.9 years ago by Erik Garrison (2.0k)
7.0 years ago by
United States
lh3 (30k) wrote:

For cancer, SNVMix2 is probably better as it considers the specific issues in cancer resequencing. For other resequencing, both GATK and samtools (especially with the recent improvement) are good. For indels, Dindel. GATK is reimplementing Dindel, too. In addition, whatever SNP caller you use, remember to apply BAQ (the last example). GATK/FreeBayes/SAMtools calls will all be significantly improved. (I think Erik would also agree.)

written 7.0 years ago by lh3 (30k)

I didn't mention this, but FreeBayes can read BAM input on stdin. This allows the application of BAQ without having to rewrite the input files prior to SNP calling. This benefits from a fast BAQ algorithm :). (I'll edit my response to note this feature. At some point I need to update the README to reflect all these changes.)

written 7.0 years ago by Erik Garrison (2.0k)

Also, I agree that the base quality adjustment goes a long way toward improving the quality of variant detection.

written 7.0 years ago by Erik Garrison (2.0k)

So you mean that now we do not need to do BAQ before freebayes snp calling?

written 3.1 years ago by Chen560
7.5 years ago by
Boston, MA
Brad Chapman (9.1k) wrote:

Mosaik for alignment and GigaBayes(PbShort) has worked well for me in the past:

For my next SNP project I will be using Broad's Genome Analysis Toolkit (GATK):

written 7.5 years ago by Brad Chapman (9.1k)

GATK seems promising. Does the unified genotyper consider indels in the likelihood model? Finding variation is pretty straightforward, but calling it a SNP is still quite tricky.

written 7.5 years ago by Jarretinha3.2k

As noted elsewhere on BioStar, FreeBayes is the successor to GigaBayes. GigaBayes is no longer actively maintained.

written 6.2 years ago by Erik Garrison (2.0k)
7.5 years ago by
Louis Letourneau (790) wrote:

We use samtools pileup+varfilter with SNVmix to filter 'valid' SNPs (it all depends on coverage and experiment).

written 7.5 years ago by Louis Letourneau (790)
7.0 years ago by
Danville, PA
Rm (7.5k) wrote:

I use bowtie/BWA --> Samtools pileup > Varfilter

written 7.0 years ago by Rm (7.5k)
5.7 years ago by
Washington University School of Medicine, St. Louis, USA
Malachi Griffith (16k) wrote:

We use SomaticSniper and VarScan. In addition to SNVMix2, FreeBayes, and GATK mentioned in other answers I would also add SOAPindel, SOAPsnv, SOAPsnp, Atlas2, SNVer, TREAT, and SeqEM.

Some are geared towards SNPs, others towards SNVs or indels.

written 5.7 years ago by Malachi Griffith (16k)

Hello, I am a novice with SomaticSniper. This is the command line I used to call SNPs that differ between a paired cancer and normal sample: bam-somaticsniper -q 1 -Q 40 -f ucsc.hg19.fasta ERR031023.bam ERR031024.bam ERR031024.snp.vcf. Unfortunately, I got a large number of machine artifacts. Could you share your SomaticSniper command line? Any suggestions are appreciated.

written 4.7 years ago by jiagehao10
6.5 years ago by
United States
Lhl (720) wrote:

Have you compared the difference between variants called by different variants callers?

written 6.5 years ago by Lhl (720)

I think you had better open a new thread for this question to improve your chances of getting detailed answers.

written 6.5 years ago by toni2.1k
6.2 years ago by
Boston, MA USA
Larry_Parnell (16k) wrote:

Don Conrad describes here tools and approaches he uses to identify CNVs from SNP genotype data. The program looks for stretches of homozygous genotypes interspersed with Mendelian errors, which might indicate the transmission of a large deletion. Details are at the above link.
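To show what a Mendelian error looks like in this context, here is a small Python sketch that checks whether a child's genotype at a biallelic SNP can be formed from one allele of each parent. It is a generic illustration, not the program described above, and the function name is hypothetical.

```python
from itertools import product


def is_mendelian_error(child, mother, father):
    """True if the child's genotype cannot be built from one allele per parent."""
    possible = {tuple(sorted(pair)) for pair in product(mother, father)}
    return tuple(sorted(child)) not in possible


if __name__ == "__main__":
    # A father carrying a deletion is hemizygous (A/-) but is genotyped as A/A.
    # A child who inherits the deletion is B/- and is genotyped as B/B,
    # which looks impossible given an A/A father.
    print(is_mendelian_error(("B", "B"), ("B", "B"), ("A", "A")))
    # A consistent trio for comparison
    print(is_mendelian_error(("A", "B"), ("A", "A"), ("B", "B")))
```

A run of apparently homozygous calls containing several such errors is the kind of signal that could point to a transmitted deletion.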

written 6.2 years ago by Larry_Parnell (16k)