Question: How to filter SNPs after Freebayes, if at all, in a fungal haploid genome alignment?
gravatar for Lina F
2.5 years ago by
Lina F160
Boston, MA
Lina F160 wrote:

Hi all,

I aligned Illumina reads of a haploid fungal genome against a reference sequence and then called SNPs with Freebayes:

/home/ubuntu/freebayes/bin/freebayes -f reference.fasta alignment.bam --ploidy=1 --min-alternate-qsum 30 -F 0.05 > snps.vcf

I then followed this tutorial to filter the output of freebayes further using these commands with vcffilter:

  • -f “SRP > 20” (Strand balance probability for the reference allele)
  • -f “SAP > 20” (Strand balance probability for the alternate allele)
  • -f “EPP > 20” (End Placement Probability)
  • -f “QUAL > 30” (phred scaled variant quality)
  • -f “DP > 100” (depth)

This reduced the SNP count from about 1,700 to about 15.

My questions are as follows:

  1. Is there a set of filtering techniques you start with that you then tweak for different applications?
  2. This filtering strategy is obviously somewhat stringent, since the SNP calls were significantly reduced -- is this too aggressive?

Thanks for any advice!

filtering snp freebayes • 2.0k views
ADD COMMENTlink written 2.5 years ago by Lina F160

That doesn't really make much sense to me... why are you requiring depth > 100, which seems very extreme? Do you even have 100x average coverage? Also, why "-F 0.05" - what exactly are you looking for? I assume that flag tells Freebayes to call variations down to 5% allele frequency.

I'd look at your bam in IGV and see if the variations getting filtered out look real or not.

ADD REPLYlink modified 2.5 years ago • written 2.5 years ago by Brian Bushnell17k

I have >100X coverage, so I should be ok on that front. I will take a look at IGV.

Ultimately, I would like to script this pipeline so that I don't have to look at all genomes manually but maybe there is no way around that.

ADD REPLYlink written 2.5 years ago by Lina F160

Coverage is pretty variable so even if the average is, say, 300x I still don't see a good reason to filter out variants with coverage under 100x. Maybe 10x.

The goal of looking at it manually in this case is not to manually curate all genomes you process, just to find settings that work. Once you identify which variations look real to you that are being removed, you can figure out which filter they are failing and adjust that parameter. If you are looking for germline variations in a haploid, your initial frequency cutoff of 0.05 should probably be increased a lot, perhaps to 0.2, to avoid false positives.

ADD REPLYlink written 2.5 years ago by Brian Bushnell17k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 824 users visited in the last hour