Question: How to filter SNPs after Freebayes, if at all, in a fungal haploid genome alignment?
0
gravatar for Lina F
23 months ago by
Lina F160
Boston, MA
Lina F160 wrote:

Hi all,

I aligned Illumina reads of a haploid fungal genome against a reference sequence and then called SNPs with Freebayes:

/home/ubuntu/freebayes/bin/freebayes -f reference.fasta alignment.bam --ploidy=1 --min-alternate-qsum 30 -F 0.05 > snps.vcf

I then followed this tutorial to filter the output of freebayes further using these commands with vcffilter:

  • -f “SRP > 20” (Strand balance probability for the reference allele)
  • -f “SAP > 20” (Strand balance probability for the alternate allele)
  • -f “EPP > 20” (End Placement Probability)
  • -f “QUAL > 30” (phred scaled variant quality)
  • -f “DP > 100” (depth)

This reduced the SNP count from about 1,700 to about 15.

My questions are as follows:

  1. Is there a set of filtering techniques you start with that you then tweak for different applications?
  2. This filtering strategy is obviously somewhat stringent, since the SNP calls were significantly reduced -- is this too aggressive?

Thanks for any advice!

filtering snp freebayes • 1.4k views
ADD COMMENTlink written 23 months ago by Lina F160
1

That doesn't really make much sense to me... why are you requiring depth > 100, which seems very extreme? Do you even have 100x average coverage? Also, why "-F 0.05" - what exactly are you looking for? I assume that flag tells Freebayes to call variations down to 5% allele frequency.

I'd look at your bam in IGV and see if the variations getting filtered out look real or not.

ADD REPLYlink modified 23 months ago • written 23 months ago by Brian Bushnell16k

I have >100X coverage, so I should be ok on that front. I will take a look at IGV.

Ultimately, I would like to script this pipeline so that I don't have to look at all genomes manually but maybe there is no way around that.

ADD REPLYlink written 23 months ago by Lina F160

Coverage is pretty variable so even if the average is, say, 300x I still don't see a good reason to filter out variants with coverage under 100x. Maybe 10x.

The goal of looking at it manually in this case is not to manually curate all genomes you process, just to find settings that work. Once you identify which variations look real to you that are being removed, you can figure out which filter they are failing and adjust that parameter. If you are looking for germline variations in a haploid, your initial frequency cutoff of 0.05 should probably be increased a lot, perhaps to 0.2, to avoid false positives.

ADD REPLYlink written 23 months ago by Brian Bushnell16k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 973 users visited in the last hour