Question: Detecting known, low frequency SNPs in bacterial populations
gravatar for jleehan
5 months ago by
jleehan0 wrote:

Hi all,

I'm using NGS data to observe how a community of congenic bacteria evolve over time. Basically, I'm starting by inputting several different alleles of one gene into the same medium and seeing how that population changes over time by sequencing the DNA of that gene.

As far as I can tell, all the SNP calling pipelines appear to be more for genome wide discovery than something like what I am doing.

Currently, I am using VarScan and LoFreq to determine the allele frequencies in this community. The agreement between the two pipelines is decent, but as of now VarScan doesn't seem to be able to detect samples that are present in less than 20% of the population (despite what was reported in this paper and all the other programs are reported to be even worse) and I would like to be able to use at least two different programs to verify the presence of these alleles.

Does anyone know of a better low frequency SNP calling pipeline or perhaps a program that allows you to call SNPs based on known alleles (this would actually be the ideal, but I haven't seen anything like this anywhere that I've read)?


P.S. I'm relatively new to NGS work so please bear with me if I've made some egregious oversight here.

snp next-gen • 206 views
ADD COMMENTlink modified 5 months ago by Chris Miller21k • written 5 months ago by jleehan0

What parameters did you use for VarScan?

ADD REPLYlink written 5 months ago by Asaf6.1k

I used the most basic parameters as I have never used VarScan before:

VarScan mpileup2snp $i -o ${SAMPLE}.snp

That's the format I'm using

ADD REPLYlink written 5 months ago by jleehan0

Weird, it should report variants with > 0.01 frequency. On the other hand, it requires at least 2 variant reads out of at least 8 reads that cover the position so if the coverage is low you might lose it. Another option to check is the --strand-filter which might filter out low count reads even further. Take a look at the parameters:

ADD REPLYlink written 5 months ago by Asaf6.1k

It should be fairly high coverage, considering I am only sequencing a 505 and 564 bp region and each multiplex is giving me about 1.7M reads, but I will check look into the parameters more

ADD REPLYlink written 5 months ago by jleehan0
gravatar for Chris Miller
5 months ago by
Chris Miller21k
Washington University in St. Louis, MO
Chris Miller21k wrote:
  • As noted above, the parameters for varscan matter a lot. Relaxing them substantially will probably be helpful
  • If you have known positions, you can just use bam-readcount to query those directly, then apply whatever post-hoc filtering or statistics you prefer.
ADD COMMENTlink written 5 months ago by Chris Miller21k

Hi Chris, thanks for your input!

I haven't heard of this bam-readcount program before, but it sounds very exciting. I can't seem to find online what the exact output of that program would be. Do you happen to know how that would fit into a VarScan pipeline? My current pipeline (not including qc steps) is: bcl2fastq > bowtie2 > samtools index > view > sort > mpileup > VarScan

Essentially I'm asking: what would I need to do with whatever output bam-readcount provides to continue along my pipeline?

ADD REPLYlink written 5 months ago by jleehan0

The output is specified in the readme here: It gives you counts of each base at that position, along with some extra qc information (on quality, strand bias, etc)

You will have to look at those numbers and formulate a way to set a minimum level of detection that is sensible given your experimental design. (is one variant read enough? If not, how many? what's the error rate? can you combine multiple SNP sites to increase power, etc)

ADD REPLYlink written 5 months ago by Chris Miller21k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 546 users visited in the last hour