Question: Best Practice On Variant Discovery For Bacteria?
8
gravatar for lh3
6.7 years ago by
lh331k
United States
lh331k wrote:

I am mostly working with primate data, but I doubt if the best practice for such data is still the best for bacteria given higher divergence between bacterial strains. Here are two possible ways to do SNP/INDEL discovery for bacteria:

  1. Just like variant discovery for mammals, we map reads to the reference strain with a standard mapper and run GATK/samtools/in-house tools to find variants. If this is still the case, which mappers and SNP callers do you use? How do you change the default settings?

  2. Do de novo assembly first and then align contigs to the reference strain to find variants. If this is the case, which assembler (velvet, soapdenovo or abyss?) and aligner (mummer?) do you use?

  3. There much be other ways of analyzing data I do not know...

• 9.4k views
ADD COMMENTlink modified 3.4 years ago by moorem220 • written 6.7 years ago by lh331k
1
  1. BWA/MAQ for sure. all the other aligner/mapper are having some sort of dishonest to the truth.(Sorry to say that, but that is what I found after my in house test). Even though, in some specical cases, bwa still has incorrect mappings. But it is the best in current(speed, space, and correctness) since maq is died. If you pay attention on read quality filtering and reference self similarity reduction, it would be greatly increasing your correctness. Dont let the trash reads mess up entire analysis.
  2. I dont recommend the assembly-snp detection line on Bacteria. If you have to, try to hash based assembler , like Mira. it will generate a consensus result. Velvet with its de-brujin Graph datastructure, could loss some important informations like discarding the SNPs.
  3. Sorry I dont used any others and I dont know either. Try apply math into it rather than others tools. because they have all some sort of included their own pre-assumtpions already which you dont know exactly. (especially skip those fancy GUI designed ones)
ADD REPLYlink written 6.7 years ago by jingtao09110
1

MIRA will also give you strain sequences, use convert_project for that

ADD REPLYlink written 6.7 years ago by Bach540
2
gravatar for geoffreycolburn
6.7 years ago by
geoffreycolburn50 wrote:

I work on Breseq, http://code.google.com/p/breseq/, and would suggest you give it a try. It'll map with SSAHA2.

The big name mappers are BWA and Bowtie. BWA + samtools is fine for SNP and small( < 5bp) INDELS.

ADD COMMENTlink modified 6.7 years ago by Istvan Albert ♦♦ 79k • written 6.7 years ago by geoffreycolburn50

But breseq doesn't support SNPcalling between sequenced case and control strains

ADD REPLYlink written 5.6 years ago by kevinchjp10
2
gravatar for zam.iqbal.genome
6.2 years ago by
United Kingdom
zam.iqbal.genome1.7k wrote:

This is a bit late, I postponed replying to this til our latest paper came out, and then I couldn't find this question anymore. Anyway - this may be of interest: we just published this paper

High-throughput microbial population genomics using the Cortex variation assembler. Z Iqbal, I Turner, G McVean, Bioinformatics 2012; doi: 10.1093/bioinformatics/bts673

http://bioinformatics.oxfordjournals.org/content/early/2012/11/19/bioinformatics.bts673.full.pdf+html

which shows how you can use direct de novo assembly of multiple samples to discover variants. Since bacterial genomes tend to be sequenced to higher depth than vertebrates, and since the genomes are relatively unrepetitive, assembly has comparable power to mapping, but has better specificity for indels and structural variants. Cortex now has a wrapper/framework that allows you to call at multiple kmers. The paper gives a couple of examples, reproducing the results (or comparably good results) for two published studies, one with long (Sanger) reads and one with Illumina reads.

ADD COMMENTlink modified 6.2 years ago • written 6.2 years ago by zam.iqbal.genome1.7k

It is not true that bacteria are "relatively unrepetitive". There are a lot of species which are full of repeats such as copies of insertion sequences etc. For example, Mycobacterium ulcerans has >200 copies of IS2404 and >600 copies of IS2606, each of which is >1500bp. Tandem repeats are common too eg. CRISPR. Sure, they're not as bad as ALU repeats in human etc, but the blanket statement that they aren't repetitive isn't quite true :)

ADD REPLYlink written 5.6 years ago by Torst900

FIne, I stand somewhat corrected. The M ulcerans example is quite surprising to me - thats 20% of its genome! However, I do think on a global scale, where we have trees, wheat, barley, vertebrates, lamprey, lungfish, and bacteria, bacteria are at the less repetivie end of the spectrum.

ADD REPLYlink modified 5.6 years ago • written 5.6 years ago by zam.iqbal.genome1.7k

I sound very petty here! Sorry Torsten.

ADD REPLYlink written 5.6 years ago by zam.iqbal.genome1.7k
2
gravatar for Raygozak
6.2 years ago by
Raygozak1.3k
State College, PA, Penn State
Raygozak1.3k wrote:

Another important tool for haploid variant calling is freebayes (http://bioinformatics.bc.edu/marthlab/FreeBayes), it actually is built to support any ploidy from your organism but makes it easier than samtools mpileup - bcftools, since samtools is tailored for diploid organisms, and while you may be able to process it further, freebayes works like a charm.

ADD COMMENTlink written 6.2 years ago by Raygozak1.3k

I have explained several times: to call from a single sample, a diploid caller is the way to go. This strategy is always better. I have not seen a single counter evidence so far.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by lh331k
1

Could you please address me to longer explanation you might write some time ago? Thanks a lot.

ADD REPLYlink written 6.2 years ago by Biomonika (Noolean)3.0k
1

If a diploid caller is better than a haploid caller from single-sample haploid data, would a polyploid caller not be even better?

ADD REPLYlink written 6.2 years ago by Erik Garrison2.1k
0
gravatar for swbarnes2
6.7 years ago by
swbarnes25.0k
United States
swbarnes25.0k wrote:

I use bwa/samtools, it seems to work fine, and I use velvet, or sometims edena, when I want to make a de novo assembly, but with the read coverage I typically get, it yields a whole lot of pieces.

GATK requires a list of known variants, which is kind of weird when your goal is to look for variants you don't know about.

ADD COMMENTlink written 6.7 years ago by swbarnes25.0k
0
gravatar for liz.batty
6.7 years ago by
liz.batty30
liz.batty30 wrote:

For reference-based variant calling, we map with stampy without bwa pre-mapping and the substitution rate set to 0.01, and call variants using an in-house mix of samtools, picard, and GATK. We assemble with velvet, and sometimes order the contigs into a new reference and map to that with the same tools as the standard references, or we've used Cortex to do de-novo variant calling with some succes to get SNPs and short indels.

ADD COMMENTlink written 6.7 years ago by liz.batty30
0
gravatar for rst
6.2 years ago by
rst20
rst20 wrote:

If you go the de novo route I would suggest looking at EDENA as an assembler. The false positive rates, at least in my experience are greatly reduced which simplifies things downstream. I have noticed that it does require slightly higher coverage >100-150x to work really really well which usually isn't a problem when resequencing prokaryotes.

ADD COMMENTlink written 6.2 years ago by rst20
0
gravatar for moorem
3.4 years ago by
moorem220
United Kingdom
moorem220 wrote:

Use bwa and GATK for sure! 

https://approachedinthelimit.wordpress.com/2015/10/09/variant-calling-with-gatk/

ADD COMMENTlink written 3.4 years ago by moorem220
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1370 users visited in the last hour