Question

Whole-genome alignment of two or more bacterial genomes - find structural variants

0

6.0 years ago

Tim ▴ 130

Hello,

Let's say I have two or more complete bacterial genome sequences produced by Sanger sequencing and/or Nanopore/PacBio (no Illumina reads). The genomes in question are 95-99% identical at the nucleotide level. What would be the best way to align these genomes (with a pairwise or multiple alignment) and identify:

Short variants: single and multiple nucleotide variations (SNP/MNP), indels
Long variants: longer deletions and insertions, inversions, duplications, translocations and so on

I am aware of MUMmer, Mauve and Mugsy. What other programs should I check? It would be great if they could produce a VCF file as well.

Thanks.

SNP structural variant whole-genome alignment • 5.6k views

ADD COMMENT • link updated 2.8 years ago by penguin • 0 • written 6.0 years ago by Tim ▴ 130

1

6.0 years ago

Tm ★ 1.1k

For structural variant analysis, you can try Assemblytics, which takes the .delta file generated by NUCmer (NUCleotide MUMmer) as input.
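For anyone curious what that .delta input contains, here is a minimal Python sketch. It assumes the usual NUCmer layout: two header lines, `>ref qry ref_len qry_len` records, seven-field alignment headers, and reverse-strand hits written with the query start greater than the query end. The names `Alignment` and `parse_delta` are made up for illustration, and this is not how Assemblytics works internally; it only lists alignments and flags reverse-strand ones as possible inversion candidates.

```python
from dataclasses import dataclass


@dataclass
class Alignment:
    ref: str
    qry: str
    ref_start: int
    ref_end: int
    qry_start: int
    qry_end: int

    @property
    def reverse(self) -> bool:
        # Assumption: reverse-strand matches have start > end on the query.
        return self.qry_start > self.qry_end


def parse_delta(text: str) -> list[Alignment]:
    """Collect alignment headers from the text of a NUCmer .delta file."""
    alignments = []
    ref = qry = None
    # Skip the two header lines (input paths, then the "NUCMER" tag).
    for line in text.splitlines()[2:]:
        fields = line.split()
        if line.startswith(">"):
            # Sequence-pair record: >ref_id qry_id ref_len qry_len
            ref, qry = fields[0][1:], fields[1]
        elif len(fields) == 7:
            # Alignment header: s1 e1 s2 e2 errors sim_errors stops
            s1, e1, s2, e2 = map(int, fields[:4])
            alignments.append(Alignment(ref, qry, s1, e1, s2, e2))
        # Single-integer lines are indel offsets inside an alignment; ignored here.
    return alignments
```

A reverse-strand alignment sitting between forward-strand neighbours on the same sequence pair is only a hint of an inversion; a dedicated tool is still needed to call it properly.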

ADD COMMENT • link 6.0 years ago by Tm ★ 1.1k

1

@toral

What about short variants? How would a genome aligner be able to identify SNPs?

ADD REPLY • link 6.0 years ago by naive_user ▴ 80

0

In my opinion, the best way to identify short variants is to map reads from one bacterium to the reference genome/scaffolds of another using a samtools/GATK pipeline.

ADD REPLY • link 6.0 years ago by Tm ★ 1.1k

1

Well, that does not answer my question. Anyway, since the OP has long-read data, it's fairly difficult to identify SNPs given the relatively high sequencing error rate.

ADD REPLY • link 6.0 years ago by naive_user ▴ 80

0

Olson ND, Lund SP, Colman RE, et al. Best practices for evaluating single nucleotide variant calling methods for microbial genomics. Frontiers in Genetics. 2015;6:235. doi:10.3389/fgene.2015.00235.

Calling SNPs using a genome assembly

SNPs can be identified from genome assemblies, however, since coverage is 1x at each position in an assembly, spurious SNPs cannot be filtered due to insufficient coverage, nor can contaminating genomes be identified and subsequently removed. For individual genes, SNPs are identified by extracting alignments using BLASTN (Altschul et al., 1990) followed by pairwise alignment of the SNPs. For whole genome assemblies, SNPs are typically identified from whole genome alignments made with software such as MUMmer (Kurtz et al., 2004), Mugsy (Angiuoli and Salzberg, 2011), and Mauve (Darling et al., 2004). Software has also been developed for the identification of SNPs from genome assemblies for whole genome phylogenetics including kSNP (Gardner and Hall, 2013) and parSNP (Treangen et al., 2014). SNP identification using assemblies is useful when analyzing individual genes, processing huge datasets, or if raw reads are unavailable. However, when using assemblies for SNP discovery, SNPs cannot be evaluated and verified with the underlying raw read data.

Long-read sequencing quality is improving; I actually think short-read sequencing will be completely replaced by long-read sequencing in the next 5-10 years. That's why I am interested in whole-genome comparisons. As for identifying short variants, I realise that comparing genomes/contigs/scaffolds/consensus sequences is less reliable, and I agree that read mapping followed by GATK or FreeBayes SNP calling is probably the best method for identifying SNPs. Still, whole-genome comparisons are useful in some situations (when raw reads are unavailable, for example).
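To make the "SNPs from a whole-genome alignment" idea concrete, here is a rough Python sketch that walks two gapped rows of a pairwise alignment and reports SNPs and indels in reference coordinates. It assumes the rows are already aligned, equal in length, and use `-` for gaps; `Variant` and `call_variants` are hypothetical names, and any mismatch (including ambiguous bases such as N) is reported as a SNP. As the quoted paper points out, nothing here can filter spurious calls, because there is no read depth behind an assembly.

```python
from dataclasses import dataclass


@dataclass
class Variant:
    pos: int   # 1-based position on the ungapped reference
    kind: str  # "SNP", "INS" or "DEL"
    ref: str   # reference allele ("" for insertions)
    alt: str   # query allele ("" for deletions)


def call_variants(ref_row: str, qry_row: str) -> list[Variant]:
    """List differences between two gapped, equal-length alignment rows."""
    if len(ref_row) != len(qry_row):
        raise ValueError("aligned rows must have equal length")
    variants: list[Variant] = []
    pos = 0  # number of reference bases consumed so far
    for r, q in zip(ref_row.upper(), qry_row.upper()):
        if r != "-":
            pos += 1
        if r == q:
            continue
        last = variants[-1] if variants else None
        if r == "-":
            # Inserted base, anchored at the last reference base before it.
            if last and last.kind == "INS" and last.pos == pos:
                last.alt += q
            else:
                variants.append(Variant(pos, "INS", "", q))
        elif q == "-":
            # Deleted base; extend the previous deletion if it is contiguous.
            if last and last.kind == "DEL" and last.pos + len(last.ref) == pos:
                last.ref += r
            else:
                variants.append(Variant(pos, "DEL", r, ""))
        else:
            variants.append(Variant(pos, "SNP", r, q))
    return variants
```

Adjacent gap columns are merged into a single indel so that a 2 bp deletion is reported once rather than as two 1 bp events.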

ADD REPLY • link 6.0 years ago by Tim ▴ 130

0

Thanks, haven't heard about it, will check later.

ADD REPLY • link 6.0 years ago by Tim ▴ 130

0

4.7 years ago

sh_shaddad77 • 0

Can I know the answer after 14 months?

What would be the best way to align these genomes (with a pair-wise or multiple alignments) and identify?

It would help me with my objective.

ADD COMMENT • link 4.7 years ago by sh_shaddad77 • 0

0

Hi, did you find your answer?

ADD REPLY • link 3.2 years ago by rthapa ▴ 90

0

Could I know the solution after 22 months? lol

ADD REPLY • link 2.8 years ago by penguin • 0

score 2 · Accepted Answer · 2018-05-10

2

6.0 years ago

kcamnairb ▴ 40

NucDiff looks interesting. I haven't been able to try it yet, but I know it can output a VCF file.
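Once a VCF is available, a quick tally of short versus symbolic records can be sketched in Python as below. This assumes only the standard tab-separated VCF columns (REF is the fourth, ALT the fifth); I have not checked how NucDiff itself encodes larger rearrangements, so the `classify` rules here are a generic guess and both function names are hypothetical.

```python
from collections import Counter


def classify(ref: str, alt: str) -> str:
    """Rough variant class from the REF and ALT fields of one VCF allele."""
    if alt in (".", "*"):
        return "other"  # missing ALT or spanning-deletion placeholder
    if alt.startswith("<") or "[" in alt or "]" in alt:
        return "symbolic/breakend"  # e.g. <DEL>, <INV>, or breakend notation
    if len(ref) == len(alt):
        return "SNP" if len(ref) == 1 else "MNP"
    return "insertion" if len(alt) > len(ref) else "deletion"


def tally_vcf(lines) -> Counter:
    """Count variant classes over an iterable of VCF lines."""
    counts = Counter()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip meta-information, header and blank lines
        fields = line.rstrip("\n").split("\t")
        ref, alts = fields[3], fields[4]
        for alt in alts.split(","):  # multi-allelic sites list ALTs comma-separated
            counts[classify(ref, alt)] += 1
    return counts
```

Passing an open file handle works, for example `tally_vcf(open("variants.vcf"))`, where `variants.vcf` is a placeholder path.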

ADD COMMENT • link 6.0 years ago by kcamnairb ▴ 40

0

NucDiff looks interesting for sure, will give it a go, thanks.

ADD REPLY • link 6.0 years ago by Tim ▴ 130

0

Thank you for recommending NucDiff, it does exactly what I wanted.

ADD REPLY • link 5.9 years ago by Tim ▴ 130