Question

Comparing the WG sequenced isolate against the other same genus isolates on the public database

1

Entering edit mode

4.7 years ago

bioinforesearchquestions ▴ 370

Hi

Our team recently sequenced a whole genome (WG) of bacterial isolate and performed denovo assembly of that bacterial isolate. Currently I am planning to do following things

1) Compare the WG sequenced bacterial isolate to other publicly available sequences of Escherichia or any other deposited sequences in NCBI. I am thinking of doing BLAST against the custom database. Creating the custom database of all Escherichia isolates and blasting it against my WG sequenced bacterial isolate. Am I correct with my approach?

2) I received another similar strain of WG sequenced bacterial isolate from our collaborator. I would like to compare our team's WG sequenced isolate (isolate A) with our collaborator's isolate B and identify SNPs between the two isolates. I have used GATK for SNP identification for human samples. But for bacterial sequences what methods/tools are used?

I am open for any other recommendations to the above requirement 1 and 2. Any suggestions.

alignment WGS SNPs BLAST bacteria • 2.6k views

ADD COMMENT • link updated 4.7 years ago by bioguy ▴ 50 • written 4.7 years ago by bioinforesearchquestions ▴ 370

1

Entering edit mode

Have you looked at mauve (http://darlinglab.org/mauve/mauve.html ). You may also want to look at aligners meant to align chromosome size chunks like LAST or LASTZ. Blast is a local aligner and will not be totally appropriate here.

For #2 you could use callvariants.sh from BBMap suite after alignments to reference since you have simple haplid genomes because of bacteria.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

Yes, I checked it, as per their website currently it is unsuitable for datasets with more than 50 bacterial genomes. But I suspect the task 1 might have more than 50 bacterial genomes.

For the task2, I need to compare isolate A and isolate B and identify variants. Do you mean I need to consider one of the isolate as the reference?

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

1

Entering edit mode

It is useful to include this type of information in original question (>50 genomes). If you are doing smaller comparisons then using mauve would still give you a birds-eye view of overall rearrangements across these strains (which should be largely similar).

If you want to compare SNP present in the strains you are going to need to use a particular strain as reference (which can be one in GenBank) and then compare the rest to that. You can then compare the VCF files.

Mash/sourmash mentioned below would also be good programs to try.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

Apologies for missing that information, I was exactly not sure how many sequences I am going to consider for the comparison. Thanks for your suggestions.

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Hi genomax,

I have been told to do just comparison using BLAST by using the whole genome bacterial isolate against the sequences in the NCBI. First going to do BLAST as the person. I think the BLAST output is going to give multiple matching entries with different sequenced genome. What can they infer with just the BLAST results?

In addition, I am going to trying MAUVE and Mash as well.

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

BLAST output is going to give multiple matching entries

Make sure you adjust parameters to get long matches, not allow many differences, gaps etc. With blast results you will get the local matches which should be many in this case.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

For callvariants.sh, can I use the input file in FASTA format?

I am not going to do the below step for generating a sam file, because I have isolate A contig.fa and isolate B contig.fa. bbmap.sh in=0001.fastq in2=0002.fastq out=bbmap_mapped.sam ref=H37Rv_reference.fa

callvariants.sh in=isolateB.fa ref=isolateA.fa out=BBMap_isolateA_and_isolateB_variant_call.vcf

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

No. Only the reference file is in fasta format. Check in-line help for the program by running it without any options. Part included here.

Description:  **Calls variants from sam or bam input.**
In default mode, all input files are combined and treated as a single sample.
In multisample mode, each file is treated as an individual sample,
and gets its own column in the VCF file.  Unless overridden, input file
names are used as sample names.

Usage:  callvariants.sh in=<file,file,...> vcf=<file> ref=<file>

Input may be sorted or unsorted.
The reference should be fasta.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

Yeah, I checked that on the commandline. That's why I asked it says I need sam and bam input. My requirement is not to compare isolate A and isolate B with a reference sequence. I need to compare isolate A with isolate B and then identify SNPs. I have two contigs fasta file.

For instance, then I need to align isolate A against isolate B (considering it as a reference) and generate a bam file for isolateA and work on callvariants.sh. Am I correct?

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Yes on the second question.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

Hi genomax,

As discussed, the isolateA (denovo assembly carried out) assembled contigs.fasta has been used as the reference and created the required index files. Then the FASTQ files of isolateB has been aligned against the isolateA contigs.fasta (reference) using BWA. Then using callvariants.sh I found 1 variant.

I didn't use the assembled contigs.fasta from isolateB instead I used the trimmed FASTQ files of isolateB for alignment and variant calling. I know using FASTQ files and identifying the SNPs is the standard method.

I need a clarification here, should I use FASTQ files or assembled contigs.fasta for identifying SNPs?

Is there a way for me to use the assembled contigs.fasta file for identifying SNPs?

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Then using callvariants.sh I found 1 variant.

If your strains are near identical then I find that plausible. Does mauve show this to be true?

If you want to use standard SNP calling methods then you would need to use fastq data.

Is there a way for me to use the assembled contigs.fasta file for identifying SNPs?

You could try doing pair-wise (or multiple alignments of smaller contigs) to see if you could identify changes that way. These can be represented as dot-plots. I don't think these tools allow you to call SNP's.

D_GENIES is a tool to do that.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

I annotated both the contigs.fasta using PROKKA. Then I used the isolateA.gbk and isolateB.gbk for generating the mauve output.

Then I tried reordering the contigs using the isolateA.gbk and isolateB.gbk.

Brigs output

online coin flip simulator

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Genomax, I tried blasting contigs > 500bp (42 contigs) against the "nr database" and optimize for "Highly similar sequences (megablast)" option selected. I got an error.

Similarly I used only one contig (the largest contig) against the nr database and optimize for "Highly similar sequences (megablast)" option selected. Also, I got an error.

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Not much we can do to help unless you tell us what the error was.

ADD REPLY • link 4.7 years ago by GenoMax 141k

0

Entering edit mode

This is the error

There was a problem with the search. Please, contact Help Desk and include RID P07DS4TZ01R

Informational Message: [blastsrv4.REAL]: Error: Process size limit exceeded, resulting in SIGXFSZ (25).

ADD REPLY • link 4.7 years ago by bioinforesearchquestions ▴ 370

0

Entering edit mode

Sounds like you will need to run these searches locally. Web blast does not support queries you are using. Contact the help desk with that reference number to confirm.

ADD REPLY • link 4.7 years ago by GenoMax 141k

score 1 · Answer 1 · 2019-08-20

If you're interested in simply comparing overall genome similarity, Mash (using minhash to compare genomic content) has become a gold standard of sorts for what you're describing (https://github.com/marbl/mash). Mauve also would be appropriate if you want information that an alignment could give you (i.e. genome rearrangement), but it will be less computationally efficient and perhaps overkill.

For SNP calling, you could alternatively use IGV, which calls SNPs by comparing to reference genomes for a given isolate (http://software.broadinstitute.org/software/igv/book/export/html/6).

Anvio also may be helpful in visualizing (or the underlying analysis within) your endeavors. http://merenlab.org/software/anvio/