Question: Comparing the WG sequenced isolate against the other same genus isolates on the public database
1
gravatar for bioinforesearchquestions
15 months ago by
United States
bioinforesearchquestions280 wrote:

Hi

Our team recently sequenced a whole genome (WG) of bacterial isolate and performed denovo assembly of that bacterial isolate. Currently I am planning to do following things

1) Compare the WG sequenced bacterial isolate to other publicly available sequences of Escherichia or any other deposited sequences in NCBI. I am thinking of doing BLAST against the custom database. Creating the custom database of all Escherichia isolates and blasting it against my WG sequenced bacterial isolate. Am I correct with my approach?

2) I received another similar strain of WG sequenced bacterial isolate from our collaborator. I would like to compare our team's WG sequenced isolate (isolate A) with our collaborator's isolate B and identify SNPs between the two isolates. I have used GATK for SNP identification for human samples. But for bacterial sequences what methods/tools are used?

I am open for any other recommendations to the above requirement 1 and 2. Any suggestions.

bacteria alignment blast snps wgs • 662 views
ADD COMMENTlink modified 15 months ago by bioguy50 • written 15 months ago by bioinforesearchquestions280
1

Have you looked at mauve (http://darlinglab.org/mauve/mauve.html ). You may also want to look at aligners meant to align chromosome size chunks like LAST or LASTZ. Blast is a local aligner and will not be totally appropriate here.

For #2 you could use callvariants.sh from BBMap suite after alignments to reference since you have simple haplid genomes because of bacteria.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k

Yes, I checked it, as per their website currently it is unsuitable for datasets with more than 50 bacterial genomes. But I suspect the task 1 might have more than 50 bacterial genomes.

For the task2, I need to compare isolate A and isolate B and identify variants. Do you mean I need to consider one of the isolate as the reference?

ADD REPLYlink written 15 months ago by bioinforesearchquestions280
1

It is useful to include this type of information in original question (>50 genomes). If you are doing smaller comparisons then using mauve would still give you a birds-eye view of overall rearrangements across these strains (which should be largely similar).

If you want to compare SNP present in the strains you are going to need to use a particular strain as reference (which can be one in GenBank) and then compare the rest to that. You can then compare the VCF files.

Mash/sourmash mentioned below would also be good programs to try.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k

Apologies for missing that information, I was exactly not sure how many sequences I am going to consider for the comparison. Thanks for your suggestions.

ADD REPLYlink written 15 months ago by bioinforesearchquestions280

Hi genomax,

I have been told to do just comparison using BLAST by using the whole genome bacterial isolate against the sequences in the NCBI. First going to do BLAST as the person. I think the BLAST output is going to give multiple matching entries with different sequenced genome. What can they infer with just the BLAST results?

In addition, I am going to trying MAUVE and Mash as well.

ADD REPLYlink written 15 months ago by bioinforesearchquestions280

BLAST output is going to give multiple matching entries

Make sure you adjust parameters to get long matches, not allow many differences, gaps etc. With blast results you will get the local matches which should be many in this case.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k

For callvariants.sh, can I use the input file in FASTA format?

I am not going to do the below step for generating a sam file, because I have isolate A contig.fa and isolate B contig.fa. bbmap.sh in=0001.fastq in2=0002.fastq out=bbmap_mapped.sam ref=H37Rv_reference.fa

callvariants.sh in=isolateB.fa ref=isolateA.fa out=BBMap_isolateA_and_isolateB_variant_call.vcf
ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

No. Only the reference file is in fasta format. Check in-line help for the program by running it without any options. Part included here.

Description:  **Calls variants from sam or bam input.**
In default mode, all input files are combined and treated as a single sample.
In multisample mode, each file is treated as an individual sample,
and gets its own column in the VCF file.  Unless overridden, input file
names are used as sample names.

Usage:  callvariants.sh in=<file,file,...> vcf=<file> ref=<file>

Input may be sorted or unsorted.
The reference should be fasta.
ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k

Yeah, I checked that on the commandline. That's why I asked it says I need sam and bam input. My requirement is not to compare isolate A and isolate B with a reference sequence. I need to compare isolate A with isolate B and then identify SNPs. I have two contigs fasta file.

For instance, then I need to align isolate A against isolate B (considering it as a reference) and generate a bam file for isolateA and work on callvariants.sh. Am I correct?

ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

Yes on the second question.

ADD REPLYlink written 15 months ago by genomax92k

Hi genomax,

As discussed, the isolateA (denovo assembly carried out) assembled contigs.fasta has been used as the reference and created the required index files. Then the FASTQ files of isolateB has been aligned against the isolateA contigs.fasta (reference) using BWA. Then using callvariants.sh I found 1 variant.

I didn't use the assembled contigs.fasta from isolateB instead I used the trimmed FASTQ files of isolateB for alignment and variant calling. I know using FASTQ files and identifying the SNPs is the standard method.

I need a clarification here, should I use FASTQ files or assembled contigs.fasta for identifying SNPs?

Is there a way for me to use the assembled contigs.fasta file for identifying SNPs?

ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

Then using callvariants.sh I found 1 variant.

If your strains are near identical then I find that plausible. Does mauve show this to be true?

If you want to use standard SNP calling methods then you would need to use fastq data.

Is there a way for me to use the assembled contigs.fasta file for identifying SNPs?

You could try doing pair-wise (or multiple alignments of smaller contigs) to see if you could identify changes that way. These can be represented as dot-plots. I don't think these tools allow you to call SNP's.

D_GENIES is a tool to do that.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k

I annotated both the contigs.fasta using PROKKA. Then I used the isolateA.gbk and isolateB.gbk for generating the mauve output.

Mauve-output

Then I tried reordering the contigs using the isolateA.gbk and isolateB.gbk.

Reorded-Contigs-Mauve-output

Brigs output

Briggs-output
online coin flip simulator

ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

Genomax, I tried blasting contigs > 500bp (42 contigs) against the "nr database" and optimize for "Highly similar sequences (megablast)" option selected. I got an error.

Similarly I used only one contig (the largest contig) against the nr database and optimize for "Highly similar sequences (megablast)" option selected. Also, I got an error.

ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

Not much we can do to help unless you tell us what the error was.

ADD REPLYlink written 15 months ago by genomax92k

This is the error

There was a problem with the search. Please, contact Help Desk and include RID P07DS4TZ01R

Informational Message: [blastsrv4.REAL]: Error: Process size limit exceeded, resulting in SIGXFSZ (25).

BLAST-error

ADD REPLYlink modified 15 months ago • written 15 months ago by bioinforesearchquestions280

Sounds like you will need to run these searches locally. Web blast does not support queries you are using. Contact the help desk with that reference number to confirm.

ADD REPLYlink modified 15 months ago • written 15 months ago by genomax92k
1
gravatar for bioguy
15 months ago by
bioguy50
bioguy50 wrote:

If you're interested in simply comparing overall genome similarity, Mash (using minhash to compare genomic content) has become a gold standard of sorts for what you're describing (https://github.com/marbl/mash). Mauve also would be appropriate if you want information that an alignment could give you (i.e. genome rearrangement), but it will be less computationally efficient and perhaps overkill.

For SNP calling, you could alternatively use IGV, which calls SNPs by comparing to reference genomes for a given isolate (http://software.broadinstitute.org/software/igv/book/export/html/6).

Anvio also may be helpful in visualizing (or the underlying analysis within) your endeavors. http://merenlab.org/software/anvio/

ADD COMMENTlink written 15 months ago by bioguy50
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1950 users visited in the last hour