Our team recently sequenced a whole genome (WG) of bacterial isolate and performed denovo assembly of that bacterial isolate. Currently I am planning to do following things
1) Compare the WG sequenced bacterial isolate to other publicly available sequences of Escherichia or any other deposited sequences in NCBI. I am thinking of doing BLAST against the custom database. Creating the custom database of all Escherichia isolates and blasting it against my WG sequenced bacterial isolate. Am I correct with my approach?
2) I received another similar strain of WG sequenced bacterial isolate from our collaborator. I would like to compare our team's WG sequenced isolate (isolate A) with our collaborator's isolate B and identify SNPs between the two isolates. I have used GATK for SNP identification for human samples. But for bacterial sequences what methods/tools are used?
I am open for any other recommendations to the above requirement 1 and 2. Any suggestions.
Have you looked at
mauve(http://darlinglab.org/mauve/mauve.html ). You may also want to look at aligners meant to align chromosome size chunks like LAST or LASTZ. Blast is a local aligner and will not be totally appropriate here.
For #2 you could use
callvariants.shfrom BBMap suite after alignments to reference since you have simple haplid genomes because of bacteria.
Yes, I checked it, as per their website currently it is unsuitable for datasets with more than 50 bacterial genomes. But I suspect the task 1 might have more than 50 bacterial genomes.
For the task2, I need to compare isolate A and isolate B and identify variants. Do you mean I need to consider one of the isolate as the reference?
It is useful to include this type of information in original question (>50 genomes). If you are doing smaller comparisons then using mauve would still give you a birds-eye view of overall rearrangements across these strains (which should be largely similar).
If you want to compare SNP present in the strains you are going to need to use a particular strain as reference (which can be one in GenBank) and then compare the rest to that. You can then compare the VCF files.
Mash/sourmash mentioned below would also be good programs to try.
Apologies for missing that information, I was exactly not sure how many sequences I am going to consider for the comparison. Thanks for your suggestions.
I have been told to do just comparison using BLAST by using the whole genome bacterial isolate against the sequences in the NCBI. First going to do BLAST as the person. I think the BLAST output is going to give multiple matching entries with different sequenced genome. What can they infer with just the BLAST results?
In addition, I am going to trying MAUVE and Mash as well.
Make sure you adjust parameters to get long matches, not allow many differences, gaps etc. With blast results you will get the local matches which should be many in this case.
For callvariants.sh, can I use the input file in FASTA format?
I am not going to do the below step for generating a sam file, because I have isolate A contig.fa and isolate B contig.fa.
bbmap.sh in=0001.fastq in2=0002.fastq out=bbmap_mapped.sam ref=H37Rv_reference.fa
No. Only the reference file is in fasta format. Check in-line help for the program by running it without any options. Part included here.
Yeah, I checked that on the commandline. That's why I asked it says I need sam and bam input. My requirement is not to compare isolate A and isolate B with a reference sequence. I need to compare isolate A with isolate B and then identify SNPs. I have two contigs fasta file.
For instance, then I need to align isolate A against isolate B (considering it as a reference) and generate a bam file for isolateA and work on callvariants.sh. Am I correct?
Yes on the second question.
As discussed, the isolateA (denovo assembly carried out) assembled contigs.fasta has been used as the reference and created the required index files. Then the FASTQ files of isolateB has been aligned against the isolateA contigs.fasta (reference) using BWA. Then using callvariants.sh I found 1 variant.
I didn't use the assembled contigs.fasta from isolateB instead I used the trimmed FASTQ files of isolateB for alignment and variant calling. I know using FASTQ files and identifying the SNPs is the standard method.
I need a clarification here, should I use FASTQ files or assembled contigs.fasta for identifying SNPs?
Is there a way for me to use the assembled contigs.fasta file for identifying SNPs?
If your strains are near identical then I find that plausible. Does
mauveshow this to be true?
If you want to use standard SNP calling methods then you would need to use fastq data.
You could try doing pair-wise (or multiple alignments of smaller contigs) to see if you could identify changes that way. These can be represented as dot-plots. I don't think these tools allow you to call SNP's.
D_GENIES is a tool to do that.
I annotated both the contigs.fasta using PROKKA. Then I used the isolateA.gbk and isolateB.gbk for generating the mauve output.
Then I tried reordering the contigs using the isolateA.gbk and isolateB.gbk.
online coin flip simulator
Genomax, I tried blasting contigs > 500bp (42 contigs) against the "nr database" and optimize for "Highly similar sequences (megablast)" option selected. I got an error.
Similarly I used only one contig (the largest contig) against the nr database and optimize for "Highly similar sequences (megablast)" option selected. Also, I got an error.
Not much we can do to help unless you tell us what the error was.
This is the error
There was a problem with the search. Please, contact Help Desk and include RID P07DS4TZ01R
Informational Message: [blastsrv4.REAL]: Error: Process size limit exceeded, resulting in SIGXFSZ (25).
Sounds like you will need to run these searches locally. Web blast does not support queries you are using. Contact the help desk with that reference number to confirm.