Blastn DB issue
Entering edit mode
12 weeks ago
hakhasan • 0

Hello !

I'm currently trying to develop a local core-genome MLST tool using a combination of a huge genes database and blastn but I came into outputs I can't explain. Here's what I have:

  1. My genome: genome.fasta
  2. My database, comprised of ~1000 genes and n alleles:
    • db/GENE01.fasta: 1500 sequences.
    • db/GENE02.fasta: 1200 sequences.
    • etc

Now, I proceed in two similar ways but with different outputs. Solution #1, generate one DB per gene using makeblastdb and blast each DB to my genome:

makeblastdb -in db/GENE01.fasta -dbtype nucl
blastn -query genome.fasta -db db/GENE01.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "output_01.txt"

This allowed me to get a match with 100% identity with the allele #27:

NODE_6_length_104687_cov_62.173891  104687  843 100.000 GENE01_27   TTATCTTAGTTGTT...

Now, solution #2 is to compile EVERY genes and alleles from my database (~1,900,000 sequences in total) into one big file "fullDB.fasta" and blast it to my genome:

cat "db/*.fasta" > fullDB.fasta
makeblastdb -max_file_sz '4GB' -in fullDB.fasta -dbtype nucl
blastn -query genome.fasta -db fullDB.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "fulloutput.txt"

But when I do that (no error and pretty quick), I'm unable to obtain the same results as solution #1. No more GENE01_27 for example.

Is there something I'm missing when using makeblastdb / blastn -db? I supposed it's because genes have closely similar alleles (oftenly one or two SNP differences) but in that case why am I able to obtain something when blasting all alleles from GENE01?

To summarize, solution #1 get correct results but is way too slow (blastn x1000) and solution #2 is fast (only one blastn) but with uncorrect results.

Any idea? Thanks for your time!

cgmlst blastn • 355 views
Entering edit mode
12 weeks ago

Three things I'd try:

1) The number of alignments BLAST reports in tabular output is fairly low (500?). Increase it to something ridiculous like -num_alignments 999999 to see if your allele pops up.
2) The e-value is calculated using the number of sequences in your database. Try a more lenient e-value cutoff like 1 to see if your allele is included.
3) The default -task megablast can miss many OK-looking hits, it's optimised for close relatives. Try the slower -task blastn to see if you find your allele.

Entering edit mode

Your first solution seems to do the job, I didn't know local blast had such limitations ! I'll try to test your other ones to see if it also changes the output.

Thanks a lot :)

Entering edit mode

Yeah it scares me!!! I am running many BLASTs for many mitogenomes at the moment. There are many highly similar sequence hits on NT across many species and the default number of alignment often does not include the correct species unless I ramp up the number of reported alignments and switch to 'old' task blastn


Login before adding your answer.

Traffic: 1052 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6