Question

Blastn DB issue

0

Entering edit mode

9 months ago

hakhasan • 0

Hello !

I'm currently trying to develop a local core-genome MLST tool using a combination of a huge genes database and blastn but I came into outputs I can't explain. Here's what I have:

My genome: genome.fasta
My database, comprised of ~1000 genes and n alleles:
- db/GENE01.fasta: 1500 sequences.
- db/GENE02.fasta: 1200 sequences.
- etc

Now, I proceed in two similar ways but with different outputs. Solution #1, generate one DB per gene using makeblastdb and blast each DB to my genome:

makeblastdb -in db/GENE01.fasta -dbtype nucl
blastn -query genome.fasta -db db/GENE01.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "output_01.txt"

This allowed me to get a match with 100% identity with the allele #27:

NODE_6_length_104687_cov_62.173891  104687  843 100.000 GENE01_27   TTATCTTAGTTGTT...

Now, solution #2 is to compile EVERY genes and alleles from my database (~1,900,000 sequences in total) into one big file "fullDB.fasta" and blast it to my genome:

cat "db/*.fasta" > fullDB.fasta
makeblastdb -max_file_sz '4GB' -in fullDB.fasta -dbtype nucl
blastn -query genome.fasta -db fullDB.fasta -evalue 1e-5 -outfmt "6 qseqid qlen length pident sseqid sseq" -out "fulloutput.txt"

But when I do that (no error and pretty quick), I'm unable to obtain the same results as solution #1. No more GENE01_27 for example.

Is there something I'm missing when using makeblastdb / blastn -db? I supposed it's because genes have closely similar alleles (oftenly one or two SNP differences) but in that case why am I able to obtain something when blasting all alleles from GENE01?

To summarize, solution #1 get correct results but is way too slow (blastn x1000) and solution #2 is fast (only one blastn) but with uncorrect results.

Any idea? Thanks for your time!

cgmlst blastn • 561 views

ADD COMMENT • link updated 9 months ago by Philipp Bayer 8.7k • written 9 months ago by hakhasan • 0

score 2 · Accepted Answer · 2023-11-29

2

Entering edit mode

9 months ago

Philipp Bayer 8.7k

Three things I'd try:

1) The number of alignments BLAST reports in tabular output is fairly low (500?). Increase it to something ridiculous like -num_alignments 999999 to see if your allele pops up.
2) The e-value is calculated using the number of sequences in your database. Try a more lenient e-value cutoff like 1 to see if your allele is included.
3) The default -task megablast can miss many OK-looking hits, it's optimised for close relatives. Try the slower -task blastn to see if you find your allele.

ADD COMMENT • link 9 months ago by Philipp Bayer 8.7k

0

Entering edit mode

Your first solution seems to do the job, I didn't know local blast had such limitations ! I'll try to test your other ones to see if it also changes the output.

Thanks a lot :)

ADD REPLY • link 9 months ago by hakhasan • 0

0

Entering edit mode

Yeah it scares me!!! I am running many BLASTs for many mitogenomes at the moment. There are many highly similar sequence hits on NT across many species and the default number of alignment often does not include the correct species unless I ramp up the number of reported alignments and switch to 'old' task blastn

ADD REPLY • link 9 months ago by Philipp Bayer 8.7k