Hello,
I have ~2000-20000 protein queries. I wish to find homologs of these queries in NCBI's microbial genomes. From what I can tell, vanilla BLASTP doesn't do this - it just gives hits in a protein DB, but these are not linked to a specific nucleotide DB entry. Please let me know if this is incorrect. Also, given the number of queries, I'm doing this with BLAST+ v2.12.0+ (not the web form).
I've considered (i) downloading genomes from FTP, making a local DB (assuming the genomes are annotated), and BLASTP-ing that, or (ii) getting a list of protein accessions and trying to link them to a genome entry with an Entrez pipeline (similar to here, but different use case), but the easiest option seems to be a simple TBLASTN of my queries with the -remote flag.
Assuming there's not a better option for achieving the above, can I clarify what the database name should be for TBLASTN - is it still nr? On the web service it gives the DB as Representative genomes (ref_prok_rep_genomes). I have tried all three of ref_prok_rep_genomes, nr and nt, and all are very slow - I've ended up cancelling jobs for a single query because they were taking much longer than I would expect, leading me to suspect an error on my end. Note, I assume the spaces mean Representative genomes is not a valid name.
Any suggestions? My cmds:
tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db nr -out "path/to/results.xml" -outfmt 5 -evalue 0.00005
tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db ref_prok_rep_genomes -out "path/to/results.xml" -outfmt 5 -evalue 0.00005
tblastn -task tblastn-fast -query "path/to/single_query.txt" -remote -db nt -out "path/to/results.xml" -outfmt 5 -evalue 0.00005
Cheers!
Hello, thanks for your reply. Isn't diamond protein to protein alignment? I am trying to find protein homologs that are linked to cognate nucleotide sequences (ideally without having to download huge databases from FTP), so I need something that can translate proteins for alignment to nucleotide sequences (unless one of the other approaches in my post makes more sense!).
but you are correct that it needs a protein database, so it runs blastp and blastx and not tblastn
you could still look up the nucleotides against the proteins (i.e. use the nucleotide sequences as blastx queries against a database built from your proteins); if it is indeed 10,000x faster then it is worth it
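A minimal sketch of what that reverse search could look like with diamond, assuming the genomes have already been downloaded as nucleotide FASTA. All file names here (queries.faa, genomes.fna, queries_db, genome_vs_queries.tsv) are hypothetical placeholders, and the e-value is simply copied from the tblastn commands in the question. The script only prints the commands unless RUN=1 is set and diamond is on the PATH.

```shell
#!/bin/sh
# Hypothetical file names - replace with your own paths.
QUERIES="queries.faa"            # protein queries, FASTA
GENOMES="genomes.fna"            # genome nucleotide sequences, FASTA
DB="queries_db"                  # diamond database built from the proteins
OUT="genome_vs_queries.tsv"      # tabular hits

# Print a command; execute it only when RUN=1 and the tool is installed.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# 1. Build a protein database from the queries.
run diamond makedb --in "$QUERIES" -d "$DB"

# 2. Translate the genome sequences and align them against that database.
#    In the tabular output the query columns refer to the genome sequence
#    and the subject columns refer to the protein.
run diamond blastx -d "$DB" -q "$GENOMES" -o "$OUT" --outfmt 6 --evalue 0.00005
```

Because the genome is the query here, each hit row carries the nucleotide sequence ID and coordinates next to the matching protein, which is the protein-to-genome link asked about. Whole contigs are much longer than the reads blastx is usually fed, so it is worth checking the diamond manual for its long-sequence options (e.g. --long-reads) before trusting the output; that part is not covered by this sketch.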
the remote NCBI blast is not designed to align a large number of sequences against NR - the process is very time-consuming and they can't provide that much computing power with no restrictions
if you want to run batch programs, your best bet is to run it locally with an accelerated tool
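For the "run it locally" route with plain BLAST+, a rough sketch could look like the following. It assumes the genomes are already on disk as one nucleotide FASTA file; every file name and the thread count are hypothetical placeholders, and the search options are the ones from the commands in the question with tabular output swapped in. As written it only prints the commands; set RUN=1 to execute them if BLAST+ is installed.

```shell
#!/bin/sh
# Hypothetical file names - replace with your own paths.
GENOMES="genomes.fna"        # downloaded genome nucleotide sequences, FASTA
DB="local_genomes"           # name for the local nucleotide BLAST database
QUERIES="all_queries.faa"    # all protein queries in one multi-FASTA file
OUT="tblastn_hits.tsv"       # tabular hits
THREADS=4                    # assumed core count

# Print a command; execute it only when RUN=1 and the tool is installed.
run() {
    echo "$*"
    if [ "${RUN:-0}" = "1" ] && command -v "$1" >/dev/null 2>&1; then
        "$@"
    fi
}

# 1. Build a local nucleotide database from the genomes.
run makeblastdb -in "$GENOMES" -dbtype nucl -out "$DB"

# 2. Search all protein queries against it in one batch, without -remote.
run tblastn -task tblastn-fast -query "$QUERIES" -db "$DB" \
    -out "$OUT" -outfmt 6 -evalue 0.00005 -num_threads "$THREADS"
```

Passing all queries in one multi-FASTA file avoids starting one job per query. BLAST+ also ships an update_blastdb.pl script for fetching NCBI's preformatted databases, which may save the makeblastdb step if ref_prok_rep_genomes is offered there, though that is still a large download.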