Hello,
I'm currently performing a blast search with a group of sequences from a recently assembled transcriptome. We want to know how many of our transcripts have a blast match with plant protein sequences. Our idea is to use the clustered_nr database to reduce redundancy while searching.
We used the helper script get-cluster-representatives-for-taxid.sh that comes with clustered_nr and got a list of accessions for viridiplantae.
The issue is that, when we use blastdbcmd to get the fasta sequences from the database, they somehow don't exist.
So for example, one accession returned from the script is 0405229A, but when I try to get the sequence I get this:
$ blastdbcmd -entry 0405229A -db clustered_nr/nr_cluster_seq
Error: [blastdbcmd] Entry not found: 0405229A
Error: [blastdbcmd] Entry or entries not found in BLAST database
I also tried with nr, just in case. Same result:
$ blastdbcmd -entry 0405229A -db nr_database/nr
Error: [blastdbcmd] Entry not found: 0405229A
Error: [blastdbcmd] Entry or entries not found in BLAST database
If I search for this accession directly on NCBI's website, there is a protein with this accession, but there is no trace of it in the local database.
What could be the cause of this? Is our database maybe outdated? We downloaded nr locally in December last year, and clustered_nr a month ago.
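To see how many accessions from the list are affected, rather than testing them one by one, something like the following could work. It is only a sketch: `accessions.txt`, `found.fasta`, `blastdbcmd.err` and `missing.txt` are placeholder file names, and the parsing assumes the error lines look like the ones shown above. Batch mode may word the message differently (for example `Skipped <id>`), so the pattern accepts both forms as a precaution.

```shell
#!/usr/bin/env bash
# Sketch: list the accessions that blastdbcmd could not find.
# Assumes stderr lines of the form shown above:
#   Error: [blastdbcmd] Entry not found: <accession>
# and, as an assumption for batch mode:
#   Error: [blastdbcmd] Skipped <accession>

# Read a blastdbcmd stderr log on stdin, print one missing accession per line.
missing_accessions() {
    sed -nE 's/^Error: \[blastdbcmd\] (Entry not found: |Skipped )//p'
}

# Hypothetical usage (file names are placeholders):
#   blastdbcmd -db clustered_nr/nr_cluster_seq -entry_batch accessions.txt \
#       -out found.fasta 2> blastdbcmd.err
#   missing_accessions < blastdbcmd.err > missing.txt
```

Comparing the line count of `missing.txt` with that of `accessions.txt` would then show whether the problem is limited to a few unusual records or affects most of the list.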
Which version of blast+ are you using? I am not seeing this helper script with the latest v2.16.0. That does not look like a valid GenBank accession.
The script comes with the clustered_nr database located here. And yes, it doesn't look like a valid GenBank accession, however it exists in NCBI.
I'm currently using the latest blast+ v2.16.0.
Will take a look at the script. In the meantime, you can use EntrezDirect (LINK) to get the sequences.
Got curious about this record. It turns out that prf (Protein Research Foundation) record(s) actually predate GenBank. They are included for historical reasons in databases like nr to maintain compatibility.
Was that list huge? Trying to run this
get-cluster-representatives-for-taxid.sh -t 33090
and no output has been produced for ~3 hours. I was able to get "normal" looking NCBI accession numbers with taxIDs that are more focused.
It was indeed huge, for viridiplantae we have
And yes, with more focused taxIDs the search is faster because there are fewer accessions in the corresponding taxa.
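For anyone landing here with the same problem, the EntrezDirect suggestion above could be sketched roughly as follows. This assumes EDirect is installed and `efetch` is on the PATH. `missing.txt` (one accession per line) and `missing.fasta` are placeholder file names, and the batch size of 100 is an arbitrary choice, not a documented limit.

```shell
#!/usr/bin/env bash
# Sketch: fetch sequences that are absent from the local BLAST database
# through EDirect instead of blastdbcmd.

# Read accessions (one per line) on stdin and print them as comma-separated
# batches of N (default 100), one batch per output line.
batch_ids() {
    awk -v n="${1:-100}" '
        NF { printf "%s%s", (c % n ? "," : (c ? "\n" : "")), $1; c++ }
        END { if (c) print "" }
    '
}

# Fetch FASTA records for every accession listed in file $1.
fetch_missing() {
    batch_ids "${2:-100}" < "$1" | while IFS= read -r ids; do
        # </dev/null keeps efetch from consuming the loop's stdin
        efetch -db protein -id "$ids" -format fasta < /dev/null
    done
}

# Hypothetical usage:
#   fetch_missing missing.txt > missing.fasta
```

Batching keeps each request small, which seems sensible for a list as large as the viridiplantae one, although the right batch size would need some experimentation.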