Question

Trouble getting representatives from clustered_nr database

4 months ago

Adolfo • 0

Hello,

I'm currently running a BLAST search with a set of sequences from a recently assembled transcriptome. We want to know how many of our transcripts have a BLAST match to plant protein sequences. Our idea is to use the clustered_nr database to reduce redundancy in the search.

We used the helper script get-cluster-representatives-for-taxid.sh that comes with clustered_nr and obtained a list of accessions for Viridiplantae.

The issue is that when we use blastdbcmd to retrieve the FASTA sequences from the database, some of them apparently don't exist.

For example, one accession returned by the script is 0405229A, but when I try to retrieve the sequence I get this:

$ blastdbcmd -entry 0405229A -db clustered_nr/nr_cluster_seq
Error: [blastdbcmd] Entry not found: 0405229A
Error: [blastdbcmd] Entry or entries not found in BLAST database

I also tried with nr, just in case, with the same result:

$ blastdbcmd -entry 0405229A -db nr_database/nr
Error: [blastdbcmd] Entry not found: 0405229A
Error: [blastdbcmd] Entry or entries not found in BLAST database

If I search for this accession directly on NCBI's website, there is a protein with this accession, but there is no trace of it in the local database.

What could be causing this? Could our database be outdated? We downloaded nr locally in December of last year, and clustered_nr a month ago.
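
To gauge how widespread the problem is, one option is to test a sample of the accession list against the local database. The sketch below is only an illustration: the function name find_missing and the file names in the usage comment are hypothetical, and it assumes blastdbcmd exits with a non-zero status when an -entry lookup fails.

```shell
# Print every accession (read from stdin, one per line) that blastdbcmd
# cannot find in the given database.
# Assumption: blastdbcmd returns a non-zero exit status for a missing entry.
find_missing() (
    db=$1
    while IFS= read -r acc || [ -n "$acc" ]; do
        [ -n "$acc" ] || continue   # skip blank lines
        # </dev/null keeps the lookup from consuming the accession list on stdin
        if ! blastdbcmd -entry "$acc" -db "$db" -outfmt '%a' >/dev/null 2>&1 </dev/null; then
            printf '%s\n' "$acc"
        fi
    done
)

# Example usage (hypothetical file names; the sample size is arbitrary):
# shuf -n 1000 accessions.txt | find_missing clustered_nr/nr_cluster_seq > missing.txt
```

One lookup per accession is slow, so this is meant for a sample rather than for a full list.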

database blast • 1.1k views

ADD COMMENT • link 4 months ago by Adolfo • 0

Which version of BLAST+ are you using? I don't see this helper script in the latest version, v2.16.0.

one accession returned from the script is 0405229A,

That does not look like a valid GenBank accession.

ADD REPLY • link 4 months ago by GenoMax 154k

The script comes with the clustered_nr database, located here. And yes, it doesn't look like a valid GenBank accession; however, it exists at NCBI.

I'm currently using the latest BLAST+, v2.16.0.

ADD REPLY • link 4 months ago by Adolfo • 0

Will take a look at the script.

In the meantime, you can use Entrez Direct (LINK) to get the sequences:

$ efetch -db protein -id 0405229A -format fasta
>prf||0405229A phosphorylase pyridoxal binding site
VVFVPDYNVSVAELLIPASDLSEHISTAGMEASGTSNMKFAMBGCZTGIILDGANVE
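
For more than a handful of IDs, the same efetch call can be wrapped in a small loop that sends several IDs per request. This is a rough sketch, not a tested pipeline: the function name fetch_fasta, the default batch size of 200, and the file names in the usage comment are assumptions made for illustration.

```shell
# Fetch FASTA records for IDs read from stdin (one per line),
# sending up to batch_size IDs per efetch call as a comma-separated list.
fetch_fasta() (
    batch_size=${1:-200}            # arbitrary default, adjust as needed
    ids=
    count=0
    while IFS= read -r id || [ -n "$id" ]; do
        [ -n "$id" ] || continue    # skip blank lines
        ids=${ids:+$ids,}$id        # append with a comma separator
        count=$((count + 1))
        if [ "$count" -ge "$batch_size" ]; then
            # </dev/null stops efetch from reading the ID list on stdin
            efetch -db protein -id "$ids" -format fasta </dev/null
            ids=
            count=0
        fi
    done
    # Flush the final, partially filled batch
    if [ -n "$ids" ]; then
        efetch -db protein -id "$ids" -format fasta </dev/null
    fi
)

# Example usage (hypothetical file names):
# fetch_fasta 100 < missing.txt > missing.fasta
```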

ADD REPLY • link 4 months ago by GenoMax 154k

I got curious about this record. It turns out that prf (Protein Research Foundation) records actually predate GenBank. They are included in databases like nr for historical reasons, to maintain compatibility.

ADD REPLY • link 4 months ago by GenoMax 154k

get-cluster-representatives-for-taxid.sh that comes with clustered_nr and got a list of accessions of viridiplantae

Was that list huge? I tried running get-cluster-representatives-for-taxid.sh -t 33090 and it has produced no output for ~3 hours. With more focused taxIDs I was able to get "normal"-looking NCBI accession numbers.

ADD REPLY • link 4 months ago by GenoMax 154k

It was indeed huge. For Viridiplantae we have:

$ wc -l Viridiplantae_representative_cluster.txt 
16968635 Viridiplantae_representative_cluster.txt

And yes, with more focused taxIDs the search is faster because there are fewer accessions in the corresponding taxon.
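
If only a few narrower taxa are of interest, the helper can be run once per taxID and the results merged. This is a sketch under assumptions: it presumes the helper prints one accession per line to standard output, and the REP_SCRIPT variable, the function name, and the output file name are hypothetical. The taxIDs in the usage comment are just examples (3702 is Arabidopsis thaliana, 4530 is Oryza sativa).

```shell
# Path to the helper script; override REP_SCRIPT if it lives elsewhere.
REP_SCRIPT=${REP_SCRIPT:-./get-cluster-representatives-for-taxid.sh}

# Run the helper once per taxID and print the de-duplicated union.
# Assumption: the helper writes one accession per line to standard output.
representatives_for_taxids() {
    for taxid in "$@"; do
        "$REP_SCRIPT" -t "$taxid"
    done | sort -u
}

# Example usage (hypothetical output file name):
# representatives_for_taxids 3702 4530 > plant_subset_representatives.txt
```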

ADD REPLY • link 4 months ago by Adolfo • 0