How can I retrieve the Protein 'name' using my BLAST tab delimited results?
1
0
Entering edit mode
7.9 years ago
Tawny ▴ 180

I have a file of BLAST protein results. It is over 100000 lines long. I want to get the protein/gene title associated with each accession number. I have been looking for a file I could search through similar to gene2refseq that would have the protein title mapped to the GI or accession with no luck.

I have tried to figure out how to use Bio::DB::EUtilities to retrieve what I need but I don't know what method to use that will get what I want back.

For example, using protein accession WP_009600323.1 I want to get back tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus].

How can I get this information? Thank you.

blast Protein Perl BioPerl Gene • 4.1k views
ADD COMMENT
1
Entering edit mode

In case you're blasting against some prebuilt NCBI blast db, you could have:

-outfmt '6 std stitle'

or

-outfmt '6 std salltitles'

But only suckers read things like manuals or output of:

command -h
command -help
man command

Right?

Anyway, you don't have to re-blast (assuming prebuilt NCBI blast db):

blastdbcmd -help
ADD REPLY
1
Entering edit mode

Let us give @Tawny the benefit of doubt.
As @5heikki said you can use blastdbcmd (a command line utility from blast+ package) to get the info you need. blastdbcmd -db /path_to/nr -entry WP_009600323.1 -outfmt "%t"
Check blastdbcmd -help to investigate -entry_batch option that would allow you to provide a file with these ID's and get all results back at once.

ADD REPLY
0
Entering edit mode

Both @5heikki and @genomax2 are correct. If the -outfmt stitle had been used my problem would have already been solved. BLAST+ can give me what I need. But in this instance I do not have the database that was used to generate this BLAST output. I am running the blastdbcmd using a database that I have that should be close to what was used to generate my data but it is not 100% complete. I am getting a very large number of OID not found errors. What other method can be used to get the protein title?

ADD REPLY
0
Entering edit mode

Did you try nr? If ID's are not there in nr then perhaps it some may have been retired (perhaps duplicates).

ADD REPLY
0
Entering edit mode

With Entrez Direct something like:

efetch -db protein -id WP_009600323.1 -format xml \
| xtract -element Prot-ref_name_E -element Org-ref_taxname \
| awk 'BEGIN{FS="\t"}{print $1" ["$2"]"}'
tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus]

Not sure it's the same elements for all accessions. It's easier with gi number:

efetch -db protein -id 497286106 -format docsum | xtract -element Title
tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus]

It's sad how the NCBI is phasing out gi numbers, while so many of their services (including standalone blast) still can't provide the same level of convenience with accessions..

ADD REPLY
0
Entering edit mode

@5heikki thank you for the efetch command. I am just getting back to this problem after working on some other projects. I was able to include the second efetch command you posted into a Perl script and it is working nicely. It is slow but does the job.

ADD REPLY
0
Entering edit mode
7.8 years ago
DCGenomics ▴ 330

This should work:

cat file_of_accessions | epost -db protein -format acc | efetch -format docsum | xtract -pattern DocumentSummary -element AccessionVersion Title

ADD COMMENT

Login before adding your answer.

Traffic: 2248 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6