Question: How can I retrieve the protein 'name' from my tab-delimited BLAST results?
Tawny (United States) wrote, 3.3 years ago:

I have a file of BLAST protein results that is over 100,000 lines long. I want to get the protein/gene title associated with each accession number. I have looked, without luck, for a lookup file similar to gene2refseq that maps each GI or accession to its protein title.

I have also tried to use Bio::DB::EUtilities to retrieve what I need, but I don't know which method returns the title.

For example, using protein accession WP_009600323.1 I want to get back tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus].

How can I get this information? Thank you.

Tags: blast, protein, bioperl, gene, perl

If you are searching against a prebuilt NCBI BLAST database, you could have used:

-outfmt '6 std stitle'

or

-outfmt '6 std salltitles'

But only suckers read things like manuals or output of:

command -h
command -help
man command

Right?

Anyway, you don't have to re-blast (assuming prebuilt NCBI blast db):

blastdbcmd -help
(comment written 3.3 years ago by 5heikki)

Let us give @Tawny the benefit of the doubt.
As @5heikki said, you can use blastdbcmd (a command-line utility from the BLAST+ package) to get the information you need:

blastdbcmd -db /path_to/nr -entry WP_009600323.1 -outfmt "%t"

Check blastdbcmd -help for the -entry_batch option, which lets you supply a file of these IDs and get all the results back at once.
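
As a rough sketch of the batch route described above: the first helper pulls the unique subject IDs out of a tabular BLAST file, and the second passes them to blastdbcmd through -entry_batch. The function names are made up for illustration. It assumes column 2 of the results holds a bare accession such as WP_009600323.1, and that a literal tab between the %a (accession) and %t (title) specifiers is acceptable in -outfmt.

```shell
# Hypothetical helper: print the unique subject IDs (column 2 of BLAST
# tabular output) from the results file given as $1.
unique_subject_ids() {
    cut -f2 "$1" | sort -u
}

# Hypothetical helper: look up titles for a file of IDs ($2) in a local
# BLAST database ($1). Prints "accession<TAB>title" per sequence found.
# Assumption: blastdbcmd is on PATH and accepts a literal tab in -outfmt.
titles_from_blastdb() {
    blastdbcmd -db "$1" -entry_batch "$2" -outfmt "$(printf '%%a\t%%t')"
}
```

Typical use would be something like unique_subject_ids blast_results.tsv > accessions.txt, followed by titles_from_blastdb /path_to/nr accessions.txt > titles.tsv, where both file names are placeholders.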

(comment written 3.3 years ago by genomax)

Both @5heikki and @genomax2 are correct. If -outfmt with stitle had been used, my problem would already be solved, and BLAST+ can give me what I need. In this instance, however, I do not have the database that was used to generate this BLAST output. I am running blastdbcmd against a database I have that should be close to the one used to generate my data, but it is not 100% complete, and I am getting a very large number of "OID not found" errors. What other method can be used to get the protein title?
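
One way to see exactly which IDs the local database lacks is to compare the ID list with whatever blastdbcmd did return. The sketch below assumes a two-column, tab-separated titles file whose first column matches the IDs exactly (including the version suffix); the function name and file layout are illustrative, not something taken from the thread.

```shell
# Hypothetical helper: list the IDs in $1 (one per line) that do not
# appear in column 1 of the tab-separated titles file $2.
missing_ids() {
    awk -F '\t' '
        FILENAME == ARGV[1] { seen[$1] = 1; next }   # first file: titles found
        !($1 in seen)                                # second file: IDs to check
    ' "$2" "$1"
}
```

The resulting list is what would still need to be looked up by another route, for example online.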

(comment written 3.3 years ago by Tawny)

Did you try nr? If the IDs are not in nr, then some of them may have been retired (perhaps as duplicates).

(comment written 3.3 years ago by genomax)

With Entrez Direct something like:

efetch -db protein -id WP_009600323.1 -format xml \
| xtract -element Prot-ref_name_E -element Org-ref_taxname \
| awk 'BEGIN{FS="\t"}{print $1" ["$2"]"}'
tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus]

I'm not sure the same elements apply to all accessions. It's easier with a GI number:

efetch -db protein -id 497286106 -format docsum | xtract -element Title
tRNA uridine(34) 5-carboxymethylaminomethyl synthesis enzyme MnmG [Vibrio caribbeanicus]

It's sad that NCBI is phasing out GI numbers while so many of its services (including standalone BLAST) still can't provide the same level of convenience with accessions.

(comment written 3.3 years ago by 5heikki)

@5heikki, thank you for the efetch command. I am just getting back to this problem after working on some other projects. I was able to incorporate the second efetch command you posted into a Perl script, and it is working nicely. It is slow, but it does the job.

(comment written 3.3 years ago by Tawny)
DCGenomics (United States) wrote, 3.3 years ago:

This should work:

cat file_of_accessions | epost -db protein -format acc | efetch -format docsum | xtract -pattern DocumentSummary -element AccessionVersion Title
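
The two-column output of a command like the one above (AccessionVersion, then Title, tab-separated) can then be merged back into the BLAST table. This is a minimal sketch, assuming the subject ID sits in column 2 of the BLAST file and matches the accession text exactly; add_titles and the "NA" placeholder are hypothetical choices.

```shell
# Hypothetical helper: append the title as an extra last column to each
# BLAST hit line.
#   $1: tab-separated "accession<TAB>title" file
#   $2: tabular BLAST results with the subject ID in column 2
add_titles() {
    awk -F '\t' -v OFS='\t' '
        FILENAME == ARGV[1] { title[$1] = $2; next }        # load lookup table
        { print $0, (($2 in title) ? title[$2] : "NA") }    # annotate each hit
    ' "$1" "$2"
}
```

Hits whose accession was not returned are kept and marked "NA" rather than dropped, so the annotated file has the same number of lines as the input.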
