Question: Retrieving Taxonomy from Uniprot/Swissprot ACC_ID From Blastx Results
ladypurrsia wrote (2.5 years ago):

I have completed a blastx run on my samples and have obtained the following result (example):

$ head blastx_result.txt

NS500162:172:HG5CJBGXX:1:11101:2522 ZWIP2_ARATH 52.500 40 19 0 2 121 25 64 8.26e-07 44.3

I would like to take the ACC_ID (in this case, ZWIP2_ARATH) and find its taxonomic information. After searching, I found this site: UniProtKB/Swiss-Prot entries.

Here is what this .txt file (from the link) looks like:

ENTRY NAME  AC     nb AA   Description - Biological Source 
ZWIP2_ARATH Q9SVY1   383   Zinc finger protein WIP2 (Protein  TRANSMITTING TRACT) (WIP-                                                     
                           domain protein 2) (AtWIP2) [Gene: WIP2 or NTT or At3g57670 
                           or F15B8.140] - Arabidopsis thaliana (Mouse-ear cress)
023R_IIV3   Q197D7   106   Uncharacterized protein 023R [Gene: IIV3-023R] -                                                                   
                           Invertebrate iridescent virus 3 (IIV-3) (Mosquito virus)

This text file contains all of the ACC_IDs and links each one to its function and taxonomy. The taxonomy comes after the final ' - ' delimiter (a description can contain more than one). However, a simple grep command (grep -e 001R_FRG3G shortdes.txt) will not work, because one ACC_ID can span 1, 2, or 3 lines and grep returns only the line that contains the ID itself.

So, I thought about removing new lines:

awk '{ printf "%s", $0 }' shortdes.txt

but this makes a mess of the file: it keeps all the tabs and wide spacing, and everything ends up on one single line, which is not practical.
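
One possible alternative to stripping every newline is to join only the indented continuation lines onto the record they belong to. The following is a minimal sketch, not a tested parser for the full file. It assumes the layout shown in the excerpt above: a record starts with an entry name such as ZWIP2_ARATH in column 1, followed by the AC and the length, continuation lines are indented, and the organism is whatever follows the last " - ". The function name shortdes_to_table is made up for illustration.

```shell
# Collapse multi-line records into "entry_name<TAB>organism" lines.
# Assumption: the input follows the layout of the shortdes.txt excerpt above.
shortdes_to_table() {
    awk '
    # Print the pending record as "entry_name<TAB>organism", then reset it.
    function emit(    org, i, found) {
        if (name != "") {
            org = desc
            found = 0
            # Keep cutting at " - " so org ends up as the text after the last one.
            while ((i = index(org, " - ")) > 0) {
                org = substr(org, i + 3)
                found = 1
            }
            if (found) printf "%s\t%s\n", name, org
        }
        name = ""
        desc = ""
    }
    # A record starts with an entry name such as ZWIP2_ARATH in column 1.
    /^[A-Z0-9]+_[A-Z0-9]+[ \t]/ {
        emit()
        name = $1
        desc = $0
        # Drop the entry name, the AC and the length; keep the description.
        sub(/^[^ \t]+[ \t]+[^ \t]+[ \t]+[0-9]+[ \t]+/, "", desc)
        sub(/[ \t]+$/, "", desc)
        next
    }
    # Indented lines continue the description of the current record.
    /^[ \t]+[^ \t]/ && name != "" {
        line = $0
        sub(/^[ \t]+/, "", line)
        sub(/[ \t]+$/, "", line)
        desc = desc " " line
        next
    }
    END { emit() }
    ' "$@"
}

# Example call (the output file name is arbitrary):
#   shortdes_to_table shortdes.txt > entry2organism.tsv
```

Header lines are skipped because they do not look like an entry name followed by an AC. An organism name that itself contains " - " would be truncated by this sketch, so the output is worth spot-checking.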

I should add that I have more than 500,000 of these ACC_IDs to look up and map to taxonomy!

There must be a simple way to extract the taxonomy from this file, or to get it by some other means. Any pointer toward a more practical approach would be greatly appreciated!

Thanks a ton!

Tags: uniprot, blastx, taxonomy
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (2.5 years ago):

Using XML/XPath:

$ curl -sL "http://www.uniprot.org/uniprot/001R_FRG3G.xml" | xmllint  --xpath "//*[name() ='organism']/*[name()='name' and @type='scientific']/text()" -

Frog virus 3 (isolate Goorha)

Pierre: Thank you so much! May I ask: is there a way to supply a file of these ACC_IDs? I have more than 500,000 to look up and cannot enter each one manually.

Reply written 2.5 years ago by ladypurrsia

There is a UniProt batch query: http://www.uniprot.org/help/uploadlists

Reply written 2.5 years ago by Pierre Lindenbaum
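
With more than 500,000 IDs, a purely local lookup may also be worth considering alongside the batch query. The sketch below assumes two things that are not shown in the thread: that the blastx output is tab-separated (as in BLAST tabular output) with the ACC_ID in column 2, and that a two-column, tab-separated table of entry name and organism has already been built from the Swiss-Prot list. The function name annotate_hits and the file names in the example call are hypothetical.

```shell
# Append the organism to every blastx hit, using a two-column lookup table.
# $1: tab-separated table (entry name, organism)  [assumed to exist already]
# $2: tab-separated blastx output with the ACC_ID in column 2  [assumption]
annotate_hits() {
    awk -F'\t' -v OFS='\t' '
        # First file: load the lookup table into memory.
        NR == FNR { org[$1] = $2; next }
        # Second file: look up column 2 and append the organism, or NA if unknown.
        { print $0, (($2 in org) ? org[$2] : "NA") }
    ' "$1" "$2"
}

# Example call (file names are placeholders):
#   annotate_hits entry2organism.tsv blastx_result.txt > blastx_with_taxonomy.txt
```

The whole table is held in memory, which should be manageable for a list of this size, and each hit is resolved with a single array lookup instead of one web request per ID.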