Question: Batch Retrieval of Conserved Domains from a List of IDs?
kayrouz.1 wrote (8 days ago):

Background:

I have a large number (>100,000) of functionally unrelated protein sequences and I want to generate a rough functional annotation for each. I tried PROKKA and RASTtk, but they return a large number of "hypothetical" results, so I changed my approach. So far I have used RPS-BLAST to query each sequence against the NCBI Conserved Domain Database (CDD), which gives me the NCBI-CDD identifier of the best hit for each. Now I want to retrieve the "name" associated with each of these identifiers.

What I've tried so far:

I currently have an NCBI-CDD identifier for the top domain hit of each of my sequences. The identifiers are 6-digit numbers (e.g. 240628). For 240628, the name I want to retrieve is "Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains".

In the past I have been able to retrieve information from NCBI by looping over EFetch requests on the command line and writing the output to a file, for example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_030019524.1&rettype=ipg&retmode=xml

However, when I try this approach with CDD identifiers, I just get XML displayed in the browser that I can't easily work with, for example:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=cdd&id=240628

Is there any way to structure this command such that I receive a downloadable file that I can extract information from?

vkkodali wrote (8 days ago):

You can use Entrez Direct for this as follows:

esummary -db cdd -id 240628 | xtract -pattern DocumentSummary -element Id,Subtitle
240628  Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains
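
For the raw-URL approach from the question, a browser is not needed: any HTTP client can write the ESummary response to a file instead of displaying it. The following is only a minimal sketch, assuming curl is installed. The helper name cdd_esummary_url and the output filename are hypothetical, and the endpoint is the one quoted in the question.

```shell
# Hypothetical helper: build an ESummary URL for one or more CDD IDs.
# Runs in a subshell "( ... )" so the IFS change does not leak out.
cdd_esummary_url() (
  IFS=,
  # "$*" joins all arguments with the first character of IFS (a comma)
  printf 'https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=cdd&id=%s\n' "$*"
)

# Usage (requires network access), saving the XML to a file:
#   curl -s "$(cdd_esummary_url 240628)" -o cdd_240628.xml
```

The saved XML still has to be parsed, which is what xtract does in the Entrez Direct pipeline above.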

If you have a file with a long list of identifiers, you can first upload that list with epost as follows (replace file.txt with the path to your file):

epost -db cdd -input file.txt | esummary -db cdd | xtract -pattern DocumentSummary -element Id,Subtitle
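
With more than 100,000 sequences, many will probably share the same top domain, so it may be worth uploading each CDD ID only once and then joining the names back onto the per-sequence hits. The following is a rough sketch under assumptions not stated in the thread: a hypothetical tab-separated hits.tsv (sequence ID, CDD ID), the pipeline output saved as a tab-separated cdd_names.tsv, and the helper names unique_cdd_ids and annotate_hits, which are made up for illustration.

```shell
# Hypothetical inputs (tab-separated):
#   hits.tsv       seq_id <TAB> cdd_id   (best RPS-BLAST hit per sequence)
#   cdd_names.tsv  cdd_id <TAB> name     (saved output of esummary | xtract)

# Print the unique CDD IDs from column 2 of a hits file, one per line.
unique_cdd_ids() {
  cut -f2 "$1" | sort -u
}

# Append the domain name as a third column to each hit.
# $1 = names file, $2 = hits file; IDs without a name get "NA".
annotate_hits() {
  awk -F'\t' '
    FILENAME == ARGV[1] { name[$1] = $2; next }   # first file: load ID -> name
    { print $1 "\t" $2 "\t" (($2 in name) ? name[$2] : "NA") }
  ' "$1" "$2"
}

# Usage (the middle step requires Entrez Direct and network access):
#   unique_cdd_ids hits.tsv > cdd_ids.txt
#   epost -db cdd -input cdd_ids.txt | esummary -db cdd |
#     xtract -pattern DocumentSummary -element Id,Subtitle > cdd_names.tsv
#   annotate_hits cdd_names.tsv hits.tsv > hits_annotated.tsv
```

Marking missing names as "NA" rather than dropping those rows keeps the output aligned with the input, so IDs that returned no summary are easy to spot.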

kayrouz.1 replied (8 days ago):

You have solved my problem, thank you!