Question: Batch Retrieval of Conserved Domains from a List of IDs?
6 months ago, kayrouz.1 wrote:

Background:

I have a large number (>100,000) of functionally unrelated protein sequences and I want to generate a rough functional annotation for each. I've tried PROKKA and RASTtk, but they return a large number of "hypothetical" results, so I've changed my approach. So far I have used RPS-BLAST to query each sequence against the NCBI Conserved Domain Database (CDD), which successfully gives me the NCBI-CDD identifier of the best hit for each. Now I want to retrieve the "name" associated with each of these identifiers.

What I've tried so far:

I currently have an NCBI-CDD identifier for the top domain hit of each of my sequences. The identifiers take the form of a 6-digit number (e.g. 240628), and the "name" I am after is the descriptive title attached to it (for 240628, the name is "Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains").

In the past I have retrieved information from NCBI by looping over EFetch requests on the command line and writing the output to a file, using URLs such as the one below:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&id=WP_030019524.1&rettype=xml&retmode=ipg

However, when I try this approach with CDD identifiers, I just get XML displayed in the browser that I can't really work with. See below for what I mean:

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi?db=cdd&id=240628

Is there any way to structure this command such that I receive a downloadable file that I can extract information from?

6 months ago, vkkodali (United States) wrote:

You can use Entrez Direct for this as follows:

esummary -db cdd -id 240628 | xtract -pattern DocumentSummary -element Id,Subtitle
240628  Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains

If you have a file with a long list of identifiers, you can first upload the list with epost and pipe the result to esummary (replace file.txt with the path to your own file):

epost -db cdd -input file.txt | esummary -db cdd | xtract -pattern DocumentSummary -element Id,Subtitle
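
To tie the retrieved names back to the original sequences, something along the following lines may help. It is only a sketch under assumptions: the file names (hits.tsv, cdd_ids.txt, cdd_names.tsv, annotated.tsv) and the two helper functions are hypothetical, and hits.tsv is assumed to hold two tab-separated columns, the sequence ID and its top CDD identifier. Deduplicating the identifiers first keeps the uploaded list short, since many sequences are likely to share the same top hit.

```shell
#!/usr/bin/env bash
# Hypothetical helpers; hits.tsv is ASSUMED to be: sequence ID <TAB> CDD identifier

# Print the unique CDD identifiers found in column 2 of a hits table.
unique_cdd_ids() {
    cut -f2 "$1" | sort -u
}

# Append the domain name to every row of the hits table.
# $1 = names table (Id <TAB> Subtitle, as printed by xtract)
# $2 = hits table  (sequence ID <TAB> CDD identifier)
annotate_hits() {
    awk -F'\t' -v OFS='\t' '
        FILENAME == ARGV[1] { name[$1] = $2; next }   # first file: Id -> name map
        { print $1, $2, (($2 in name) ? name[$2] : "NA") }  # NA if no name found
    ' "$1" "$2"
}

# Typical run (needs Entrez Direct and network access), left commented out:
# unique_cdd_ids hits.tsv > cdd_ids.txt
# epost -db cdd -input cdd_ids.txt | esummary -db cdd \
#     | xtract -pattern DocumentSummary -element Id,Subtitle > cdd_names.tsv
# annotate_hits cdd_names.tsv hits.tsv > annotated.tsv
```

The final redirect writes a plain tab-separated file, so the result can be opened in a spreadsheet or processed further with standard tools. Identifiers for which no name came back are marked NA rather than dropped, so the output keeps one row per input sequence.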

You have solved my problem, thank you!

Reply written 6 months ago by kayrouz.1