I have a large number (>100,000) of functionally unrelated protein sequences and I want to generate a rough functional annotation for each. I've tried PROKKA and RASTtk, but they tend to return a large number of "hypothetical" results, so I've changed my approach. So far I have used RPS-BLAST to query each sequence against the NCBI Conserved Domain Database (CDD), which successfully gives me the NCBI-CDD identifier of the best hit for each. Now I want to retrieve the "name" associated with each of these identifiers.
What I've tried so far:
I currently have an NCBI-CDD identifier that encodes the top domain hit for each of my sequences. The identifiers take the form of a 6-digit number (e.g. 240628). I want to retrieve the "name" associated with each of these identifiers (for 240628, the name is "Phosphoglycerate dehydrogenase (PGDH) NAD-binding and catalytic domains").
In the past I have been able to retrieve information from NCBI on the command line by looping over EFetch commands and writing the output to a file, as shown below:
However, when I try this approach with CDD identifiers, I just get a web-based XML readout that I can't really work with. See below for what I mean:
Is there any way to structure this command such that I receive a downloadable file that I can extract information from?
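One route that might sidestep the XML entirely is a local lookup table instead of one request per identifier. The sketch below assumes a tab-delimited table in the style of the `cddid.tbl` summary file that NCBI distributes with CDD, with the numeric identifier in column 1, the accession in column 2, the short name in column 3 and the description in column 4. That column layout is an assumption and should be checked against the file actually downloaded. The function names `load_cdd_names` and `annotate` are made up for illustration.

```python
import csv
from typing import Dict, Iterable, List, Tuple


def load_cdd_names(lines: Iterable[str]) -> Dict[str, Tuple[str, str]]:
    """Map each CDD numeric identifier to (short_name, description).

    Assumed column order (verify against your file):
    identifier, accession, short name, description, ...
    """
    names: Dict[str, Tuple[str, str]] = {}
    # QUOTE_NONE keeps quote characters inside descriptions literal.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 4:
            continue  # skip blank or malformed lines
        cdd_id, _accession, short_name, description = row[:4]
        names[cdd_id.strip()] = (short_name, description)
    return names


def annotate(
    ids: Iterable[object],
    names: Dict[str, Tuple[str, str]],
    default: str = "NA",
) -> List[Tuple[str, str, str]]:
    """Return (identifier, short_name, description) for each query identifier."""
    out = []
    for cdd_id in ids:
        key = str(cdd_id).strip()
        short_name, description = names.get(key, (default, default))
        out.append((key, short_name, description))
    return out


# Hypothetical usage, assuming a local gzipped copy of the table:
#
#   import gzip
#   with gzip.open("cddid.tbl.gz", "rt", encoding="utf-8") as handle:
#       names = load_cdd_names(handle)
#   for record in annotate([240628], names):
#       print("\t".join(record))
```

Loading the table once and looking identifiers up in a dictionary keeps >100,000 lookups fast and avoids sending one web request per sequence.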