I'm trying to download and parse a large number of peptide sequences from NCBI using the Entrez E-utilities. The requests are made with efetch and specified by a list of UIDs. My problem is that efetch returns more sequences than the UIDs I requested, sometimes many more.
For example, when requesting the following list of 20 UIDs:
['P15712.1', 'NP_001003799.1', 'AAG26087.1', 'AAB70738.1', 'P0A564.2', 'P06914.1', 'P16753.1', 'P19544.2',
'AAF41719.1', 'NP_034269.2', 'Q03145.2', 'P59594.1', 'P43357.1', 'CAB36970.1', 'P12582.1', 'P04637.2', 'P03149.1',
'YP_002608275.1', 'P40967.2', 'Q16385.2']
efetch responds with a list of 2235 sequences (a 30 MB XML file), with the 20 requested peptides strewn somewhere inside. Responses of this size slow down my program and require extra work to sift through all of the results for the sequences that were actually requested.
This is the example request (warning: large file download):
https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gp&retmode=xml&complexity=4&id=P15712.1,NP_001003799.1,AAG26087.1,AAB70738.1,P0A564.2,P06914.1,P16753.1,P19544.2,AAF41719.1,NP_034269.2,Q03145.2,P59594.1,P43357.1,CAB36970.1,P12582.1,P04637.2,P03149.1,YP_002608275.1,P40967.2,Q16385.2
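For anyone who wants to reproduce this, here is a minimal sketch of how the request URL above can be assembled from the UID list. The function name `build_efetch_url` is only illustrative; the parameters are the ones from the URL above, and nothing is actually downloaded.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(uids, complexity=4):
    """Build an efetch URL for protein records (hypothetical helper).

    Uses the same parameters as the example request: GenPept records
    returned as XML, with the UIDs joined by commas.
    """
    params = {
        "db": "protein",
        "rettype": "gp",
        "retmode": "xml",
        "complexity": complexity,
        "id": ",".join(uids),
    }
    # safe="," keeps the commas between UIDs unescaped in the query string
    return EFETCH_URL + "?" + urlencode(params, safe=",")

url = build_efetch_url(["P15712.1", "NP_001003799.1", "AAG26087.1"])
```

Changing the `complexity` argument makes it easy to compare response sizes for the same UID list.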
Note that adding a retmax specifier to the request does not limit the number of returned sequences; it only limits how many of the UIDs in the request string are considered.
Finally, to make things more concrete: requesting a single specific UID such as AAF41719.1 returns a large amount of unrequested data. Is there a way to make efetch return only the records for the requested UIDs?
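The sifting described above can be sketched roughly as follows. This is a minimal illustration, not a fix for the oversized download: it assumes the response is GenBank-style XML in which each record is a `GBSeq` element with `GBSeq_accession-version` and `GBSeq_sequence` children (check these tag names against an actual response), and `filter_requested` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

def filter_requested(xml_text, requested_uids):
    """Keep only the records whose accession.version was requested.

    Returns a dict mapping accession.version to the sequence string.
    The tag names are assumptions about the GBSeq XML layout.
    """
    wanted = set(requested_uids)
    root = ET.fromstring(xml_text)
    kept = {}
    for record in root.iter("GBSeq"):
        accession = record.findtext("GBSeq_accession-version")
        if accession in wanted:
            kept[accession] = record.findtext("GBSeq_sequence")
    return kept

# Tiny hand-written stand-in for an efetch response
sample = """<GBSet>
  <GBSeq>
    <GBSeq_accession-version>AAF41719.1</GBSeq_accession-version>
    <GBSeq_sequence>mkv</GBSeq_sequence>
  </GBSeq>
  <GBSeq>
    <GBSeq_accession-version>AAF00000.1</GBSeq_accession-version>
    <GBSeq_sequence>mtt</GBSeq_sequence>
  </GBSeq>
</GBSet>"""

print(filter_requested(sample, ["AAF41719.1"]))
```

This still requires downloading and parsing the whole response, which is exactly the overhead I would like to avoid.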
The complexity parameter seems to have only the following options in the EntrezDirect version. Not sure where you got 4 from.
According to NCBI's efetch page, complexity=4 should return a minimal pub-set. While I'm not sure what that means, changing the complexity of a request for AAF41719.1 only from 4 to 3 results in a larger response file (by about 10 MB) and includes additional info for each entry, such as GBSeq_secondary-accessions.