I'm trying to download and parse a large number of peptide sequences from NCBI using the Entrez eutils. The requests are made with efetch and specified by a list of UIDs. My problem is that efetch returns more sequences than those requested, sometimes many more.
For example, when requesting the following list of 20 UIDs:
['P15712.1', 'NP_001003799.1', 'AAG26087.1', 'AAB70738.1', 'P0A564.2', 'P06914.1', 'P16753.1', 'P19544.2', 'AAF41719.1', 'NP_034269.2', 'Q03145.2', 'P59594.1', 'P43357.1', 'CAB36970.1', 'P12582.1', 'P04637.2', 'P03149.1', 'YP_002608275.1', 'P40967.2', 'Q16385.2']
Efetch responds with 2235 sequences (a 30 MB XML file), with the 20 requested peptides strewn somewhere inside. Responses of this size slow down my program and require extra work to sift through the results for the sequences that were actually requested.
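To illustrate the sifting step, here is a minimal sketch of filtering a response down to the requested UIDs. It assumes the response is FASTA text rather than the XML described above, and that each header line begins with the accession.version identifier followed by whitespace. Both are assumptions, and the function name `filter_fasta` is hypothetical.

```python
def filter_fasta(fasta_text, requested):
    """Return {accession: sequence} for records whose accession was requested.

    Assumption: each FASTA header looks like '>ACCESSION.VERSION description'.
    """
    wanted = set(requested)
    kept = {}
    current = None  # accession of the record being collected, or None to skip
    for line in fasta_text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            accession = fields[0] if fields else ""
            current = accession if accession in wanted else None
            if current is not None:
                kept[current] = []
        elif current is not None:
            # Sequence line belonging to a requested record
            kept[current].append(line)
    return {acc: "".join(parts) for acc, parts in kept.items()}
```

This only works around the problem after the download; it does nothing to reduce the size of the response itself.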
This is the example request (warning: large file download):
Note that adding a retmax specifier to the request does not limit the number of returned sequences; it only limits how many UIDs from the request string are considered.
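For reference, a request of this kind can be assembled roughly as follows. This is a sketch and not necessarily the exact request used above: the `db=protein` and `retmode=xml` values are assumptions, and `build_efetch_url` is a hypothetical helper. It only builds the URL and does not download anything.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"

def build_efetch_url(uids, db="protein", retmode="xml", retmax=None):
    """Build an efetch URL for a list of UIDs (assumed db and retmode)."""
    params = {"db": db, "id": ",".join(uids), "retmode": retmode}
    if retmax is not None:
        # Optional retmax specifier, as discussed above
        params["retmax"] = retmax
    return EFETCH_URL + "?" + urlencode(params)

# Example: a two-UID request with retmax set
url = build_efetch_url(["P15712.1", "AAF41719.1"], retmax=2)
```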
Finally, to make things more concrete: requesting a specific UID such as AAF41719.1 returns a large amount of unrequested data. Is there a way to limit efetch so that it returns only the requested UIDs?