Question: efetch by UID list returns more UIDs than requested
ors9 wrote:

I'm trying to download and parse a large number of peptide sequences from NCBI using the Entrez eutils. The requests are made with efetch and specified by a list of UIDs. My problem is that efetch returns more UIDs than those requested, sometimes many more.

For example when requesting the following list of 20 UIDs:

['P15712.1', 'NP_001003799.1', 'AAG26087.1', 'AAB70738.1', 'P0A564.2', 'P06914.1', 'P16753.1', 'P19544.2', 
'AAF41719.1', 'NP_034269.2', 'Q03145.2', 'P59594.1', 'P43357.1', 'CAB36970.1', 'P12582.1', 'P04637.2', 'P03149.1', 
'YP_002608275.1', 'P40967.2', 'Q16385.2']

Efetch responds with a list of 2235 sequences (a 30 MB XML file), with the 20 requested peptides strewn somewhere inside. Responses of this size slow down my program and require extra work to sift through the results for the sequences that were actually requested.

This is the example request (warning: large file download):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gp&retmode=xml&complexity=4&id=P15712.1,NP_001003799.1,AAG26087.1,AAB70738.1,P0A564.2,P06914.1,P16753.1,P19544.2,AAF41719.1,NP_034269.2,Q03145.2,P59594.1,P43357.1,CAB36970.1,P12582.1,P04637.2,P03149.1,YP_002608275.1,P40967.2,Q16385.2

Note that adding a retmax specifier to the request does not limit the number of returned sequences; it only limits the number of UIDs considered from the request string.

Finally, to make things more concrete: requesting a specific UID such as AAF41719.1 returns a large amount of unrequested data. Is there a way to limit efetch by UID so that it returns only the requested UIDs?
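As a stopgap for the sifting described above (not a fix for the request itself), here is a minimal Python sketch that streams through a GenPept XML response and keeps only the requested records. It assumes the response is a GBSet of GBSeq elements carrying GBSeq_accession-version and GBSeq_sequence children; the function name filter_gbseq and the tiny sample document (including the placeholder accession XX000000.1) are made up for illustration.

```python
import io
import xml.etree.ElementTree as ET


def filter_gbseq(xml_source, wanted):
    """Yield (accession.version, sequence) for the requested records only.

    xml_source is a filename or a binary file object. Element names are an
    assumption about the rettype=gp, retmode=xml response layout.
    """
    wanted = set(wanted)
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != "GBSeq":
            continue
        acc = elem.findtext("GBSeq_accession-version")
        if acc in wanted:
            yield acc, elem.findtext("GBSeq_sequence")
        elem.clear()  # drop the record's children so memory stays bounded


# Made-up two-record sample standing in for a real (much larger) response.
SAMPLE = b"""<GBSet>
  <GBSeq>
    <GBSeq_accession-version>AAF41719.1</GBSeq_accession-version>
    <GBSeq_sequence>malvel</GBSeq_sequence>
  </GBSeq>
  <GBSeq>
    <GBSeq_accession-version>XX000000.1</GBSeq_accession-version>
    <GBSeq_sequence>aaaa</GBSeq_sequence>
  </GBSeq>
</GBSet>"""

hits = dict(filter_gbseq(io.BytesIO(SAMPLE), ["AAF41719.1"]))
print(hits)
```

Streaming with iterparse avoids loading the whole 30 MB file into memory at once, but the download itself is still just as large.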

eutils entrez • 168 views
modified 6 weeks ago • written 7 weeks ago by ors9

The complexity parameter seems to have only the following options in the EntrezDirect version. Not sure where you got 4 from.

  -complexity    0 = default, 1 = bioseq, 3 = nuc-prot set
modified 6 weeks ago • written 7 weeks ago by genomax

According to NCBI's efetch page, complexity=4 should return a minimal pub-set. While I'm not sure what that means, changing the complexity of a request for AAF41719.1 alone from 4 to 3 results in a larger response file (by about 10 MB) that includes additional info for each entry, such as GBSeq_secondary-accessions.

modified 6 weeks ago • written 6 weeks ago by ors9

ors9 wrote:

After a bit more digging prompted by @genomax's answer, I found that removing the complexity parameter from my requests narrows the response down to one result in the case of AAF41719.1, and to the desired number of results in the case of the sample list.
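For anyone building these requests in code, a minimal Python sketch of the same request with the complexity parameter simply left out. The function name build_efetch_url is made up for illustration; the endpoint and the db, rettype, retmode and id parameters are the ones from the example request in the question. It only builds the URL and does not contact NCBI.

```python
from urllib.parse import urlencode

EFETCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(uids, db="protein", rettype="gp", retmode="xml"):
    """Build an efetch URL for a list of UIDs without a complexity parameter."""
    params = {
        "db": db,
        "rettype": rettype,
        "retmode": retmode,
        "id": ",".join(uids),
    }
    # safe="," keeps the UID separators readable instead of encoding them as %2C
    return EFETCH + "?" + urlencode(params, safe=",")


url = build_efetch_url(["AAF41719.1", "P04637.2"])
print(url)
```

The resulting URL can then be fetched with whatever HTTP client the program already uses.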

Thanks a bunch, genomax!

written 6 weeks ago by ors9

genomax wrote:

I get only one result when I ask for that accession number in the following formats.

$ efetch -db protein -id "AAF41719.1" -format acc
AAF41719.1

$ efetch -db protein -id "AAF41719.1" -format fasta
>AAF41719.1 pyruvate dehydrogenase, E3 component, lipoamide dehydrogenase [Neisseria meningitidis MC58]
MALVELKVPDIGGHENVDIIAVEVNVGDTIAVDDTLITLETDKATMDVPAEVAGVVKEVKVKVGDKISEG
GLIVVVEAEGTAAAPKAEAAAAPAQEAPKAAAPAPQAAQFGGSADAEYDVVVLGGGPGGYSAAFAAADEG

UPDATE: Asking for XML format appears to return data for many more accessions than AAF41719, rather than for that specific accession alone.

$ efetch -db protein -id "AAF41719" -format native -mode xml | grep Textseq-id_accession | wc -l
    2064

OR

$ esearch -db protein -query "AAF41719"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2 | wc -l
    2064

With other accessions you get more reasonable results:

$ esearch -db protein -query "Q16385"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2
              <Textseq-id_accession>Q16385</Textseq-id_accession>

$ esearch -db protein -query "CAB36970"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2
                      <Textseq-id_accession>X79200</Textseq-id_accession>
                      <Textseq-id_accession>CAB36970</Textseq-id_accession>
modified 7 weeks ago • written 7 weeks ago by genomax

Sorry for the late reply; however, it does not seem to work in my case. After downloading NCBI's response, this happens:

cat /tmp/mozilla_userX/sequence.fasta | grep | wc -l returns 2064 results.

modified 6 weeks ago • written 6 weeks ago by ors9

Were you requesting XML format? We know you will get that many if you do. If you request fasta format, you should get one sequence.

I was using EntrezDirect on the command line (not the web version), so its behavior may differ from the eutils version.

written 6 weeks ago by genomax

I tried both formats using the web service (see the response above). Both retrieved over 2000 results (the grep searched for entries by >; I left that out above). But you didn't set a complexity for your request, and it worked when I tried that (see the other answer).
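For checking a fasta response from within a program, a minimal Python sketch of the same count. The function name count_fasta_records is made up for illustration. Unlike a plain grep for >, it counts only lines that start with >, so a > inside a description line is not counted twice. The sample reuses the header from genomax's output plus a made-up second record.

```python
def count_fasta_records(text):
    """Count FASTA records by counting header lines that start with '>'."""
    return sum(1 for line in text.splitlines() if line.startswith(">"))


# First header is from the thread; the second record is a made-up placeholder.
sample = (
    ">AAF41719.1 pyruvate dehydrogenase, E3 component, lipoamide dehydrogenase "
    "[Neisseria meningitidis MC58]\n"
    "MALVELKVPDIGGHENVDIIAVEVNVGDTIAVDDTLITLETDKATMDVPAEVAGVVKEVKVKVGDKISEG\n"
    ">XX000000.1 placeholder record\n"
    "AAAA\n"
)

n_records = count_fasta_records(sample)
print(n_records)
```

Comparing this count with the number of requested UIDs is a quick way to spot an oversized response.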

written 6 weeks ago by ors9