efetch by UID list returns more UIDs than requested
3.7 years ago
ors9 • 0

I'm trying to download and parse a large number of peptide sequences from NCBI using the Entrez E-utilities. The requests are made with efetch and specified by a list of UIDs. My problem is that efetch returns more records than the UIDs I requested, sometimes many more.

For example when requesting the following list of 20 UIDs:

['P15712.1', 'NP_001003799.1', 'AAG26087.1', 'AAB70738.1', 'P0A564.2', 'P06914.1', 'P16753.1', 'P19544.2', 
'AAF41719.1', 'NP_034269.2', 'Q03145.2', 'P59594.1', 'P43357.1', 'CAB36970.1', 'P12582.1', 'P04637.2', 'P03149.1', 
'YP_002608275.1', 'P40967.2', 'Q16385.2']

Efetch responds with a list of 2235 sequences (a 30 MB XML file), with the 20 requested peptides strewn somewhere inside. Responses of this size slow down my program and require extra work to sift through all of the results for the sequences that were actually requested.
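The sifting step described above could look roughly like the sketch below. It assumes the response is GBSeq XML (as returned for rettype=gp, retmode=xml) and that each record carries a GBSeq_accession-version element; the function name and the sample XML are made up for illustration and are not real NCBI output.

```python
import xml.etree.ElementTree as ET


def filter_requested(xml_text, requested_ids):
    """Return {accession.version: GBSeq element} for the requested records only."""
    wanted = set(requested_ids)
    root = ET.fromstring(xml_text)
    kept = {}
    # Assumption: every record is a GBSeq element with a GBSeq_accession-version child.
    for record in root.iter("GBSeq"):
        acc_ver = record.findtext("GBSeq_accession-version")
        if acc_ver in wanted:
            kept[acc_ver] = record
    return kept


# Synthetic stand-in for a response (hypothetical second record, not NCBI data).
SAMPLE = """<GBSet>
  <GBSeq><GBSeq_accession-version>AAF41719.1</GBSeq_accession-version></GBSeq>
  <GBSeq><GBSeq_accession-version>UNREQUESTED.1</GBSeq_accession-version></GBSeq>
</GBSet>"""

hits = filter_requested(SAMPLE, ["AAF41719.1", "P04637.2"])
print(sorted(hits))
```

This only hides the problem on the client side: the full response still has to be downloaded and parsed before anything is discarded.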

This is the example request (warning large file download):

https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&rettype=gp&retmode=xml&complexity=4&id=P15712.1,NP_001003799.1,AAG26087.1,AAB70738.1,P0A564.2,P06914.1,P16753.1,P19544.2,AAF41719.1,NP_034269.2,Q03145.2,P59594.1,P43357.1,CAB36970.1,P12582.1,P04637.2,P03149.1,YP_002608275.1,P40967.2,Q16385.2

Note that adding a retmax specifier to the request does not limit the number of returned sequences; it only limits how many UIDs from the request string are considered.

Finally, to make things more concrete: requesting a specific UID such as AAF41719.1 returns a large amount of unrequested data. Is there a way to make efetch return only the requested UIDs?

entrez eutils • 1.2k views

The complexity parameter seems to have only the following options in the EntrezDirect version. I'm not sure where you got 4 from.

  -complexity    0 = default, 1 = bioseq, 3 = nuc-prot set

According to NCBI's efetch documentation, complexity=4 should return a minimal pub-set. I'm not sure what that means, but changing the complexity of a request for AAF41719.1 from 4 to 3 makes the response about 10 MB larger and adds extra fields to each entry, such as GBSeq_secondary-accessions.

3.7 years ago
ors9 • 0

After a bit more digging, following @genomax's answer, I found that removing the complexity parameter from my requests narrows the response down to one result in the case of AAF41719.1, and to the desired number of results in the case of the sample list.
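For reference, here is a minimal sketch of building the same request with the complexity parameter left out by default. The base URL and the db, rettype, retmode and id parameters are taken from the example request in the question; the helper name build_efetch_url is hypothetical.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(ids, complexity=None):
    """Build an efetch URL for protein records; complexity is omitted unless given."""
    params = {
        "db": "protein",
        "rettype": "gp",
        "retmode": "xml",
        "id": ",".join(ids),
    }
    if complexity is not None:
        # Only add the parameter when the caller explicitly asks for it.
        params["complexity"] = str(complexity)
    # Keep commas literal so the id list stays readable in the URL.
    return EFETCH_URL + "?" + urlencode(params, safe=",")


url = build_efetch_url(["AAF41719.1", "P04637.2"])
print(url)
```

Calling build_efetch_url(ids, complexity=4) reproduces the form of the original request, which makes it easy to compare the two responses side by side.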

Thanks a bunch, genomax!

3.7 years ago
GenoMax 141k

I get only one result when I request that accession number in the following formats.

$ efetch -db protein -id "AAF41719.1" -format acc
AAF41719.1

$ efetch -db protein -id "AAF41719.1" -format fasta
>AAF41719.1 pyruvate dehydrogenase, E3 component, lipoamide dehydrogenase [Neisseria meningitidis MC58]
MALVELKVPDIGGHENVDIIAVEVNVGDTIAVDDTLITLETDKATMDVPAEVAGVVKEVKVKVGDKISEG
GLIVVVEAEGTAAAPKAEAAAAPAQEAPKAAAPAPQAAQFGGSADAEYDVVVLGGGPGGYSAAFAAADEG

UPDATE: Requesting XML format appears to return data for many more records than just AAF41719.

$ efetch -db protein -id "AAF41719" -format native -mode xml | grep Textseq-id_accession | wc -l
    2064

OR

$ esearch -db protein -query "AAF41719"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2 | wc -l
    2064

With other accessions you get more reasonable results:

$ esearch -db protein -query "Q16385"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2
              <Textseq-id_accession>Q16385</Textseq-id_accession>

$ esearch -db protein -query "CAB36970"| efetch -format native -mode xml > testxml2
$ grep Textseq-id_accession testxml2
                      <Textseq-id_accession>X79200</Textseq-id_accession>
                      <Textseq-id_accession>CAB36970</Textseq-id_accession>
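If you want the same check inside a Python program rather than with grep, something along these lines should work. It is only a sketch: it assumes the native XML uses Textseq-id_accession elements as in the output above, the function name is made up, and the sample is a synthetic fragment rather than a real efetch response.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_accessions(xml_text):
    """Count Textseq-id_accession elements, keyed by accession text."""
    root = ET.fromstring(xml_text)
    return Counter(el.text for el in root.iter("Textseq-id_accession"))


# Synthetic fragment for illustration only, not real efetch output.
SAMPLE = """<Bioseq-set>
  <Textseq-id><Textseq-id_accession>X79200</Textseq-id_accession></Textseq-id>
  <Textseq-id><Textseq-id_accession>CAB36970</Textseq-id_accession></Textseq-id>
</Bioseq-set>"""

counts = count_accessions(SAMPLE)
print(sum(counts.values()), sorted(counts))
```

Unlike grep piped to wc -l, this counts elements rather than lines, so the totals match only when each accession element sits on its own line.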

Sorry for the late reply, but this doesn't seem to work in my case. After downloading NCBI's response, this happens:

cat /tmp/mozilla_userX/sequence.fasta | grep | wc -l returns 2064 results.


Were you requesting XML format? We know you will get that many records if you do. If you request FASTA format you should get one sequence.

I was using EntrezDirect on the command line (not the web version), so the behavior may differ from the E-utilities web service.


I tried both formats using the web service (see the response above), and both retrieved over 2000 results (the grep searched for entries by >; I missed that). But you didn't set a complexity in your request, and it worked when I tried that (see the other answer).
