Question: Retrieve all ids from NCBI
ypriverol wrote (9 months ago):

Hi all: Does anybody know how to retrieve all BioProject IDs from NCBI?

Regards Yasset

genomax wrote (9 months ago):

With NCBI eUtils (the Entrez Direct command-line tools):

esearch -query "P*" -db bioproject | efetch -format docsum | xtract -pattern DocumentSummary -element Project_Acc Project_Title

produces

PRJNA403305     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome
PRJNA403304     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome
PRJNA403303     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 1 transcriptome
PRJNA403302     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 3 transcriptome
PRJNA403301     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 2 transcriptome
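
If only the accessions are needed, the two-column output above can be split in a few lines of Python. This is a minimal sketch, assuming the accession comes first and the columns are separated by whitespace (xtract emits tab-delimited columns by default). `parse_xtract_lines` is a hypothetical helper name, not part of EDirect.

```python
def parse_xtract_lines(text):
    """Split 'ACCESSION<whitespace>TITLE' lines into (accession, title) pairs."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split once on the first whitespace run; the title itself contains spaces.
        parts = line.split(None, 1)
        accession = parts[0]
        title = parts[1] if len(parts) > 1 else ""
        pairs.append((accession, title))
    return pairs


# Two lines copied from the output above, with a tab between the columns.
sample = (
    "PRJNA403305\tPenicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome\n"
    "PRJNA403304\tPenicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome\n"
)
accessions = [acc for acc, _ in parse_xtract_lines(sample)]
print(accessions)  # ['PRJNA403305', 'PRJNA403304']
```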

Hi @genomax, thanks for your quick answer. Do you know a way to do it programmatically? I found this one, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999, but I don't know if it is the best one.

Regards Yasset

Reply written 9 months ago by ypriverol
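
The URL approach above can be scripted. The sketch below pages through the `all[filter]` result set with `retstart`/`retmax` instead of relying on a single very large `retmax`, since NCBI caps `retmax` per request. The page size of 10,000, the pause between requests, and the helper names are assumptions to check against the current E-utilities documentation. Note that ESearch returns numeric UIDs, not `PRJNA...` accessions.

```python
import time
import xml.etree.ElementTree as ET
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def esearch_url(term, retstart=0, retmax=10000, db="bioproject"):
    """Build an ESearch URL for one page of results."""
    params = {"db": db, "term": term, "retstart": retstart, "retmax": retmax}
    return ESEARCH + "?" + urlencode(params)


def parse_esearch(xml_data):
    """Return (total count, list of UIDs) from an ESearch XML response (str or bytes)."""
    root = ET.fromstring(xml_data)
    count = int(root.findtext("Count"))
    ids = [elem.text for elem in root.findall("./IdList/Id")]
    return count, ids


def fetch_all_uids(term="all[filter]", page_size=10000, pause=0.4):
    """Page through the whole result set. Needs network access."""
    uids, start = [], 0
    while True:
        with urlopen(esearch_url(term, start, page_size)) as resp:
            count, ids = parse_esearch(resp.read())
        uids.extend(ids)
        start += page_size
        if not ids or start >= count:
            return uids
        time.sleep(pause)  # assumed pause, to stay under NCBI's request rate limit


# Offline check with a made-up response (the UIDs here are illustrative only).
sample = (
    "<eSearchResult><Count>3</Count><RetMax>3</RetMax><RetStart>0</RetStart>"
    "<IdList><Id>403305</Id><Id>403304</Id><Id>403303</Id></IdList></eSearchResult>"
)
print(esearch_url("all[filter]"))
print(parse_esearch(sample))
```

Calling `fetch_all_uids()` performs the real requests. Mapping the UIDs to accessions would still need a second step, such as the docsum approach in the answer above.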

I wondered just how many bioprojects there are in total. Running the search on its own tells us that:

esearch -query "P*" -db bioproject

prints:

<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_18646926_130.14.22.215_9001_1505146040_1821616193_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>10454</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

so this query matches 10454 bioprojects at NCBI.

Reply written 9 months ago by Istvan Albert

Amusingly, after some investigation, I came to believe that a wildcard search at NCBI does not do what you, I, and most people think a wildcard search should do.

What it does instead is that it creates an expanded search query that includes all terms that match the wildcard. So P*[Project Accession] will create and run the search:

phs000001[Project Accession] OR phs000004[Project Accession] OR phs000005[Project Accession] OR
phs000007[Project Accession] OR phs000016[Project Accession] OR phs000017[Project Accession] OR
phs000018[Project Accession] OR phs000019[Project Accession] OR phs000020[Project Accession] OR
phs000021[Project Accession] OR phs000048[Project Accession] OR phs000086[Project Accession] OR 
phs000088[Project Accession] OR phs000089[Project Accession] OR phs000090[Project Accession] OR
phs000091[Project Accession] OR phs000092[Project Accession] OR phs000093[Project Accession] OR
phs000094[Project Accession] OR phs000095[Project Accession] OR phs000096[Project Accession] OR
phs000100[Project Accession] OR phs000101[Project Accession] OR phs000102[Project Accession] OR
phs000103[Project ...

and so on, until a predefined limit on the query string size is reached. That's why it returns only a subset of results.

The more we know ...

Reply written 9 months ago by Istvan Albert
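
The expansion behaviour described above can be illustrated with a toy model. This is only a sketch of the idea, not NCBI's implementation: the index contents, the `max_terms` cutoff (standing in for the string size limit, whose actual value is not given in this thread), and the function name `expand_wildcard` are all made up for illustration.

```python
def expand_wildcard(prefix, index_terms, field, max_terms):
    """Toy model: expand 'prefix*' into an OR query over matching index terms,
    keeping only the first max_terms terms in case-insensitive sorted order."""
    matches = sorted(
        (t for t in index_terms if t.lower().startswith(prefix.lower())),
        key=str.lower,
    )
    kept = matches[:max_terms]
    query = " OR ".join(f"{term}[{field}]" for term in kept)
    return query, len(matches) - len(kept)


# Hypothetical index holding a few accessions of the kinds seen in this thread.
index = ["PRJNA403301", "phs000001", "PRJNA403302", "phs000004", "phs000005"]
query, dropped = expand_wildcard("P", index, "Project Accession", max_terms=3)
print(query)
print("terms left out:", dropped)  # terms left out: 2
```

In this toy run the `phs...` terms sort first and fill the quota, so the `PRJNA...` terms never make it into the expanded query.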

According to this page there are 228784 entries (as of today). So perhaps there are some that are not being captured by this query. Every project ID does appear to start with PR*. Mysteries of eUtils.

Reply written 9 months ago by genomax

Interesting, the perils of matching on names. Good to know.

Reply written 9 months ago by Istvan Albert

Does not make complete sense. Every project name starts with P but there are different answers depending on where/how we look. See my comment below @Pierre's answer.

Reply written 9 months ago by genomax

This is strange:

Your results are: 10454

My results with the url are: 246934 (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999)

The results in their browser are: 228784 (https://www.ncbi.nlm.nih.gov/bioproject/browse/)

Reply written 9 months ago by ypriverol
Pierre Lindenbaum wrote (9 months ago):

Everything (ID, organism, date...) is available in ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt

228784 summary.txt

Seems to match the number obtained from browser.

But the database information for BioProject (as returned by einfo) gives you this:

<DbInfo>
        <DbName>bioproject</DbName>
        <MenuName>BioProject</MenuName>
        <Description>BioProject Database</Description>
        <DbBuild>Build170911-0610.1</DbBuild>
        <Count>246934</Count>
        <LastUpdate>2017/09/11 07:02</LastUpdate>
Reply written 9 months ago by genomax
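
The gap between the two numbers above is easy to check in code. A minimal sketch: it reads `<Count>` from an einfo-style `DbInfo` block and counts lines in a local copy of summary.txt. The helper names and the local file path are assumptions, and so is the choice to count every non-empty line (whether the first line of the file is a header is not established in this thread).

```python
import xml.etree.ElementTree as ET


def dbinfo_count(xml_text):
    """Return the record count from an einfo-style <DbInfo> block."""
    return int(ET.fromstring(xml_text).findtext("Count"))


def count_lines(path):
    """Count non-empty lines in a text file such as a downloaded summary.txt."""
    with open(path, encoding="utf-8", errors="replace") as handle:
        return sum(1 for line in handle if line.strip())


# Fields from the block quoted above, closed here so that it parses as XML.
dbinfo = """<DbInfo>
  <DbName>bioproject</DbName>
  <Count>246934</Count>
</DbInfo>"""

einfo_total = dbinfo_count(dbinfo)
summary_lines = 228784  # line count reported above for summary.txt
print(einfo_total - summary_lines)  # 18150
```

With a downloaded copy of the file, `count_lines("summary.txt")` would replace the hard-coded 228784.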

Thanks for your quick response.

Reply written 9 months ago by ypriverol