Question: Retrieve all ids from NCBI
ypriverol wrote (15 months ago):

Hi all: Does anybody know how to retrieve all BioProject IDs from NCBI?

Regards Yasset

genomax (United States) wrote (15 months ago):

With NCBI eUtils (using the Entrez Direct command-line tools):

esearch -query "P*" -db bioproject | efetch -format docsum | xtract -pattern DocumentSummary -element Project_Acc Project_Title

produces

PRJNA403305     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome
PRJNA403304     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome
PRJNA403303     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 1 transcriptome
PRJNA403302     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 3 transcriptome
PRJNA403301     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 2 transcriptome

Hi @genomax, thanks for your quick answer. Do you know a way to do it programmatically? I found this one: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999 but I don't know if it is the best one.

Regards Yasset

written 15 months ago by ypriverol
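
For the programmatic route, the esearch URL above can be paged with retstart and retmax instead of relying on one very large retmax, since esearch caps how many UIDs a single request returns (the E-utilities documentation gives 100,000 for most databases). Below is a minimal shell sketch, assuming curl is installed. The function names (esearch_url, extract_count, extract_ids, fetch_all_uids) and the default page size of 10000 are illustrative choices, not part of eUtils. Note that esearch returns numeric UIDs, not PRJ accessions.

```shell
# Base URL of the esearch endpoint (same endpoint as the URL in the post above).
ESEARCH="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an esearch URL for one page of results.
# $1 = retstart (offset of the first UID), $2 = retmax (page size)
esearch_url() {
    printf '%s?db=bioproject&term=all%%5Bfilter%%5D&retstart=%s&retmax=%s\n' \
        "$ESEARCH" "$1" "$2"
}

# Read esearch XML on stdin and print the first <Count> value.
# Assumption: the first <Count> element is the total number of hits.
extract_count() {
    tr '<' '\n' | sed -n 's/^Count>\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Read esearch XML on stdin and print every <Id> value, one per line.
extract_ids() {
    tr '<' '\n' | sed -n 's/^Id>\([0-9][0-9]*\)$/\1/p'
}

# Page through the whole result set and print all UIDs.
# Hypothetical helper: needs curl and network access, so it only runs when called.
# $1 = page size (optional, defaults to 10000)
fetch_all_uids() {
    page=${1:-10000}
    total=$(curl -s "$(esearch_url 0 0)" | extract_count)
    [ -n "$total" ] || return 1          # no count means the request failed
    start=0
    while [ "$start" -lt "$total" ]; do
        curl -s "$(esearch_url "$start" "$page")" | extract_ids
        start=$((start + page))
        sleep 1                          # stay under NCBI's request-rate limit
    done
}
```

After sourcing the block, `fetch_all_uids > bioproject_uids.txt` would write one UID per line. The UIDs can then be passed to efetch or esummary to obtain the accessions.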

I wondered just how many bioprojects there are in total. Running the search on its own tells us that:

esearch -query "P*" -db bioproject

prints:

<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_18646926_130.14.22.215_9001_1505146040_1821616193_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>10454</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

so there are 10454 bioprojects at NCBI.

modified 15 months ago • written 15 months ago by Istvan Albert

Amusingly, after some investigation, I came to believe that a wildcard search at NCBI does not do what you and I and most people think a wildcard search should do.

Instead, it creates an expanded search query that includes all terms that match the wildcard. So P*[Project Accession] will create and run the search:

phs000001[Project Accession] OR phs000004[Project Accession] OR phs000005[Project Accession] OR
phs000007[Project Accession] OR phs000016[Project Accession] OR phs000017[Project Accession] OR
phs000018[Project Accession] OR phs000019[Project Accession] OR phs000020[Project Accession] OR
phs000021[Project Accession] OR phs000048[Project Accession] OR phs000086[Project Accession] OR 
phs000088[Project Accession] OR phs000089[Project Accession] OR phs000090[Project Accession] OR
phs000091[Project Accession] OR phs000092[Project Accession] OR phs000093[Project Accession] OR
phs000094[Project Accession] OR phs000095[Project Accession] OR phs000096[Project Accession] OR
phs000100[Project Accession] OR phs000101[Project Accession] OR phs000102[Project Accession] OR
phs000103[Project ...

and so on, until a predefined limit on the size of the query string is reached. That's why it returns only a subset of results.

The more we know ...

modified 15 months ago • written 15 months ago by Istvan Albert

According to this page there are 228784 entries (as of today). So perhaps there are some that are not being captured by this query. Every project ID does appear to start with PR*. Mysteries of eUtils.

modified 15 months ago • written 15 months ago by genomax

Interesting, the perils of matching on names. Good to know.

written 15 months ago by Istvan Albert

Does not make complete sense. Every project name starts with P but there are different answers depending on where/how we look. See my comment below @Pierre's answer.

modified 15 months ago • written 15 months ago by genomax

This is strange:

Your results are: 10454

My results with the url are: 246934 (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999)

The results in their browser are: 228784 (https://www.ncbi.nlm.nih.gov/bioproject/browse/)

modified 15 months ago • written 15 months ago by ypriverol
Pierre Lindenbaum (France/Nantes/Institut du Thorax - INSERM UMR1087) wrote (15 months ago):

All (ID, organism, date...) is available in ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt

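
To turn that summary file into a plain list of IDs, one option is to pull out everything that looks like a BioProject accession. This is a minimal shell sketch under two assumptions: accessions have the form PRJ, two capital letters, then digits (as in the PRJNA examples earlier in the thread), and grep supports -o (GNU and BSD grep do). The function name list_accessions is illustrative.

```shell
# Read a summary file on stdin and print the unique BioProject accessions in it.
# Assumption: accessions look like PRJ + two capital letters + digits (e.g. PRJNA403305).
list_accessions() {
    grep -oE 'PRJ[A-Z]{2}[0-9]+' | sort -u
}

# Usage (needs curl and network access):
#   curl -s ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt | list_accessions > accessions.txt
```

Matching on the accession pattern avoids depending on the column order of summary.txt, at the cost of also picking up any accession-like string that appears in other columns.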
228784 summary.txt

This seems to match the number obtained from the browser.

But requesting the information about the BioProject database gets you this:

<DbInfo>
        <DbName>bioproject</DbName>
        <MenuName>BioProject</MenuName>
        <Description>BioProject Database</Description>
        <DbBuild>Build170911-0610.1</DbBuild>
        <Count>246934</Count>
        <LastUpdate>2017/09/11 07:02</LastUpdate>
modified 15 months ago • written 15 months ago by genomax
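
The DbInfo block above has the shape of an einfo reply, although the comment does not say how it was obtained. Assuming it came from the einfo endpoint, the count could be fetched and extracted roughly like this. The function name dbinfo_count is illustrative, and the sketch assumes the first Count element in the reply is the database total, as in the sample above.

```shell
# Base URL of the einfo endpoint.
EINFO="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi"

# Read einfo XML on stdin and print the first <Count> value.
# Assumption: the first <Count> is the total number of records in the database.
dbinfo_count() {
    tr '<' '\n' | sed -n 's/^Count>\([0-9][0-9]*\)$/\1/p' | head -n 1
}

# Usage (needs curl and network access):
#   curl -s "$EINFO?db=bioproject" | dbinfo_count
```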

Thanks for your quick response.

written 15 months ago by ypriverol