Question

Retrieve all ids from NCBI

0

6.6 years ago

ypriverol • 0

Hi all: Does anybody know how to retrieve all BioProject IDs from NCBI?

Regards Yasset

ncbi • 2.2k views

ADD COMMENT • link updated 6.6 years ago by Pierre Lindenbaum 161k • written 6.6 years ago by ypriverol • 0

score 1 · Answer 1 · 2017-09-11

1

6.6 years ago

GenoMax 141k

With NCBI eUtils:

esearch -query "P*" -db bioproject | efetch -format docsum | xtract -pattern DocumentSummary -element Project_Acc Project_Title

produces

PRJNA403305     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome
PRJNA403304     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome
PRJNA403303     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 1 transcriptome
PRJNA403302     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 3 transcriptome
PRJNA403301     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 2 transcriptome

ADD COMMENT • link 6.6 years ago by GenoMax 141k

0

Hi @genomax, thanks for your quick answer. Do you know a way to do it programmatically? I found this one: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999 but I don't know if it is the best one.
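For the URL route, one option is to fetch the result set in pages rather than with a single very large retmax. The following is only a sketch: it assumes the standard esearch parameters retstart and retmax, and the function names (esearch_url, page_offsets) and the page size of 10000 are made up for illustration. Note that esearch returns numeric UIDs, not PRJ accessions.

```shell
#!/bin/sh
# Hypothetical helper: build an esearch URL for one page of BioProject UIDs.
# Arguments: retstart (offset of the first record), retmax (page size).
esearch_url() {
    base="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"
    printf '%s?db=bioproject&term=all%%5Bfilter%%5D&retstart=%s&retmax=%s\n' \
        "$base" "$1" "$2"
}

# Hypothetical helper: print the offsets needed to cover `total` records
# in pages of `size` records each.
page_offsets() {
    total=$1
    size=$2
    start=0
    while [ "$start" -lt "$total" ]; do
        echo "$start"
        start=$((start + size))
    done
}

# Example loop (not run here; needs network access and curl):
# for s in $(page_offsets 246934 10000); do
#     curl -s "$(esearch_url "$s" 10000)"
#     sleep 1   # pause between requests to respect NCBI's rate limits
# done
```

The total passed to page_offsets would normally come from the Count element of a first esearch response instead of being hard-coded.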

Regards Yasset

ADD REPLY • link 6.6 years ago by ypriverol • 0

0

I wondered just how many bioprojects there are in total. Running the search on its own tells us:

esearch -query "P*" -db bioproject

prints:

<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_18646926_130.14.22.215_9001_1505146040_1821616193_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>10454</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

so there are 10454 bioprojects at NCBI.
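To pull just the number out of that XML in a script, a small filter is enough. This is a sketch using plain sed; the function name extract_count is hypothetical. With EDirect installed, xtract -pattern ENTREZ_DIRECT -element Count should give the same value.

```shell
#!/bin/sh
# Hypothetical helper: read esearch XML on stdin and print the first <Count> value.
extract_count() {
    sed -n 's:.*<Count>\([0-9][0-9]*\)</Count>.*:\1:p' | head -n 1
}

# Usage (not run here; needs EDirect and network access):
# esearch -query "P*" -db bioproject | extract_count
```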

ADD REPLY • link 6.6 years ago by Istvan Albert 100k

1

Amusingly, after some investigation, I came to believe that a wildcard search at NCBI does not do what you, I, and most people think a wildcard search should do.

Instead, it creates an expanded search query that includes all terms matching the wildcard. So P*[Project Accession] will create and run the search:

phs000001[Project Accession] OR phs000004[Project Accession] OR phs000005[Project Accession] OR
phs000007[Project Accession] OR phs000016[Project Accession] OR phs000017[Project Accession] OR
phs000018[Project Accession] OR phs000019[Project Accession] OR phs000020[Project Accession] OR
phs000021[Project Accession] OR phs000048[Project Accession] OR phs000086[Project Accession] OR 
phs000088[Project Accession] OR phs000089[Project Accession] OR phs000090[Project Accession] OR
phs000091[Project Accession] OR phs000092[Project Accession] OR phs000093[Project Accession] OR
phs000094[Project Accession] OR phs000095[Project Accession] OR phs000096[Project Accession] OR
phs000100[Project Accession] OR phs000101[Project Accession] OR phs000102[Project Accession] OR
phs000103[Project ...

and so on, until a predefined limit on the size of the query string is reached. That is why it returns only a subset of the results.
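To make the truncation concrete, here is a toy model of that behaviour. It is an assumption for illustration only, not NCBI's implementation: the function name expand_wildcard, the case-sensitive matching, and the length limit are all made up.

```shell
#!/bin/sh
# Toy model (an assumption, not NCBI's code): expand PREFIX* over the terms
# read from stdin, OR-ing matches together until the query string would
# exceed MAXLEN characters. Matches past that point are silently dropped.
expand_wildcard() {
    prefix=$1
    maxlen=$2
    query=""
    while IFS= read -r term; do
        case $term in
            "$prefix"*) ;;      # keep terms that match the wildcard
            *) continue ;;      # skip everything else
        esac
        piece="${term}[Project Accession]"
        if [ -z "$query" ]; then
            candidate=$piece
        else
            candidate="$query OR $piece"
        fi
        # Stop expanding once the length limit would be exceeded.
        if [ "${#candidate}" -gt "$maxlen" ]; then
            break
        fi
        query=$candidate
    done
    printf '%s\n' "$query"
}

# Usage: three terms match "phs", but only two fit within 70 characters.
# printf 'phs000001\nprjna1\nphs000004\nphs000005\n' | expand_wildcard phs 70
```

In this model the third matching term never reaches the search, which mirrors how a wildcard query can report far fewer hits than the database holds.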

The more we know ...

ADD REPLY • link 6.6 years ago by Istvan Albert 100k

0

According to this page there are 228784 entries (as of today). So perhaps there are some that are not being captured by this query. Every project ID does appear to start with PR*. Mysteries of eUtils.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

Interesting, the perils of matching on names. Good to know.

ADD REPLY • link 6.6 years ago by Istvan Albert 100k

0

Does not make complete sense. Every project name starts with P but there are different answers depending on where/how we look. See my comment below @Pierre's answer.

ADD REPLY • link 6.6 years ago by GenoMax 141k

0

This is strange:

Your results are: 10454

My results with the url are: 246934 (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999)

The results in their browser are: 228784 (https://www.ncbi.nlm.nih.gov/bioproject/browse/)

ADD REPLY • link 6.6 years ago by ypriverol • 0

score 1 · Answer 2 · 2017-09-11

1

6.6 years ago

Pierre Lindenbaum 161k

Everything (ID, organism, date...) is available at ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt
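If only the accessions are needed, they can be pulled out of that file with a pattern match. This is a sketch: it assumes accessions have the form PRJ, two capital letters, then digits (as in PRJNA403305), and the function name list_accessions and the output file name are hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: print the unique BioProject accessions found on stdin.
# Assumes accessions look like PRJ + two capital letters + digits.
list_accessions() {
    grep -oE 'PRJ[A-Z]{2}[0-9]+' | sort -u
}

# Usage (not run here; needs network access and curl):
# curl -s ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt | list_accessions > bioproject_ids.txt
```

Matching on the pattern avoids relying on a particular column position in the file, at the cost of also picking up any accession that happens to be mentioned in a free-text field.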

ADD COMMENT • link 6.6 years ago by Pierre Lindenbaum 161k

1

228784 summary.txt

Seems to match the number obtained from browser.

But the information record for the BioProject database gives you this:

<DbInfo>
        <DbName>bioproject</DbName>
        <MenuName>BioProject</MenuName>
        <Description>BioProject Database</Description>
        <DbBuild>Build170911-0610.1</DbBuild>
        <Count>246934</Count>
        <LastUpdate>2017/09/11 07:02</LastUpdate>