Retrieve all ids from NCBI
6.6 years ago
ypriverol • 0

Hi all: Does anybody know a way to retrieve all BioProject IDs from NCBI?

Regards Yasset

ncbi • 2.2k views
6.6 years ago
GenoMax 141k

With NCBI eUtils:

esearch -query "P*" -db bioproject | efetch -format docsum | xtract -pattern DocumentSummary -element Project_Acc Project_Title

produces

PRJNA403305     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 3 transcriptome
PRJNA403304     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 2 transcriptome
PRJNA403303     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Fe 1 transcriptome
PRJNA403302     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 3 transcriptome
PRJNA403301     Penicillium aculeatus Gene Expression Profiling - P-Pe223 Al 2 transcriptome

Hi @genomax, thanks for your quick answer. Do you know a way to do it programmatically? I found this one, https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999, but I don't know if it is the best one.

Regards Yasset
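
As a rough sketch of the programmatic route, the esearch URL above can be requested in pages using the standard `retstart` and `retmax` parameters instead of one very large `retmax`. The function names (`esearch_url`, `xml_count`, `xml_ids`, `fetch_all_ids`) and the page size of 10000 are illustrative assumptions, not part of any NCBI tool, and the maximum page size NCBI accepts should be checked against the current E-utilities documentation. Note that esearch returns numeric UIDs, not PRJ accessions.

```shell
#!/usr/bin/env bash
# Sketch only: page through esearch hits for the bioproject database.
# Assumes curl and grep -o are available; the page size is an illustrative choice.

EUTILS="https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

# Build an esearch URL for one page of results (term must already be URL-encoded).
esearch_url() {
  local db="$1" term="$2" retstart="$3" retmax="$4"
  printf '%s?db=%s&term=%s&retstart=%s&retmax=%s\n' \
    "$EUTILS" "$db" "$term" "$retstart" "$retmax"
}

# Read esearch XML on stdin and print the total hit count (the first <Count>).
xml_count() {
  grep -o '<Count>[0-9]*</Count>' | head -n 1 | sed 's:</*Count>::g'
}

# Read esearch XML on stdin and print one numeric UID per line.
xml_ids() {
  grep -o '<Id>[0-9]*</Id>' | sed 's:</*Id>::g'
}

# Hypothetical driver: fetch every UID matching all[filter], one page at a time.
# Defined but not called here, so sourcing this file makes no network requests.
fetch_all_ids() {
  local page=10000 start=0 total
  total=$(curl -s "$(esearch_url bioproject 'all%5Bfilter%5D' 0 0)" | xml_count)
  while [ "$start" -lt "$total" ]; do
    curl -s "$(esearch_url bioproject 'all%5Bfilter%5D' "$start" "$page")" | xml_ids
    start=$((start + page))
    sleep 1   # stay under NCBI's limit of 3 requests per second without an API key
  done
}
```

Calling `fetch_all_ids > bioproject_uids.txt` would then write one UID per line; turning UIDs into accessions still needs a docsum lookup like the one in the answer above.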


I wondered just how many bioprojects there are in total. Running the search on its own tells us that:

esearch -query "P*" -db bioproject

prints:

<ENTREZ_DIRECT>
  <Db>bioproject</Db>
  <WebEnv>NCID_1_18646926_130.14.22.215_9001_1505146040_1821616193_0MetA0_S_MegaStore_F_1</WebEnv>
  <QueryKey>1</QueryKey>
  <Count>10454</Count>
  <Step>1</Step>
</ENTREZ_DIRECT>

so there are 10454 bioprojects at NCBI.


Amusingly, after some investigation, I came to believe that a wildcard search at NCBI does not do what you, I, and most people think a wildcard search should do.

Instead, it creates an expanded search query that includes all terms that match the wildcard. So P*[Project Accession] will create and run the search:

phs000001[Project Accession] OR phs000004[Project Accession] OR phs000005[Project Accession] OR
phs000007[Project Accession] OR phs000016[Project Accession] OR phs000017[Project Accession] OR
phs000018[Project Accession] OR phs000019[Project Accession] OR phs000020[Project Accession] OR
phs000021[Project Accession] OR phs000048[Project Accession] OR phs000086[Project Accession] OR 
phs000088[Project Accession] OR phs000089[Project Accession] OR phs000090[Project Accession] OR
phs000091[Project Accession] OR phs000092[Project Accession] OR phs000093[Project Accession] OR
phs000094[Project Accession] OR phs000095[Project Accession] OR phs000096[Project Accession] OR
phs000100[Project Accession] OR phs000101[Project Accession] OR phs000102[Project Accession] OR
phs000103[Project ...

and so on, until a predefined limit on the size of the query string is reached. That's why it returns only a subset of the results.

The more we know ...
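
One way to check this behaviour, sketched under assumptions: an esearch XML response normally carries a `QueryTranslation` element holding the query that was actually run, so counting the OR-ed terms in it shows how far the wildcard was expanded. The helper names `query_translation` and `count_or_terms` are made up for illustration, and the exact form of the translation NCBI returns for a wildcard is not confirmed here.

```shell
#!/usr/bin/env bash
# Sketch only: inspect how esearch translated a wildcard query.

# Read esearch XML on stdin and print the text of <QueryTranslation>.
# Newlines are flattened first so a translation spanning lines is still matched.
query_translation() {
  tr '\n' ' ' | sed -n 's:.*<QueryTranslation>\(.*\)</QueryTranslation>.*:\1:p'
}

# Read a translated query on stdin and count the [Project Accession] terms in it.
count_or_terms() {
  grep -o '\[Project Accession\]' | wc -l | tr -d ' '
}

# Live usage (assumes curl and network access):
#   curl -s "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=P*%5BProject+Accession%5D&retmax=0" \
#     | query_translation | count_or_terms
```

If the number of expanded terms is far below the database total, that would be consistent with the truncation described above.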


According to this page there are 228784 entries (as of today). So perhaps there are some that are not being captured by this query. Every project ID does appear to start with PR*. Mysteries of eUtils.


Interesting, the perils of matching on names. Good to know.


This does not make complete sense. Every project name starts with P, but we get different answers depending on where and how we look. See my comment below @Pierre's answer.


This is strange:

Your results are: 10454

My results with the url are: 246934 (https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=bioproject&term=all%5Bfilter%5D&retmax=999999)

The results in their browser are: 228784 (https://www.ncbi.nlm.nih.gov/bioproject/browse/)

6.6 years ago

Everything (ID, organism, date...) is available in ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt
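
A minimal sketch of pulling the accessions out of that file once it is downloaded. It matches on the accession pattern rather than a column number, because the column layout of summary.txt is not shown in this thread. The pattern (PRJ, two capital letters, then digits, as in PRJNA403305) and the function name `list_accessions` are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch only: list the distinct BioProject accessions found in a summary file.

# Print every distinct accession-like token (e.g. PRJNA403305) in file $1.
# Assumes accessions look like PRJ + two capital letters + digits.
list_accessions() {
  grep -oE 'PRJ[A-Z]{2}[0-9]+' "$1" | sort -u
}

# Live usage (assumes curl and network access):
#   curl -s -o summary.txt ftp://ftp.ncbi.nlm.nih.gov/bioproject/summary.txt
#   list_accessions summary.txt | wc -l
```

Because of `sort -u`, an accession that appears on several lines is counted once, so the total can differ from a plain line count of the file.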

228784 summary.txt

Seems to match the number obtained from browser.

But the database information for bioproject gets you this:

<DbInfo>
        <DbName>bioproject</DbName>
        <MenuName>BioProject</MenuName>
        <Description>BioProject Database</Description>
        <DbBuild>Build170911-0610.1</DbBuild>
        <Count>246934</Count>
        <LastUpdate>2017/09/11 07:02</LastUpdate>

Thanks for your quick response.
