Question: How to get total count of organisms with whole genome sequenced
3 months ago by
ruchikabhat3130 wrote:

Hi all,

I checked the NCBI FTP site, ftp://ftp.ncbi.nih.gov/genomes/ , where the number of organisms listed is approximately 389 (for eukaryotes, I guess), with a separate directory for viruses. However, https://www.ncbi.nlm.nih.gov/genome/browse/ shows about 7313 for prokaryotes and 35 for eukaryotes (keeping only complete genomes in both cases), and 7150 for viruses. Which figure should one report as the total number of organisms sequenced to date and submitted to NCBI? I would appreciate the number and its source; a breakdown into eukaryotes, prokaryotes and viruses would be even better.

Thanks all.

Ruchika

organisms genome ncbi • 212 views
ADD COMMENTlink modified 3 months ago by shenwei3563.1k • written 3 months ago by ruchikabhat3130
3 months ago by
a.zielezinski7.0k wrote:

See The Genomes OnLine Database (GOLD). It is a web-based resource for comprehensive information regarding genome and metagenome sequencing projects, and their associated metadata, around the world.

The database provides all the statistics for various statuses of sequencing projects, such as:

  • complete genomes
  • complete and published genomes
  • permanent drafts
  • incomplete projects
  • abandoned projects
ADD COMMENTlink modified 3 months ago • written 3 months ago by a.zielezinski7.0k
3 months ago by
planet earth
piet1.4k wrote:

The principal problem is how you define the term "complete genome". For most organisms it is still impossible to obtain complete sequences of all replicons in a cell. The outcome of a sequencing experiment is limited both by the method used for nucleic acid extraction and by the sequencing approach.

How to define "complete genome" appropriately will depend on the context of your research. You should write down your own definition first, and then check whether the available genomes fit it.

ADD COMMENTlink modified 3 months ago • written 3 months ago by piet1.4k

By complete I mean that the whole genome has been sequenced, in the same way that sequencing projects are labelled as complete genomes, short contigs, etc. NCBI uses the term "reference genome" for the ones that have been fully sequenced and manually curated.

To avoid further confusion: by complete I mean that the genome has been fully sequenced and released for public use. Many thanks.

ADD REPLYlink written 3 months ago by ruchikabhat3130

By complete I mean whole genome sequence has been sequenced.

The human genome has been sequenced since the early 2000s, but people are still working on refining it, and parts of it are certainly intractable to sequencing with past and current technologies.

complete I mean where the genome has been fully sequenced and reported for public usage

Then why not take the entire list from NCBI genomes?

ADD REPLYlink written 3 months ago by genomax30k

Yes, but my question is which value to take, since the FTP site gives different values, as I mentioned in the main question. If I check the FTP site on a given date, the number is very different from the one at https://www.ncbi.nlm.nih.gov/genome/browse/

That is what I asked in the main question too. Please guide me. Thanks.

ADD REPLYlink written 3 months ago by ruchikabhat3130
3 months ago by
China
shenwei3563.1k wrote:

I just counted the bacterial species with complete genomes listed in ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/bacteria/assembly_summary.txt and got 2517.

$ cat assembly_summary.txt | grep "Complete Genome" | cut -f 7 | sort | uniq | wc -l 
2517
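
The grep above matches "Complete Genome" anywhere on a line. A slightly stricter variant, offered only as a sketch, tests the assembly-level column itself and reports both the number of matching assemblies and the number of distinct species. It assumes the tab-separated layout in which column 7 is species_taxid and column 12 is assembly_level; please check the header line of your own download before relying on these positions. The function name count_complete is made up for this example.

```shell
#!/bin/sh
# Sketch: count assemblies and distinct species whose assembly level is
# exactly "Complete Genome" in an NCBI assembly_summary file.
# ASSUMPTION: tab-separated columns, 7 = species_taxid, 12 = assembly_level.
# count_complete is a hypothetical helper name, not an NCBI tool.
count_complete() {
    awk -F '\t' '
        /^#/ { next }                       # skip header/comment lines
        $12 == "Complete Genome" {
            assemblies++                    # every matching row
            if (!seen[$7]++) species++      # first row for this species_taxid
        }
        END { printf "%d\t%d\n", assemblies, species }
    ' "$1"
}

# Usage, with whichever summary file you downloaded:
# count_complete assembly_summary.txt
```

The output is two tab-separated numbers: the first counts every complete-genome assembly (several strains of one species each count), the second counts each species_taxid once. Whichever you cite, it helps to note the file and the download date alongside it.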
ADD COMMENTlink written 3 months ago by shenwei3563.1k

Along similar lines, I checked the RefSeq file:

$ cat assembly_summary_refseq.txt | grep "Complete Genome" | cut -f 7 | sort | uniq | wc -l
9788

Do you think 9788 organisms would be a reliable figure to report?

ADD REPLYlink written 3 months ago by ruchikabhat3130

And the GenBank file:

$ cat assembly_summary_genbank.txt | grep "Complete Genome" | cut -f 7 | sort | uniq | wc -l
2858

Which one should I report?

ADD REPLYlink written 3 months ago by ruchikabhat3130

You realize that these numbers are subject to change (perhaps daily)? New data is added to NCBI/GenBank each night.

Why not report both, with appropriate notes?

ADD REPLYlink written 3 months ago by genomax30k

I just need to cite one number as the total number of organisms with a whole genome sequenced. Citing two different numbers will create chaos.

ADD REPLYlink written 3 months ago by ruchikabhat3130

If you strictly need one number, then report the 2858 that you came up with above. You will have to qualify it by indicating that you are counting only one entry per taxid. The other number that could be reported is 7412 (which does not collapse entries to unique taxids).

Citing two different numbers will create chaos.

Where :)

ADD REPLYlink modified 3 months ago • written 3 months ago by genomax30k