What does 'complete genome' in NCBI include
5.3 years ago

Hey,

I have a fairly basic question but cannot find an answer online. When an NCBI sequence is labeled 'complete genome', what does it actually contain? For a bacterial sequence, does it contain only the chromosomal sequence, or does it contain both chromosomal and plasmid sequences, and thus the complete DNA found in the cell?

ncbi complete genome bioinformatics • 1.7k views
5.3 years ago
5heikki 10k

Applies to all assemblies:

   *_genomic.fna.gz file
FASTA format of the genomic sequence(s) in the assembly. Repetitive
sequences in eukaryotes are masked to lower-case (see below).
The FASTA title is formatted as sequence accession.version plus
description. The genomic.fna.gz file includes all top-level sequences in
the assembly (chromosomes, plasmids, organelles, unlocalized scaffolds,
unplaced scaffolds, and any alternate loci or patch scaffolds). Scaffolds
that are part of the chromosomes are not included because they are
redundant with the chromosome sequences; sequences for these placed
scaffolds are provided under the assembly_structure directory.
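One way to check this for a given assembly is to list the FASTA titles in its `*_genomic.fna.gz` file and see whether any of them describe a plasmid. The sketch below is only a rough illustration: it relies on the title format quoted above (accession.version plus description) and assumes that plasmid records mention the word "plasmid" in their description, which is a heuristic and not something the quoted text guarantees. The function names and the demo titles are made up for the example.

```python
import gzip
import io


def list_top_level_sequences(handle):
    """Return (accession.version, description) for every FASTA record in a text handle."""
    records = []
    for line in handle:
        if line.startswith(">"):
            title = line[1:].strip()
            # The title is "accession.version description", split on the first space.
            accession, _, description = title.partition(" ")
            records.append((accession, description))
    return records


def split_by_replicon(records):
    """Group accessions by whether the description mentions 'plasmid' (heuristic assumption)."""
    groups = {"plasmid": [], "other": []}
    for accession, description in records:
        key = "plasmid" if "plasmid" in description.lower() else "other"
        groups[key].append(accession)
    return groups


def read_gzipped_fasta_titles(path):
    """Open a local *_genomic.fna.gz file and list its records."""
    with gzip.open(path, "rt") as handle:
        return list_top_level_sequences(handle)


if __name__ == "__main__":
    # Hypothetical titles, for illustration only (not real accessions).
    demo = io.StringIO(
        ">ACC1.1 Example organism chromosome, complete genome\nACGT\n"
        ">ACC2.1 Example organism plasmid pX, complete sequence\nACGT\n"
    )
    print(split_by_replicon(list_top_level_sequences(demo)))
```

For a downloaded file, `split_by_replicon(read_gzipped_fasta_titles(path))` would give the same grouping, where `path` points to the local `*_genomic.fna.gz` file.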



Although the text above says that, it may not be true (any longer?). See the related thread by @wanderingstefan: Download complete bacterial genomes and associated plasmid sequences from NCBI


Can you name at least one example where the above does not apply?


Since @wanderingstefan had posted this and another thread, I (wrongly) assumed that it was done after due diligence. On double-checking, it does look like the "genomic.fna.gz" file contains the associated plasmid sequences.


His problem was going through Entrez. I'm pretty sure nobody, even at the NCBI, knows comprehensively how Entrez queries work. At least it's not fully documented anywhere.


I am a little confused here. Does the above answer also apply to complete genomes downloaded from the 'nucleotide' database at NCBI? My statement that plasmid sequences are not contained in the 'complete genome' files from the 'nucleotide' database was based on a BLAST search of some whole genomes against a BLAST database containing the sequences of all plasmids in NCBI RefSeq, followed by calculating sequence coverage for the plasmids. May I ask how you checked, @genomax2?

edit: I have to add that I terminated the analysis after around 100 random genomes, as I was unable to identify plasmids in any of them. I will check this again.
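For anyone repeating the coverage check described above, here is a minimal sketch of the coverage step only. It assumes you already have, for each plasmid, the hit coordinates on the plasmid (for example the subject start and end columns of BLAST tabular output, 1-based and inclusive, with start greater than end for minus-strand hits) and the plasmid length. The function name is made up for the example, and this is not necessarily how the original analysis was done.

```python
def covered_fraction(intervals, subject_length):
    """Fraction of positions 1..subject_length covered by at least one hit interval.

    intervals: iterable of (start, end) pairs, 1-based inclusive; start may be
    larger than end (minus-strand hits), so each pair is normalized first.
    """
    if subject_length <= 0:
        raise ValueError("subject_length must be positive")

    normalized = sorted((min(s, e), max(s, e)) for s, e in intervals)

    covered = 0
    current_start = current_end = None
    for start, end in normalized:
        if current_end is None or start > current_end + 1:
            # Gap before this interval: close the previous merged block.
            if current_end is not None:
                covered += current_end - current_start + 1
            current_start, current_end = start, end
        else:
            # Overlapping or adjacent: extend the current merged block.
            current_end = max(current_end, end)
    if current_end is not None:
        covered += current_end - current_start + 1

    return covered / subject_length
```

Overlapping hits are merged before counting, so a region hit several times is only counted once. A plasmid whose covered fraction is close to 1 is then a candidate for being present in the genome file.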


It applies when you look up your assemblies of interest in this large file (do not open it in a browser!) and then download the "*_genomic.fna.gz" file from the FTP directory given in column 20 of that file.
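A rough sketch of that lookup in Python, under a few assumptions: the summary file is tab-separated with comment lines starting with `#`, column 12 holds the assembly level (e.g. "Complete Genome"), and the FASTA file is named after the last component of the FTP directory followed by `_genomic.fna.gz`. Only the use of column 20 comes from this thread; the rest, including the function and constant names, should be checked against the actual file.

```python
import csv
import io

FTP_PATH_COLUMN = 20        # 1-based, as stated in the thread
ASSEMBLY_LEVEL_COLUMN = 12  # 1-based, assumption about the summary file layout


def genomic_fasta_urls(handle, level="Complete Genome"):
    """Build *_genomic.fna.gz URLs for all rows with the requested assembly level."""
    urls = []
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Skip blank lines, comment/header lines, and rows that are too short.
        if not row or row[0].startswith("#") or len(row) < FTP_PATH_COLUMN:
            continue
        if row[ASSEMBLY_LEVEL_COLUMN - 1] != level:
            continue
        ftp_dir = row[FTP_PATH_COLUMN - 1].rstrip("/")
        if ftp_dir in ("", "na"):
            continue
        # Assumed naming: <ftp_dir>/<last path component>_genomic.fna.gz
        basename = ftp_dir.rsplit("/", 1)[-1]
        urls.append(f"{ftp_dir}/{basename}_genomic.fna.gz")
    return urls
```

With a local copy of the summary file, `genomic_fasta_urls(open(path, newline=""))` returns the list of URLs, which can then be passed to a download tool. Reading the file line by line like this avoids loading it into a browser or spreadsheet.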


Hey, thanks for the clarification. Yes, those files do contain all plasmid sequences. I found them as well and downloaded the suitable assemblies.