Number of sequences in RefSeq.
18 months ago
poet1988 ▴ 30

Dear colleagues, there is something I cannot work out. When I download all the genomic sequences from the RefSeq database and count them, I get far fewer records than the release notes report (123394 organisms, https://ftp.ncbi.nlm.nih.gov/refseq/release/release-notes/RefSeq-release214.txt). What am I doing wrong?

1. wget ftp://ftp.ncbi.nlm.nih.gov/genomes/refseq/assembly_summary_refseq.txt   
2. awk -F "\t" '$12=="Complete Genome" && $11=="latest"{print $20}' assembly_summary_refseq.txt > ftpdirpaths           
3. awk 'BEGIN{FS=OFS="/";filesuffix="genomic.fna.gz"}{ftpdir=$0;asm=$10;file=asm"_"filesuffix;print ftpdir,file}' ftpdirpaths > ftpfilepaths 
4. wget -i ftpfilepaths
5. gunzip *.gz
6. cat *.fna > /media/sf_G_DRIVE/DataBase/RefSeq_All_2022/Refseq_214.fasta

grep -c '>' Refseq_214.fasta 
79560
Refseq • 942 views
18 months ago

RefSeq classifies each genome into one of the following assembly level categories: Complete Genome, Chromosome, Scaffold, Contig. Because your code downloads only complete genomes, the number of downloaded sequences is smaller than the number provided in the RefSeq statistics. Of note, the majority of RefSeq genomes (~80%) are assembled at the scaffold and contig levels.
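To see how the assemblies in your summary file split across these levels, a minimal sketch along the following lines could help. It assumes the same column layout your own awk command relies on (column 11 is the version status, column 12 the assembly level); count_by_level is a hypothetical helper name, not part of any NCBI tool.

```shell
# Tally "latest" assemblies per assembly level (column 12) in an
# assembly summary file. Comment lines starting with '#' are skipped.
# Assumption: column 11 = version status, column 12 = assembly level.
count_by_level() {
    awk -F '\t' '
        /^#/ { next }
        $11 == "latest" { n[$12]++ }
        END { for (level in n) printf "%s\t%d\n", level, n[level] }
    ' "$1" | sort
}

# Usage, assuming the summary file was already downloaded (step 1 above):
# count_by_level assembly_summary_refseq.txt
```

The output is one line per assembly level with its count, which shows directly how much the "Complete Genome" filter leaves out.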

For downloading RefSeq genomes I recommend using genome_updater. It is a bash script that allows you to download genomes from RefSeq or GenBank with many filters (e.g., according to different assembly levels or taxonomic units). The script tracks changes (it only downloads updated genomes since your last download), allows multithreading, and it has a file integrity check.
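If you would rather stay with the awk and wget approach from the question, a rough sketch of the same URL-building step without the assembly-level filter might look like this. It assumes column 20 holds the FTP directory path (as in the question's command) and that rows without a path carry the placeholder "na"; list_genomic_urls is a hypothetical helper name.

```shell
# Print the genomic FASTA URL for every "latest" assembly, regardless of
# assembly level. The file name is the last path component of the FTP
# directory followed by "_genomic.fna.gz", as in the question's step 3.
# Assumption: column 11 = version status, column 20 = FTP directory path,
# and "na" marks a missing path.
list_genomic_urls() {
    awk -F '\t' '
        /^#/ { next }
        $11 == "latest" && $20 != "na" {
            n = split($20, parts, "/")
            print $20 "/" parts[n] "_genomic.fna.gz"
        }
    ' "$1"
}

# Usage, assuming the summary file is in the current directory:
# list_genomic_urls assembly_summary_refseq.txt > ftpfilepaths
# wget -i ftpfilepaths
```

The resulting list is much longer than the complete-genome one, so the download is correspondingly larger.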


Thank you very much for your answer, Andrzej. I also found this solution: ncbi-genome-download. Thanks for the link, I'm interested in testing it.

18 months ago
Michael 54k

That means some RefSeq sequences are not part of the set of RefSeq genomes. This is entirely expected.

From the release notes:

2.2 Molecule Types Included

The RefSeq release includes genomic, transcript, and protein sequence data; however, these molecule types are not provided for all organisms and the sequences provided may not be complete or comprehensive for some species.

Transcript RefSeq records may represent protein-coding transcripts or non-coding RNA products; these records are currently only provided for eukaryotic species.

Genomic RefSeq records are provided when a sufficient quantity of genomic sequence data is available in DDBJ/EMBL/GenBank. Transcript and protein records may be provided for a species before genomic sequence data is available.
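Note also that the release figure counts organisms, while grep -c '>' counts FASTA records, so the two numbers measure different things. A rough sketch of counting organisms that have a genome assembly, assuming the taxid sits in column 6 and the species_taxid in column 7 of assembly_summary_refseq.txt (this layout is not stated in the thread) and using a hypothetical helper count_distinct_taxids:

```shell
# Count distinct taxonomy IDs among "latest" assemblies.
# $1: assembly summary file
# $2: column to count (assumed: 6 = taxid, 7 = species_taxid); default 6
count_distinct_taxids() {
    awk -F '\t' -v col="${2:-6}" '
        /^#/ { next }
        $11 == "latest" && !seen[$col]++ { n++ }
        END { print n + 0 }
    ' "$1"
}

# Usage, assuming the summary file is in the current directory:
# count_distinct_taxids assembly_summary_refseq.txt      # by taxid
# count_distinct_taxids assembly_summary_refseq.txt 7    # by species_taxid
```

Even across all assembly levels, this count can stay below the figure in the release notes, since organisms with only transcript or protein records have no genome assembly.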


Thank you very much for your answer, Michael. I understand now!
