Retrieve genbank viral genomes
4.5 years ago
erwan.scaon ▴ 830

Hi dear community !

PS : I did search for this first and came across two Biostars posts (How to choose NCBI viral database? and How to create a Blast database of viruses ?), but I still need some clarification.

For a metagenomic analysis, I'd like to retrieve all bacterial, fungal and viral genomes locally. I am therefore targeting NCBI GenBank (and not RefSeq).

Short description of the process :

In the NCBI GenBank directory, ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/, we can see bacteria/, fungi/ and viral/. Applying the recipes to the bacteria/ and fungi/ directories was pretty straightforward :
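The recipe itself is not reproduced in this post. A commonly used approach, sketched here under assumptions, filters assembly_summary.txt and builds one download URL per assembly. The column positions (11 = version status, 12 = assembly level, 20 = FTP path) and the function name list_genomic_fna_urls are assumptions for illustration; check them against the header line of your own copy of the file.

```shell
# Sketch: list *_genomic.fna.gz URLs for complete, latest assemblies.
# Assumed layout: tab-separated, col 11 = version status,
# col 12 = assembly level, col 20 = FTP path of the assembly directory.
list_genomic_fna_urls() {
    summary="$1"
    awk -F '\t' '
        !/^#/ && $11 == "latest" && $12 == "Complete Genome" {
            n = split($20, parts, "/")   # last path component is the assembly name
            print $20 "/" parts[n] "_genomic.fna.gz"
        }' "$summary"
}

# Usage (not run here, needs network access):
#   wget ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/bacteria/assembly_summary.txt
#   list_genomic_fna_urls assembly_summary.txt > bacteria_urls.txt
#   wget -i bacteria_urls.txt
```

Dropping the "Complete Genome" condition would also keep draft assemblies, which may or may not be wanted for a metagenomic reference set.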

Things get more complicated for the ftp://ftp.ncbi.nlm.nih.gov/genomes/genbank/viral/ directory :

• It does have an assembly_summary.txt file, but it only contains 3 records (for uncultured human fecal virus). There is nothing else relevant in this directory.
• If you browse the FTP site, you will find ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/. It seems to be a legacy directory, but it does contain a lot of data, so let's try our luck. There is no assembly_summary.txt in here, but there is an all.fna.tar.gz file, which looks like what we are looking for.
• This file contains 4374 directories (each corresponding to a different virus); inside those directories there is a total of 5840 FNA files (some viruses have more than 1 associated sequence).
• Retrieve sequences : wget ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz; tar -zxvf all.fna.tar.gz; find . -name '*.fna' -exec cat {} \; > ncbi_genome_viruses.fasta;
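The directory and file counts above can be re-checked after extraction with a small helper. This is a minimal sketch: the function names count_fna_tree and concat_fna are made up for illustration, and it assumes the archive was extracted into its own directory so that unrelated files are not picked up by find.

```shell
# Count top-level directories and .fna files under an extraction directory.
count_fna_tree() {
    root="$1"
    dirs=$(find "$root" -mindepth 1 -maxdepth 1 -type d | wc -l)
    files=$(find "$root" -type f -name '*.fna' | wc -l)
    # $(( )) strips the padding some wc implementations add
    echo "$((dirs)) directories, $((files)) fna files"
}

# Concatenate every .fna file under root into one FASTA file.
# Keep the output file outside root so it is never read back in.
concat_fna() {
    root="$1"
    out="$2"
    find "$root" -type f -name '*.fna' -exec cat {} + > "$out"
}

# Usage (not run here, needs the downloaded archive):
#   mkdir viruses && tar -zxf all.fna.tar.gz -C viruses
#   count_fna_tree viruses
#   concat_fna viruses ncbi_genome_viruses.fasta
```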

Let's compare this ncbi_genome_viruses.fasta file with the RefSeq virus :

• Access RefSeq for viruses : ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/
• When you concatenate the viral.1.1 and viral.2.1 genomic.fna files, you obtain a file containing 9334 sequences.
• Comparing this "RefSeq" file with the "genome" file : 9334 vs 5840 sequences, 5719 vs 4220 complete genome sequences. The "genome" file was supposed to contain more sequences, not fewer. So there is an issue here.
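Counts like these can be obtained from the FASTA headers alone. The sketch below assumes that a "complete genome sequence" is one whose header line contains the words "complete genome"; the post does not say how the numbers were actually derived, and fasta_counts is a hypothetical helper name.

```shell
# Print "<total sequences> <headers mentioning 'complete genome'>" for a FASTA file.
fasta_counts() {
    fasta="$1"
    # grep -c exits non-zero on zero matches but still prints 0, hence "|| true"
    total=$(grep -c '^>' "$fasta" || true)
    complete=$(grep '^>' "$fasta" | grep -c -i 'complete genome' || true)
    echo "$((total)) $((complete))"
}

# Usage (file names are examples):
#   fasta_counts ncbi_genome_viruses.fasta
#   fasta_counts refseq_viruses.fasta
```

Segmented viruses are labelled "complete sequence" per segment, so they are not counted as complete genomes by this criterion.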

Last resource available to my knowledge : https://www.ncbi.nlm.nih.gov/genome/viruses/

• "Complete RefSeq release of viral and viroid sequences" <=> the link we previously used for RefSeq sequences (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral/)
• "Accession list of all viroid genomes" (not interested)
• "Accession list of all viral genomes", which points to a file containing 114949 entries (accession numbers).

Final questions / options :

Best regards


viral.1.1 & viral.2.1 contain entries such as:

>ref|NC_021094.1| White clover cryptic virus 2 isolate IPP_Lirepa segment RNA 1, complete sequence
>ref|NC_021095.1| White clover cryptic virus 2 isolate IPP_Lirepa segment RNA 2, complete sequence
>ref|NC_021096.1| Red clover cryptic virus 2 isolate IPP_Nemaro segment RNA 1, complete sequence
>ref|NC_021097.1| Red clover cryptic virus 2 isolate IPP_Nemaro segment RNA 2, complete sequence
>ref|NC_021098.1| Hop trefoil cryptic virus 2 isolate IPP_GelbSK segment RNA 1, complete sequence
>ref|NC_021099.1| Hop trefoil cryptic virus 2 isolate IPP_GelbSK segment RNA 2, complete sequence


"Accession list of all viral genomes" has that many entries, but it's a neighbours file. When you run sort -u on the first column you're left with 9,096 entries. Meanwhile EBI lists 4,026 complete virus genomes.

I think you should be perfectly fine with ftp://ftp.ncbi.nlm.nih.gov/genomes/Viruses/all.fna.tar.gz. It's not a legacy directory: that file was last updated today.

4.5 years ago
erwan.scaon ▴ 830

When you go to https://www.ncbi.nlm.nih.gov/genome/viruses => "Accession list of all viral genomes" => taxid10239.nbr, this is indeed a neighbours file. I think it contains a little more than 9096 entries, because some lines have multiple accession numbers :

• awk -F "\t" '!/^#/ {print $1}' taxid10239.nbr > ncbi_genome_viruses_allhost.txt;
• sed -i 's/,/\n/g' ncbi_genome_viruses_allhost.txt;
• cat ncbi_genome_viruses_allhost.txt | sort | uniq > ncbi_genome_viruses_allhost_AN.txt;
• sed -i '/^$/d' ncbi_genome_viruses_allhost_AN.txt;
• wc -l ncbi_genome_viruses_allhost_AN.txt; => 9216
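The gap between the two counts (unique first-column values versus unique accessions once comma-separated entries are split) can be shown in a single pass. This is a sketch only: nbr_accession_counts is a made-up name, and it assumes the neighbours file is tab-separated with comment lines starting with "#". It uses tr instead of sed 's/,/\n/g' because the \n replacement is not supported by every sed.

```shell
# Report unique first-column values and unique accessions in a neighbours file.
nbr_accession_counts() {
    nbr="$1"
    # Unique values of column 1, taken as-is (comma-joined entries count once)
    rows=$(awk -F '\t' '!/^#/ && $1 != "" {print $1}' "$nbr" | sort -u | wc -l)
    # Same column, but with comma-separated accessions split onto their own lines
    accs=$(awk -F '\t' '!/^#/ {print $1}' "$nbr" | tr ',' '\n' | sed '/^$/d' | sort -u | wc -l)
    echo "$((rows)) unique first-column values, $((accs)) unique accessions"
}

# Usage:
#   nbr_accession_counts taxid10239.nbr
```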

Regarding the all.fna.tar.gz file, I still have some doubts, especially when comparing it to the RefSeq file :

Thus I plan to use the RefSeq file (ftp://ftp.ncbi.nlm.nih.gov/refseq/release/viral (*genomic.fna.gz)), since it seems to contain all entries in the genome file. But this still doesn't look right to me, since I was hoping for "a true" genome/GenBank file, i.e. a file with significantly more sequences than the RefSeq file.
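Whether the RefSeq file really covers every entry of the genome file can be checked by comparing accession sets. The sketch below assumes that each header carries a RefSeq-style accession such as NC_021094.1 and that matching on this pattern is sufficient; both helper names are hypothetical.

```shell
# Extract the sorted, unique RefSeq-style accessions (e.g. NC_021094.1)
# from the header lines of a FASTA file.
fasta_accessions() {
    grep '^>' "$1" | grep -o -E '[A-Z]{2}_[0-9]+\.[0-9]+' | sort -u
}

# Print accessions present in the first FASTA file but absent from the second.
missing_from_refseq() {
    genome="$1"
    refseq="$2"
    a=$(mktemp)
    b=$(mktemp)
    fasta_accessions "$genome" > "$a"
    fasta_accessions "$refseq" > "$b"
    comm -23 "$a" "$b"   # lines unique to the first sorted list
    rm -f "$a" "$b"
}

# Usage (file names are examples):
#   missing_from_refseq ncbi_genome_viruses.fasta refseq_viruses.fasta | wc -l
```

An empty result would support the idea that the RefSeq file is a superset. Note that a version bump (.1 vs .2) counts as a different accession here.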


Hi erwan.scaon,

Check out this research article ( https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3324519/ ). It has a compiled list of 32102 viral genomes from GenBank, which you can use directly. Hope that helps!


Hello bioinfo89

Just edited the link to be accessible!