How to create a Blast database of viruses ?
3
2
Entering edit mode
7.8 years ago
bell ▴ 20

Hi all,

For a metagenomic project a want to make a blast database of viruses. I dont want to blast my reads on the entire nt database. But I dont know how to make it. Downloading the result of a query like viruses[organism] from the nucleotide database of NCBI is impossible, due to the weight of the data. Maybe there is a solution using the taxononomy files for extract sequences of the Fasta nt file available on the ncbi ftp ?

So could anyone give me a solution ?

Thank you very much !

blast sequence • 8.1k views
ADD COMMENT
4
Entering edit mode
7.8 years ago

Download the VRL division of genbank ftp://ftp.ncbi.nih.gov/genbank/gbvrl*.seq.gz and index it with blast

ADD COMMENT
0
Entering edit mode

Hi Pierre, I am doing something similar at the moment, and this looks like a very good solution. Maybe we should mention that the data is in genbank format (obviously) and needs to be converted to fasta before making a blastdb. However, when using BioPerl SeqIO to convert the fasta headers look like this:

>AB000048 Feline panleukopenia virus gene for nonstructural protein 1, complete cds, isolate: 483.

So, no gi's here, but they would be needed to assign taxids for metagenomics, any quick fix to keep the gi?

ADD REPLY
2
Entering edit mode

@Michael use awk?

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/gbvrl27.seq.gz" | gunzip -c | awk -f jeter.awk

>gi:422089830|Hepatitis C virus isolate V2401 NS5AB replicase gene, partial cds.
tggattaacgaggactgctccacgccatgctccggctcgtggctaaaggatgtttgggac
tggatatgcacggtgctgtctgatttcagaacctggctccagtccaagctcctgccgcgg
ytaccgggagtccctttcttctcgtgtcaacgtggatataagggagtctggcggggygac
ggcatcatgcaaaccacctgttcatgtggggcacagatcaccggacatgtcaaaaacggc
ADD REPLY
0
Entering edit mode

+1 for awk regex tricks :-)

ADD REPLY
0
Entering edit mode

what is jeter.awk ?!

ADD REPLY
0
Entering edit mode

it's the awk script above. jeter in french means "trashed" (a name I use for temporary files)

ADD REPLY
0
Entering edit mode

Hi, it doesn't seem to work with some sequences, i mean some sequences after the scritp just appear empty..

ADD REPLY
0
Entering edit mode

I am just wondering what the correct number of entries is:

  • the gbvrl files converted to fasta files contain 1584206 entries
  • the ncbi query 'Viruses[Organism] NOT cellular organisms[ORGN] NOT AC_000001:AC_999999[pacc]' yields 1741019
  • when I download this query via efetch I retrieve only 1737392 entries
ADD REPLY
3
Entering edit mode
7.8 years ago
Peter 6.0k

Related to Pierre's answer, if you want complete virus genomes, there are FASTA files available at ftp://ftp.ncbi.nih.gov/genomes/Viruses/

You can download complete genomes via NCBI Entrez but that is more problematic, see http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html

ADD COMMENT
3
Entering edit mode
7.8 years ago
Carlos Borroto ★ 2.0k

I was involved in a project where keeping an updated viral database was key to our success. We went the route recommended by Pierre. It was extremely hard. The files linked by Pierre need to be first downloaded and then transformed from Genbank to fasta format, easily doable with any bio*(python, perl, ruby, etc) but painfully slow. You also need to remove redundancy or your results will be extremely noisy.

If I had to start over I would do something smarter. I would keep a list of GIs known to be from viral sequences and use 'blastn' option -gilist with the nt/nr databases provided by NCBI. See http://www.ncbi.nlm.nih.gov/books/NBK1763/. This option limits results to hits matching GIs in the provided list. Keeping such a list updated will definitely be easier than house-keeping a custom blast database.

ADD COMMENT

Login before adding your answer.

Traffic: 2574 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6