Question: How to create a BLAST database of viruses?
bell20 wrote:

Hi all,

For a metagenomics project I want to make a BLAST database of viruses, because I don't want to BLAST my reads against the entire nt database, but I don't know how to build it. Downloading the result of a query like viruses[organism] from the NCBI nucleotide database is not feasible because of the size of the data. Is there perhaps a way to use the taxonomy files to extract the viral sequences from the nt FASTA file available on the NCBI FTP site?

Could anyone suggest a solution?

Thank you very much !

modified 5.5 years ago by Carlos Borroto1.8k • written 5.5 years ago by bell20
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum123k wrote:

Download the VRL division of GenBank (ftp://ftp.ncbi.nih.gov/genbank/gbvrl*.seq.gz) and index it with BLAST.

written 5.5 years ago by Pierre Lindenbaum123k
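
For the indexing step, a minimal sketch with the BLAST+ `makeblastdb` program might look like the following. It assumes the GenBank records have already been converted to FASTA; the function name `build_viral_db`, the file names and the database title are placeholders, not something taken from the answer above.

```shell
# build_viral_db FASTA DB_NAME
# Builds a nucleotide BLAST database from a FASTA file.
# Hypothetical helper; the title string is an arbitrary label.
build_viral_db() {
  makeblastdb -in "$1" -dbtype nucl -out "$2" -title "GenBank VRL viruses"
}

# Example usage (file names are assumptions):
#   build_viral_db gbvrl.fasta viruses
#   blastn -query reads.fasta -db viruses -outfmt 6 -out hits.tsv
```

Wrapping the call in a function keeps the database name in one place, so the same name can be reused for the later `blastn -db` call.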

Hi Pierre, I am doing something similar at the moment, and this looks like a very good solution. Maybe we should mention that the data is in GenBank format (obviously) and needs to be converted to FASTA before making a BLAST database. However, when I use BioPerl SeqIO for the conversion, the FASTA headers look like this:

>AB000048 Feline panleukopenia virus gene for nonstructural protein 1, complete cds, isolate: 483. 

So there are no GIs here, but they would be needed to assign taxids for metagenomics. Is there a quick fix to keep the GI?

modified 5.5 years ago • written 5.5 years ago by Michael Dondrup46k

@Michael use awk?

 

$ curl -s "ftp://ftp.ncbi.nih.gov/genbank/gbvrl27.seq.gz" | gunzip -c  | awk -f jeter.awk

>gi:422089830|Hepatitis C virus isolate V2401 NS5AB replicase gene, partial cds.
tggattaacgaggactgctccacgccatgctccggctcgtggctaaaggatgtttgggac
tggatatgcacggtgctgtctgatttcagaacctggctccagtccaagctcctgccgcgg
ytaccgggagtccctttcttctcgtgtcaacgtggatataagggagtctggcggggygac
ggcatcatgcaaaccacctgttcatgtggggcacagatcaccggacatgtcaaaaacggc
modified 5.5 years ago • written 5.5 years ago by Pierre Lindenbaum123k

+1 for awk regex tricks :-)

written 4.0 years ago by biocyberman800

What is jeter.awk?!

written 3.8 years ago by Quak300

It's the awk script above. "jeter" in French means "to throw away" (a name I use for temporary files).

written 3.8 years ago by Pierre Lindenbaum123k
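
The contents of jeter.awk are not reproduced in this copy of the thread. As a rough stand-in (not Pierre's original script), a small awk program wrapped in a shell function could produce the same `>gi:<GI>|<definition>` headers. It assumes the VERSION line carries a `GI:` field, as the records quoted here did, and falls back to the accession.version when it does not. The name `gb2fasta` is made up for this sketch.

```shell
# gb2fasta: read GenBank flat-file records on stdin, write FASTA on stdout.
# Header format follows the example output in this thread: >gi:<GI>|<definition>
gb2fasta() {
  awk '
    # DEFINITION text starts at column 13 and may wrap onto indented lines.
    /^DEFINITION/ { def = substr($0, 13); indef = 1; next }
    indef && /^ /  { line = $0; sub(/^ +/, "", line); def = def " " line; next }
    { indef = 0 }
    # VERSION holds accession.version and, in older records, GI:<number>.
    /^VERSION/ {
      id = $2
      for (i = 3; i <= NF; i++) if ($i ~ /^GI:/) id = "gi:" substr($i, 4)
      next
    }
    # ORIGIN opens the sequence block; records without one print nothing.
    /^ORIGIN/ { print ">" id "|" def; inseq = 1; next }
    /^\/\//   { inseq = 0; id = ""; def = ""; next }
    # Sequence lines: strip position numbers and spaces.
    inseq { gsub(/[0-9 ]/, ""); print }
  '
}

# Example usage, mirroring the pipeline above:
#   curl -s "ftp://ftp.ncbi.nih.gov/genbank/gbvrl27.seq.gz" | gunzip -c | gb2fasta > gbvrl27.fasta
```

Because the header is only printed when an ORIGIN line is seen, a record that has no sequence block is skipped rather than written as an empty entry.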

Hi, it doesn't seem to work with some sequences: after running the script, some sequences just appear empty.

written 2.6 years ago by luisitosrt0

I am just wondering what the correct number of entries is:

  • the gbvrl files converted to fasta files contain 1584206 entries
  • the ncbi query 'Viruses[Organism] NOT cellular organisms[ORGN] NOT AC_000001:AC_999999[pacc]' yields 1741019
  • when I download this query via efetch I retrieve only 1737392 entries 
modified 5.5 years ago • written 5.5 years ago by Michael Dondrup46k
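
To check numbers like these, it can help to count records the same way in each file and then compare identifier lists. A minimal sketch, assuming plain FASTA input and the same identifier scheme in both files; `count_fasta_entries` and `fasta_ids` are illustrative names, not existing tools:

```shell
# Count FASTA records (header lines) on stdin; prints 0 for empty input.
count_fasta_entries() {
  awk '/^>/ { n++ } END { print n + 0 }'
}

# Print one sorted, unique identifier per record:
# the first whitespace-delimited word of each header, without the ">".
fasta_ids() {
  awk '/^>/ { sub(/^>/, "", $1); print $1 }' | sort -u
}

# Example usage (file names are assumptions):
#   count_fasta_entries < gbvrl.fasta
#   fasta_ids < gbvrl.fasta  > gbvrl.ids
#   fasta_ids < efetch.fasta > efetch.ids
#   comm -13 gbvrl.ids efetch.ids   # identifiers found only in the efetch download
```

Comparing identifier lists shows which records account for a difference, which a bare count cannot.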
Scotland, UK
Peter5.8k wrote:

Related to Pierre's answer, if you want complete virus genomes, there are FASTA files available at ftp://ftp.ncbi.nih.gov/genomes/Viruses/

You can also download complete genomes via NCBI Entrez, but that is more problematic; see http://blastedbio.blogspot.co.uk/2013/11/entrez-trouble-with-chimeras.html

written 5.5 years ago by Peter5.8k
Washington Metropolitan Area
Carlos Borroto1.8k wrote:

I was involved in a project where keeping an updated viral database was key to our success. We went the route recommended by Pierre, and it was extremely hard. The files Pierre linked first need to be downloaded and then converted from GenBank to FASTA format, which is easily doable with any Bio* toolkit (Biopython, BioPerl, BioRuby, etc.) but painfully slow. You also need to remove redundancy, or your results will be extremely noisy.
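
As a first pass at the redundancy problem, exact duplicate sequences can be dropped with awk. This is only a sketch: it keeps the first record for each distinct sequence (compared case-insensitively), holds every distinct sequence in memory, and does nothing about near-identical sequences, which would need a clustering tool. `dedup_fasta` is a hypothetical helper name.

```shell
# dedup_fasta: read FASTA on stdin, write FASTA on stdout,
# keeping only the first record for each exact sequence.
# Sequences are written unwrapped, on a single line.
dedup_fasta() {
  awk '
    # Print the pending record unless its sequence was already seen.
    function emit(   key) {
      if (header == "") return
      key = toupper(seq)
      if (!(key in seen)) {
        seen[key] = 1
        print header
        print seq
      }
    }
    /^>/ { emit(); header = $0; seq = ""; next }
    { seq = seq $0 }
    END { emit() }
  '
}

# Example usage (file names are assumptions):
#   dedup_fasta < gbvrl.fasta > gbvrl.nr.fasta
```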

If I had to start over, I would do something smarter: keep a list of GIs known to be from viral sequences and use the blastn option -gilist with the nt/nr databases provided by NCBI. See http://www.ncbi.nlm.nih.gov/books/NBK1763/. This option limits results to hits matching GIs in the provided list. Keeping such a list updated will definitely be easier than housekeeping a custom BLAST database.

modified 5.5 years ago • written 5.5 years ago by Carlos Borroto1.8k
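
The `-gilist` approach could be sketched like this. The GI list is a plain text file with one GI per line; here it is derived from FASTA headers of the form `>gi:<GI>|...`, an assumption based on the awk output earlier in the thread. `fasta_to_gilist` and `blast_viral_only` are made-up names, and `-outfmt 6` (tabular output) is just one convenient choice.

```shell
# fasta_to_gilist: read FASTA with ">gi:<GI>|..." headers on stdin,
# print the unique GI numbers, one per line, in numeric order.
fasta_to_gilist() {
  awk -F '[:|]' '/^>gi:/ { print $2 }' | sort -un
}

# blast_viral_only QUERY_FASTA GI_LIST OUT_TSV
# Searches the NCBI nt database but reports only hits whose GI is in GI_LIST.
blast_viral_only() {
  blastn -query "$1" -db nt -gilist "$2" -outfmt 6 -out "$3"
}

# Example usage (file names are assumptions):
#   fasta_to_gilist < gbvrl.fasta > viral.gi
#   blast_viral_only reads.fasta viral.gi hits.tsv
```

With this layout only the small GI list has to be refreshed when new viral records appear; the nt database itself is used as distributed.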