For a metagenomics project, I want to build a BLAST database of viruses, because I don't want to BLAST my reads against the entire nt database. But I don't know how to build it. Downloading the result of a query like viruses[organism] from the NCBI nucleotide database is impossible because of the size of the data. Maybe there is a solution that uses the taxonomy files to extract sequences from the nt FASTA file available on the NCBI FTP site?
So could anyone give me a solution?
Thank you very much!
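The taxonomy-file idea raised in the question could be sketched roughly as follows. This is only a minimal sketch under several assumptions that are not confirmed in this thread: that the taxonomy dump provides a `nodes.dmp`-style file (taxid and parent taxid as the first two `|`-separated fields), that a two-column GI-to-taxid table (such as the historical `gi_taxid_nucl.dmp`) is available, and that the nt FASTA headers start with `gi|<number>|`. The function names are hypothetical.

```python
from collections import defaultdict

VIRUSES_TAXID = 10239  # NCBI Taxonomy ID of the Viruses root


def descendant_taxids(nodes_lines, root=VIRUSES_TAXID):
    """Return root plus every taxid below it, from nodes.dmp-style lines."""
    children = defaultdict(list)
    for line in nodes_lines:
        fields = line.split("|")
        if len(fields) < 2:
            continue
        taxid, parent = int(fields[0]), int(fields[1])
        if taxid != parent:  # the top node lists itself as its parent
            children[parent].append(taxid)
    found, stack = {root}, [root]
    while stack:
        for child in children[stack.pop()]:
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def gis_for_taxids(gi_taxid_lines, taxids):
    """Collect GIs (as strings) whose taxid is in taxids.

    Assumes two whitespace-separated columns per line: gi, taxid.
    """
    gis = set()
    for line in gi_taxid_lines:
        parts = line.split()
        if len(parts) == 2 and int(parts[1]) in taxids:
            gis.add(parts[0])
    return gis


def filter_fasta(fasta_lines, gis):
    """Yield only the lines of records whose header starts with gi|<gi>|."""
    keep = False
    for line in fasta_lines:
        if line.startswith(">"):
            fields = line[1:].split("|")
            keep = len(fields) > 1 and fields[0] == "gi" and fields[1] in gis
        if keep:
            yield line
```

Each function accepts any iterable of lines, so open file handles can be passed directly and the nt file is streamed rather than loaded into memory; only the viral taxids and GIs are kept in sets.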
Hi Pierre, I am doing something similar at the moment, and this looks like a very good solution. Maybe we should mention that the data is in GenBank format (obviously) and needs to be converted to FASTA before making a BLAST database. However, when I use BioPerl SeqIO for the conversion, the FASTA headers look like this:
So there are no GIs here, but they would be needed to assign taxids for metagenomics. Is there any quick fix to keep the GI?
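As an alternative to awk or BioPerl, a GenBank-to-FASTA conversion that keeps the GI could be sketched in plain Python. This is a minimal sketch, not the script from the answer: it assumes each record's `VERSION` line carries a `GI:<number>` field (as older GenBank flat files did) and that the sequence sits under `ORIGIN`. The function name and the `gi|<gi>|<accession.version>` header layout are illustrative choices, not a format required by any tool.

```python
import re


def genbank_to_fasta(lines):
    """Yield (header, sequence) pairs from GenBank flat-file lines."""
    version = gi = None
    definition, seq, section = [], [], None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue
        if line.startswith("//"):  # end of record: emit it and reset state
            if version:
                prefix = f"gi|{gi}|" if gi else ""
                title = " ".join(definition)
                yield f">{prefix}{version} {title}", "".join(seq)
            version = gi = None
            definition, seq, section = [], [], None
        elif line[0].isspace():  # continuation of the current section
            if section == "DEFINITION":
                definition.append(line.strip())
            elif section == "ORIGIN":
                # drop the position numbers and spaces, keep the bases
                seq.append(re.sub(r"[^A-Za-z]", "", line))
        else:  # a new top-level keyword such as LOCUS or VERSION
            section = line.split()[0]
            if section == "DEFINITION":
                definition = [line[12:].strip()]
            elif section == "VERSION":
                fields = line.split()
                version = fields[1] if len(fields) > 1 else None
                match = re.search(r"GI:(\d+)", line)
                gi = match.group(1) if match else None
```

Records without a `GI:` field are still written, just without the `gi|...|` prefix, so they are easy to spot afterwards.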
@Michael use awk?
+1 for awk regex tricks :-)
What is jeter.awk?!
It's the awk script above. "Jeter" in French means "to throw away" (a name I use for temporary files).
Hi, it doesn't seem to work with some sequences; I mean, some sequences just appear empty after running the script.
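To see how many records are affected, a quick check along these lines might help. It is only a diagnostic sketch with a hypothetical function name: it counts the FASTA records and lists the headers that have no sequence lines after them, without explaining why they are empty.

```python
def fasta_summary(lines):
    """Return (record_count, headers_of_empty_records) for FASTA lines."""
    total, empty = 0, []
    header, length = None, 0
    for raw in lines:
        line = raw.strip()
        if line.startswith(">"):
            if header is not None:  # close the previous record
                total += 1
                if length == 0:
                    empty.append(header)
            header, length = line[1:], 0
        elif header is not None:
            length += len(line)
    if header is not None:  # close the last record
        total += 1
        if length == 0:
            empty.append(header)
    return total, empty
```

Passing an open file handle of the converted FASTA file returns the total number of entries together with the headers that need a closer look in the original GenBank file.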
I am just wondering what the correct number of entries is: