Question: Pipeline Or Tool For Going From Assembled Genome Fasta & Gff Files Directly To Blast Database?
gravatar for Josh Herr
7.1 years ago by
Josh Herr5.7k
University of Nebraska
Josh Herr5.7k wrote:

I'm hoping this doesn't come off as a lazy question, but I am wondering if anyone is aware of a tool or pipeline out there that directly takes reference genome (supercontigs, contigs, scaffold, what have you) and an annotation file (GFF, etc.) and goes "directly" to formatted BLAST database? I've searched around and haven't been able to find anything able to do this.

I'm in need of taking a LOT of microbial genomes and creating a BLAST database from all of the coding sequences. For this in silico experiment I'm benchmarking against NCBI databases so I want the contents to be different and can't just use other databases. I'm looking for a pipeline for specifically re-running the database creation portion over and over again.

This is basically what I am looking for would be script -input *.fasta -annotation *.gff -output blast.database. I know I can do this using numerous tools (formatdb, etc.) by stitching them together in my own pipeline, but I wanted to see if someone had already developed a tool before I invested any time -- any input is appreciated!

fasta fastq blast • 4.7k views
ADD COMMENTlink modified 7.1 years ago by Michael Dondrup47k • written 7.1 years ago by Josh Herr5.7k

Note formatdb is replaced by makeblastdb in BLAST+ and using either tool is it trivial to make a nucleotide BLAST database of your assembly FASTA file(s).

Could you clarify if you want a gene nucleotide BLAST database (which means first processing the annotation to make a FASTA file the gene coding sequences), or a gene protein BLAST database (which also means translating into amino acids)?

ADD REPLYlink written 7.1 years ago by Peter5.8k

Thanks for the correction on the BLAST+ commands -- you saved me some trouble!

I'm hoping to have the capability of both nucleotide and protein for the analysis -- so I would need both. I think I might have to make my own pipeline.

Thanks for the input +1.

ADD REPLYlink written 7.1 years ago by Josh Herr5.7k
gravatar for Michael Dondrup
7.1 years ago by
Bergen, Norway
Michael Dondrup47k wrote:

These 2 or 3 steps could easily be put in a script:

  1. 'Cut out' desired sequences from Fasta file using coordinates in GFF
  2. Translate fasta running through EMBOSS Transeq, if protein database
  3. Run makeblastdb

You could use or modify a script like the example for step 1 (I would replace the simplistic parser with Bio::Tools::GFF though for parsing the GFF as long as I can for more control, e.g. which feature type to choose, which attributes to put in the fasta header) to make a Fasta file from your assembly + coordinates in the GFF file. In fact the step of making a Fasta file of the coding sequences is the only step where there is a tiny bit of tweaking necessary. One could also use R+GenomicRanges and create Views on the Fasta sequence and write them as Fasta. I am sure there are python and ruby and biohaskel scripts for this as well.

When creating the AA database, you might need more control over the type of feature used from the annotation. The annotation should contain CDS and only those should be used. If CDS are discontinuous, the resulting sequences must be joined. That is something, the above script doesn't do. With bacteria you might be lucky and have all CDS continuous, then you should be fine.

ADD COMMENTlink written 7.1 years ago by Michael Dondrup47k

Thanks for your input, Michael Dondrup! I'm still working out my algorithm for flexibility for nucleotide vs. protein database creation, so your comments are greatly appreciated and your links are novel for me. Thanks again!

ADD REPLYlink modified 7.1 years ago • written 7.1 years ago by Josh Herr5.7k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1247 users visited in the last hour