Pipeline Or Tool For Going From Assembled Genome Fasta & Gff Files Directly To Blast Database?
1
2
Entering edit mode
8.2 years ago
Josh Herr 5.7k

I'm hoping this doesn't come off as a lazy question, but I am wondering if anyone is aware of a tool or pipeline out there that directly takes reference genome (supercontigs, contigs, scaffold, what have you) and an annotation file (GFF, etc.) and goes "directly" to formatted BLAST database? I've searched around and haven't been able to find anything able to do this.

I'm in need of taking a LOT of microbial genomes and creating a BLAST database from all of the coding sequences. For this in silico experiment I'm benchmarking against NCBI databases so I want the contents to be different and can't just use other databases. I'm looking for a pipeline for specifically re-running the database creation portion over and over again.

This is basically what I am looking for would be script -input *.fasta -annotation *.gff -output blast.database. I know I can do this using numerous tools (formatdb, etc.) by stitching them together in my own pipeline, but I wanted to see if someone had already developed a tool before I invested any time -- any input is appreciated!

fasta fastq blast • 5.1k views
3
Entering edit mode

Note formatdb is replaced by makeblastdb in BLAST+ and using either tool is it trivial to make a nucleotide BLAST database of your assembly FASTA file(s).

Could you clarify if you want a gene nucleotide BLAST database (which means first processing the annotation to make a FASTA file the gene coding sequences), or a gene protein BLAST database (which also means translating into amino acids)?

1
Entering edit mode

Thanks for the correction on the BLAST+ commands -- you saved me some trouble!

I'm hoping to have the capability of both nucleotide and protein for the analysis -- so I would need both. I think I might have to make my own pipeline.

Thanks for the input +1.

1
Entering edit mode
8.2 years ago

These 2 or 3 steps could easily be put in a script:

1. 'Cut out' desired sequences from Fasta file using coordinates in GFF
2. Translate fasta running through EMBOSS Transeq, if protein database
3. Run makeblastdb

You could use or modify a script like the example http://www.bioperl.org/wiki/Getting_Fasta_sequences_from_a_GFF for step 1 (I would replace the simplistic parser with Bio::Tools::GFF though for parsing the GFF as long as I can for more control, e.g. which feature type to choose, which attributes to put in the fasta header) to make a Fasta file from your assembly + coordinates in the GFF file. In fact the step of making a Fasta file of the coding sequences is the only step where there is a tiny bit of tweaking necessary. One could also use R+GenomicRanges and create Views on the Fasta sequence and write them as Fasta. I am sure there are python and ruby and biohaskel scripts for this as well.

When creating the AA database, you might need more control over the type of feature used from the annotation. The annotation should contain CDS and only those should be used. If CDS are discontinuous, the resulting sequences must be joined. That is something, the above script doesn't do. With bacteria you might be lucky and have all CDS continuous, then you should be fine.

0
Entering edit mode

Thanks for your input, Michael Dondrup! I'm still working out my algorithm for flexibility for nucleotide vs. protein database creation, so your comments are greatly appreciated and your links are novel for me. Thanks again!