I have a single fasta file containing 3136 partial 18S rRNA gene sequences (on average 235 nucleotides long and never longer than 260 or shorter than 185 nt) for which I would like to get the top 10 blast hits against the nt database in table format. Preferably I would also like to get their source organism taxonomy (GB field) in the same table, but let's consider that optional for now.
I am contemplating what would be the best strategy for this.
I would prefer not to have to download the entire nr database to my computer.
Therefore I consider currently two strategies:
- Use the NCBI BLAST+ suite as described here: http://www.ncbi.nlm.nih.gov/books/NBK1763/ under BLAST+ remote blast, the issue is that here it is described for only a single sequence submission and for my number of sequences quite the number of RID's get automatically generated, which I do not want to format manually afterwards. But I am afraid that there's no way to avoid this?
- Alternatively, I could use bioperl or biopython to run a remoteblast loop and try to format the output appropriately
Which of these two strategies would be most efficient?
Any pointers are warmly welcome...
edit: the sequences are fungal 18S reads of which consensus sequences for OTU's were obtained with mothur. Half of the original dataset could not be classified deeper than "eukaryotes" with SILVA or RDP. Therefore, I am looking to the closest BLAST matches in the nt database, maybe I could also consider the env_nt database but first I'd like to check out the nt.