Note: I am very new to bioinformatics!
I am on a Windows 11 machine using BLAST+ 2.15.0 to run blastp queries against a custom database of shotgun metagenomic data from this website: http://gigadb.org/dataset/100842
I am querying the 02_AnaerobicDigestion_GeneCatalog_gene.pep.fa file using blastp, and the results returned (to a .txt or .xml file) look like this:
I want to know what bacterial strain/species is associated with each hit, but all the subjects have an AD_gene_#### identifier (from the metagenome sequencing) instead of any kind of species/strain identifier.
I know that I should be able to collect protein sequences from the blastp results into a file, but I do not know how to do this.
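A minimal sketch of this collection step, assuming the blastp results were also saved in tabular form (`-outfmt 6`, where the second column is the subject ID) and that those IDs match the first word of each FASTA header in the gene catalog. The file names `blastp_hits.tsv` and `hit_sequences.fa` and both function names are hypothetical placeholders, not part of BLAST+.

```python
from pathlib import Path


def read_subject_ids(blast_tsv, sseqid_column=1):
    """Return the set of subject IDs found in a tabular (-outfmt 6) BLAST file."""
    ids = set()
    with open(blast_tsv) as handle:
        for line in handle:
            line = line.rstrip("\n")
            # Skip blank lines and comment lines (present with -outfmt 7).
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) > sseqid_column:
                ids.add(fields[sseqid_column])
    return ids


def extract_fasta_records(fasta_in, wanted_ids, fasta_out):
    """Copy FASTA records whose ID is in wanted_ids; return how many were written."""
    written = 0
    keep = False
    with open(fasta_in) as src, open(fasta_out, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # The record ID is the first word after '>'.
                header = line[1:].split()
                keep = bool(header) and header[0] in wanted_ids
                if keep:
                    written += 1
            if keep:
                dst.write(line)
    return written


if __name__ == "__main__":
    # Hypothetical file names; adjust to your own paths.
    hits = Path("blastp_hits.tsv")
    catalog = Path("02_AnaerobicDigestion_GeneCatalog_gene.pep.fa")
    if hits.exists() and catalog.exists():
        subject_ids = read_subject_ids(hits)
        count = extract_fasta_records(catalog, subject_ids, "hit_sequences.fa")
        print(f"Wrote {count} sequences to hit_sequences.fa")
```

If the custom database was built with `makeblastdb -parse_seqids`, `blastdbcmd` with `-db` and `-entry_batch` may be able to do the same extraction without any scripting, though that depends on how the database was made.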
I would then need to blastp these sequences against the non-redundant (nr) protein database and write a file that contains the taxonomy of the top blastp hit for each sequence.
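A rough sketch of the taxonomy step, assuming the second search is run with a custom tabular format such as `-outfmt "6 qseqid sseqid pident evalue bitscore staxids sscinames"`. Those format specifiers exist in BLAST+, but as far as I understand `sscinames` is only filled in when the taxonomy files (`taxdb`) are available alongside the database, and otherwise shows `N/A`. The file name `nr_hits.tsv` and the function names are hypothetical, and "top hit" here is taken to mean the highest bit score per query.

```python
# Column order must match the -outfmt string used for the search (an assumption).
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "staxids", "sscinames"]


def best_hit_per_query(blast_tsv):
    """Return {query ID: hit dict} keeping the highest-bitscore hit per query."""
    best = {}
    with open(blast_tsv) as handle:
        for line in handle:
            line = line.rstrip("\n")
            if not line or line.startswith("#"):
                continue
            fields = line.split("\t")
            if len(fields) < len(COLUMNS):
                continue  # skip malformed or truncated rows
            hit = dict(zip(COLUMNS, fields))
            current = best.get(hit["qseqid"])
            if current is None or float(hit["bitscore"]) > float(current["bitscore"]):
                best[hit["qseqid"]] = hit
    return best


def species_list(best_hits):
    """Return a sorted list of unique scientific names among the top hits."""
    names = set()
    for hit in best_hits.values():
        # A single hit can list several names separated by ';'.
        for name in hit["sscinames"].split(";"):
            name = name.strip()
            if name and name != "N/A":
                names.add(name)
    return sorted(names)


if __name__ == "__main__":
    import os

    # Hypothetical file name; adjust to your own path.
    if os.path.exists("nr_hits.tsv"):
        top_hits = best_hit_per_query("nr_hits.tsv")
        with open("species_list.txt", "w") as out:
            for name in species_list(top_hits):
                out.write(name + "\n")
```

The result would be one species name per line in `species_list.txt`. Note that a top hit in nr gives only a best guess at the source organism, since the same or a near-identical protein can occur in several species.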
At that point I don't need the amino acid sequence, just some kind of strain identifier that I can use to create a list of bacterial "species."
In summary, I want a list of bacterial species that contain a homolog of a protein of interest from a shotgun metagenome dataset.
I'm not sure how to get the output that I'm looking for and would appreciate any help!