Hi all, I know this question is similar to older posts here, but I'm still stuck finding a solution.
Here is my situation: I downloaded a genus-specific bacterial database from NCBI Assembly, made of the nucleotide sequences of genes, using the "CDS from genomic" option. In the resulting .fa file, I noticed there is no taxonomy information in the FASTA header of each sequence. What I want to do is add the taxonomy to each header, only the ranks below the genus (so species and type strain), looked up from the protein ID of each sequence, so that the taxonomy shows up when I use the "stitle" output option in blastn against this database.
Is there an existing script that does this?
Current input (from the .fa file):
>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
Expected output (taxonomy retrieved via the protein_id):
>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
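To make the request concrete, here is a minimal Python sketch of the header rewrite only. It assumes the protein_id to organism mapping has already been obtained some other way (for example, from a table exported from NCBI); the lookup itself is not shown. The names `annotate_headers` and `taxonomy` are hypothetical, chosen just for this illustration.

```python
import re

# Matches the "[protein_id=...]" tag found in "CDS from genomic" headers.
PROTEIN_ID_RE = re.compile(r"\[protein_id=([^\]]+)\]")


def annotate_headers(lines, taxonomy):
    """Yield FASTA lines, appending ' [ organism ]' to each header whose
    protein_id is a key of `taxonomy` (an assumed, pre-built dict).
    Sequence lines and headers without a known protein_id pass through
    unchanged."""
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            match = PROTEIN_ID_RE.search(line)
            organism = taxonomy.get(match.group(1)) if match else None
            if organism:
                line = f"{line} [ {organism} ]"
        yield line


if __name__ == "__main__":
    # Assumed mapping; in practice it would cover every protein_id in the file.
    taxonomy = {"PVV28006.1": "Bifidobacterium bifidum"}
    record = [
        ">lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] "
        "[protein=molecular chaperone DnaK] [protein_id=PVV28006.1] "
        "[location=6826..8709] [gbkey=CDS]",
        "ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC",
        "GATCCTGGTCGGC",
    ]
    for out_line in annotate_headers(record, taxonomy):
        print(out_line)
```

For a real file, the same generator could be fed an open file handle and its output written line by line to a new .fa file before rebuilding the BLAST database. The tag is appended at the end of the header so the existing bracketed fields stay untouched.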