Guys, I know this question got similarities with older posts here, but I'm stuck on getting a solution.
I've got this situation: I've downloaded from ncbi assembly a bacterial genus specific database made with the nucleotide sequences of genes, using the "CDS from genomic" option. While getting the final .fa file, I've seen there's no taxonomy info into the fasta header of each sequence. What I want to do is to include the taxonomy into the header, just the rank under the genus (so, species and type-strain) using the protein ID of each sequence, in order to have taxonomy info while using "stitle" output option in blastn against this database.
Is there any kind of pre-written script who does this task?
Actually input (from .fa) :
>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
Expected output (taxonomy retrieved by protein_id) :
>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
post some example data and expected output.
Question updated. I've appended taxonomy at end of the header, but actually is not a priority itself.
Except species scientific name, i do not see any thing additional taxonomic information in fasta header? Am I missing some thing here?
Yes, what I want actually is scientific name of each record. I want to do that in order to get scientific name using "stitle" parameter option in blastn output, in a database created from this edited fasta file.
bbmap has a function called:
gi2taxid.sh
. Use it withprefix=t
option so that earlier header is retained and tax id is appended. But the tool needs few downloads. If it is a single species, you can try following: (this appends the hard coded species name to each fasta header):Thanks, but it's not a single species but a single genus. It is possible to download a taxid file specific to the genus "Bifidobacterium" , instead of downloading the whole taxid file from ncbi?
well, without downloading information, i have a small script to download genus (till genus), using protein IDs from sequence posted above:
Thanks a lot. Using the service you've linked, I've found a quick way to solve my task.
Using a while read loop, I've used the protein id as a variable, and imported it every time into a different wget query. Later I'll post the whole script below. Anyway, thanks a lot man.
You can get the same with locus ID as well with NCBI entrez utils: