Get taxonomy into a CDS from genomic custom database
1
0
Entering edit mode
5.9 years ago
Shred ★ 1.4k

Guys, I know this question got similarities with older posts here, but I'm stuck on getting a solution.

I've got this situation: I've downloaded from ncbi assembly a bacterial genus specific database made with the nucleotide sequences of genes, using the "CDS from genomic" option. While getting the final .fa file, I've seen there's no taxonomy info into the fasta header of each sequence. What I want to do is to include the taxonomy into the header, just the rank under the genus (so, species and type-strain) using the protein ID of each sequence, in order to have taxonomy info while using "stitle" output option in blastn against this database.

Is there any kind of pre-written script who does this task?

Actually input (from .fa) :

>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS]
 ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
 CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
 GATCCTGGTCGGC

Expected output (taxonomy retrieved by protein_id) :

 >lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
 ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
 CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
 GATCCTGGTCGGC
ncbi taxonomy • 1.7k views
ADD COMMENT
0
Entering edit mode

post some example data and expected output.

ADD REPLY
0
Entering edit mode

Question updated. I've appended taxonomy at end of the header, but actually is not a priority itself.

ADD REPLY
0
Entering edit mode

Except species scientific name, i do not see any thing additional taxonomic information in fasta header? Am I missing some thing here?

ADD REPLY
0
Entering edit mode

Yes, what I want actually is scientific name of each record. I want to do that in order to get scientific name using "stitle" parameter option in blastn output, in a database created from this edited fasta file.

ADD REPLY
0
Entering edit mode

bbmap has a function called: gi2taxid.sh. Use it with prefix=t option so that earlier header is retained and tax id is appended. But the tool needs few downloads. If it is a single species, you can try following: (this appends the hard coded species name to each fasta header):

$ sed '/>/ s/.*/& [ Bifidobacterium bifidum ]/' input.fa

>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
ADD REPLY
0
Entering edit mode

Thanks, but it's not a single species but a single genus. It is possible to download a taxid file specific to the genus "Bifidobacterium" , instead of downloading the whole taxid file from ncbi?

ADD REPLY
1
Entering edit mode

well, without downloading information, i have a small script to download genus (till genus), using protein IDs from sequence posted above:

$ wget -qO- http://togows.org/entry/ncbi-protein/PVV28006.1/taxonomy | awk -F ";" '{print $NF}'
 Bifidobacterium.
ADD REPLY
0
Entering edit mode

Thanks a lot. Using the service you've linked, I've found a quick way to solve my task.

$ wget -qO- http://togows.org/entry/ncbi-protein/PVV28006.1/definition | sed 's/^.*[//;s/]$//

Using a while read loop, I've used the protein id as a variable, and imported it every time into a different wget query. Later I'll post the whole script below. Anyway, thanks a lot man.

ADD REPLY
0
Entering edit mode

You can get the same with locus ID as well with NCBI entrez utils:

$ esearch -db protein -query "DD715_09695" | efetch -format docsum | xtract -pattern DocumentSummary -element Title

molecular chaperone DnaK [Bifidobacterium bifidum]
ADD REPLY
0
Entering edit mode
5.9 years ago
Shred ★ 1.4k

Here's my solution. Actually, it's better using that on Blast results, rather than on the assembled fasta, because it's written in bash, so it will become extremely slow with huge file.

while read line ; do
    PROTEINID=$( echo $line | sed -e 's/.*protein_id\=\(.*\)\]\ \[location.*/\1/' )
    SCIENTNAME=$( wget -qO- "http://togows.org/entry/ncbi-protein/${PROTEINID}/organism" ) 
    echo "$line" "$SCIENTNAME" 
done <$1
ADD COMMENT

Login before adding your answer.

Traffic: 2415 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6