Question: Get taxonomy into a CDS from genomic custom database
0
gravatar for Shred
16 months ago by
Shred150
Shred150 wrote:

Guys, I know this question got similarities with older posts here, but I'm stuck on getting a solution.

I've got this situation: I've downloaded from ncbi assembly a bacterial genus specific database made with the nucleotide sequences of genes, using the "CDS from genomic" option. While getting the final .fa file, I've seen there's no taxonomy info into the fasta header of each sequence. What I want to do is to include the taxonomy into the header, just the rank under the genus (so, species and type-strain) using the protein ID of each sequence, in order to have taxonomy info while using "stitle" output option in blastn against this database.

Is there any kind of pre-written script who does this task?

Actually input (from .fa) :

>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS]
 ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
 CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
 GATCCTGGTCGGC

Expected output (taxonomy retrieved by protein_id) :

 >lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
 ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
 CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
 GATCCTGGTCGGC
ADD COMMENTlink modified 16 months ago • written 16 months ago by Shred150

post some example data and expected output.

ADD REPLYlink written 16 months ago by cpad011212k

Question updated. I've appended taxonomy at end of the header, but actually is not a priority itself.

ADD REPLYlink written 16 months ago by Shred150

Except species scientific name, i do not see any thing additional taxonomic information in fasta header? Am I missing some thing here?

ADD REPLYlink written 16 months ago by cpad011212k

Yes, what I want actually is scientific name of each record. I want to do that in order to get scientific name using "stitle" parameter option in blastn output, in a database created from this edited fasta file.

ADD REPLYlink modified 16 months ago • written 16 months ago by Shred150

bbmap has a function called: gi2taxid.sh. Use it with prefix=t option so that earlier header is retained and tax id is appended. But the tool needs few downloads. If it is a single species, you can try following: (this appends the hard coded species name to each fasta header):

$ sed '/>/ s/.*/& [ Bifidobacterium bifidum ]/' input.fa

>lcl|QDIX01000078.1_cds_PVV28006.1_1 [locus_tag=DD715_09695] [protein=molecular chaperone DnaK] [protein_id=PVV28006.1] [location=6826..8709] [gbkey=CDS] [ Bifidobacterium bifidum ]
ATGGCACGTGCAGTTGGTATTGATCTGGGTACTACGAATTCCTGCATCGCGACCCTTGAAGGTGGCGAGC
CCACCGTCATCGTGAACGCCGAAGGCGCGCGCACCACGCCGTCCGTGGTGGCGTTCAGTAAGTCCGGCGA
GATCCTGGTCGGC
ADD REPLYlink modified 16 months ago • written 16 months ago by cpad011212k

Thanks, but it's not a single species but a single genus. It is possible to download a taxid file specific to the genus "Bifidobacterium" , instead of downloading the whole taxid file from ncbi?

ADD REPLYlink written 16 months ago by Shred150
1

well, without downloading information, i have a small script to download genus (till genus), using protein IDs from sequence posted above:

$ wget -qO- http://togows.org/entry/ncbi-protein/PVV28006.1/taxonomy | awk -F ";" '{print $NF}'
 Bifidobacterium.
ADD REPLYlink written 16 months ago by cpad011212k

Thanks a lot. Using the service you've linked, I've found a quick way to solve my task.

$ wget -qO- http://togows.org/entry/ncbi-protein/PVV28006.1/definition | sed 's/^.*[//;s/]$//

Using a while read loop, I've used the protein id as a variable, and imported it every time into a different wget query. Later I'll post the whole script below. Anyway, thanks a lot man.

ADD REPLYlink written 16 months ago by Shred150

You can get the same with locus ID as well with NCBI entrez utils:

$ esearch -db protein -query "DD715_09695" | efetch -format docsum | xtract -pattern DocumentSummary -element Title

molecular chaperone DnaK [Bifidobacterium bifidum]
ADD REPLYlink modified 16 months ago • written 16 months ago by cpad011212k
0
gravatar for Shred
16 months ago by
Shred150
Shred150 wrote:

Here's my solution. Actually, it's better using that on Blast results, rather than on the assembled fasta, because it's written in bash, so it will become extremely slow with huge file.

while read line ; do
    PROTEINID=$( echo $line | sed -e 's/.*protein_id\=\(.*\)\]\ \[location.*/\1/' )
    SCIENTNAME=$( wget -qO- "http://togows.org/entry/ncbi-protein/${PROTEINID}/organism" ) 
    echo "$line" "$SCIENTNAME" 
done <$1
ADD COMMENTlink written 16 months ago by Shred150
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1332 users visited in the last hour