Program Or Script To Go From Partial Protein Sequence To DNA
3
6
11.0 years ago
Niek De Klein ★ 2.6k

Hi,

I have partial protein sequences (one domain from larger proteins) and I want to find the DNA sequence that encodes that part of each protein. I know each protein's GI number, so I can get the DNA sequence for the complete protein from NCBI.

So for example the partial sequence is:

GKYVRYTPEQVEALERLYHDCPKPSSIRRQQLIRECPILSNIEPKQIKVWFQNRRCREKQRKEASRLQAVNRKL
TAMNKLLMEENDRLQKQVSQLVH


The whole protein is:

MAMSCKDGKLGCLDNGKYVRYTPEQVEALERLYHDCPKPSSIRRQQLIRECPILSNIEPKQIKVWFQNRRCRE
KQRKEASRLQAVNRKLTAMNKLLMEENDRLQKQVSQLVHENSYFRQHTPNPSLPAKDTSCESVVTSGQHQLA
SQNPQRDASPAGLLSIAEETLAEFLSKATGTAVEWVQMPGMKPGPDSIGIIAISHGCTGVAARACGLVGLEPTR
VAEIVKDRPSWFRECRAVEVMNVLPTANGGTVELLYMQLYA


And the DNA sequence on NCBI is (I truncated both the protein and the DNA, so they aren't exactly the same as on NCBI):

ACATCTCTTCTTCATCCTCTCTTCTACTTTCCTCTTTCCTCTTTCCTTCTTCGAATAAATTTCTAGGGTT
TTTCTTTTCTCTAAAGTTTTCATTTTTATTTCAATGAGAGCTCGAAGAAGGAGAATATGGGTTTGAGAAC
TGATAATATTATGGCTTCGTTTCGAGGTGGAATCGGGGTTTCTAATGGCTGAGTCAACTCGGTGATTCTG
TGTTATAGTCACGAGCAAATATAAAAAAGTTTGTAACTTTCTTGTTTTTTTAGGTGTGTGTGTTCAGAGA
AAAGGTCGAATCTTTTTTCGGTGTTTGTAAAAGGGAAAGTTGTAATCTTAAAGTCTGTTTTTCTTTCTTG
TGTTTTGGTATTTAGCTCATAAAAGCCGAGGAGTAATATAAAGGATAGGTTTTGTCTTTGTGTGCCCTTT
TGAGATTGCATGAAGAAAAAAAGCCTCTAGTGTGTTTTGAAGGAAACAGAATTCGATATTTATGCGGTAA
TGTGATTTGTGAAGCTACTCCAAGTGCTTAGGATTTGAGATGGCTTAGATTTGGTAGTTGTTCAAGCTGT
GGAGTTTGTGGTGGACTAAGAAGCTCTCTGTCTCCTTTGTTTAGTATGTTGTGGTTATCTTCTGTTTAGA
AGGATTTAGTTATTCATCTGGAGGGGGTAGTAGGGTCATTTGTGAGATTCTGTGATTGTGAAATAAGAAG
AGTTTTGCTGAGGAGTAATGGCAATGTCTTGCAAGGATGGTAAGTTGGGATGTTTGGATAATGGGAAGTA
TGTGAGGTATACACCTGAACAAGTTGAAGCACTTGAGAGGCTTTATCATGACTGTCCTAAACCGAGTTCT
ATTCGCCGTCAGCAGTTGATCAGAGAGTGTCCTATTCTCTCTAACATTGAGCCTAAACAGATCAAAGTGT
GGTTTCAGAACCGAAGATGTAGAGAGAAACAAAGGAAAGAGGCTTCACGGCTTCAAGCTGTGAATCGGAA
GTTGACGGCAATGAACAAGCTCTTGATGGAGGAGAATGACAGGTTGCAGAAGCAAGTGTCACAGCTGGTC
CATGAAAACAGCTACTTCCGTCAACATACTCCAAATCCTTCACTCCCAGCTAAAGACACAAGCTGTGAAT
CGGTGGTGACGAGTGGTCAGCACCAATTGGCATCTCAAAATCCTCAGAGAGATGCTAGTCCTGCAGGACT
TTTGTCCATTGCAGAAGAAACTTTAGCAGAGTTTCTTTCAAAGGCAACTGGAACCGCTGTTGAGTGGGTT
CAGATGCCTGGAATGAAGCCTGGTCCGGATTCCATTGGAATCATCGCTATTTC


And I want to know what the DNA sequence is for the partial protein sequence.

I think I know how I could program this, but I was wondering if you know of an existing script or program that already does it?

Thanks, Niek

translation protein translation coordinates • 3.7k views
7
11.0 years ago
Neilfws 49k

There are lots of ways that you could address this problem, depending on the input files and their format.

Given that you have, as specified in the question, peptide and nucleotide sequences in FASTA format, then bl2seq is a good starting point. You can run:

bl2seq -j peptide.fa -i nucleotide.fa -p blastx -o blast.out


The query line in each HSP will then carry nucleotide coordinates:

Query: 763  GKYVRYTPEQVEALERLYHDCPKPSSIRRQQLIRECPILSNIEPKQIKVWFQNRRCREKQ 942
            GKYVRYTPEQVEALERLYHDCPKPSSIRRQQLIRECPILSNIEPKQIKVWFQNRRCREKQ
Sbjct: 1    GKYVRYTPEQVEALERLYHDCPKPSSIRRQQLIRECPILSNIEPKQIKVWFQNRRCREKQ 60

Query: 943  RKEASRLQAVNRKLTAMNKLLMEENDRLQKQVSQLVH 1053
            RKEASRLQAVNRKLTAMNKLLMEENDRLQKQVSQLVH
Sbjct: 61   RKEASRLQAVNRKLTAMNKLLMEENDRLQKQVSQLVH 97


You could then parse the BLAST report using e.g. BioPerl's Bio::SearchIO to get the query HSP start/end (763-1053). BioPerl's Bio::SeqIO could then be used to get the DNA sequence, something like this:

use strict;
use Bio::SeqIO;

my $nuc = Bio::SeqIO->new(-file => "nucleotide.fa", -format => "fasta");
# Print the DNA covered by the HSP (1-based, inclusive coordinates)
print $nuc->next_seq->subseq(763, 1053), "\n";


Using Bioperl (or the Bio* library of your choice) it's quite easy to write a single script that reads the sequences, runs bl2seq, parses the result and extracts the DNA subsequence.
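
For the simplest case, where the peptide matches the translated DNA exactly (forward strand, no introns, no mismatches), the lookup can also be sketched without BLAST. This is a minimal Python sketch and not part of the workflow above: `translate` and `locate_peptide` are hypothetical helper names, and the codon table is assumed to be the standard genetic code (NCBI table 1).

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate dna starting at its first base; unrecognised codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def locate_peptide(dna, peptide):
    """Try forward frames 0, 1, 2 in turn and return the 1-based inclusive
    (start, end) nucleotide coordinates of an exact match, or None."""
    for frame in range(3):
        pos = translate(dna[frame:]).find(peptide)
        if pos != -1:
            start = frame + 3 * pos + 1
            return start, start + 3 * len(peptide) - 1
    return None


if __name__ == "__main__":
    # Toy sequence (made up for illustration): two leading bases, then a short ORF.
    dna = "ACATGGGGAAGTATGTGAGGTAA"
    coords = locate_peptide(dna, "GKYVR")
    print(coords)
    if coords:
        print(dna[coords[0] - 1:coords[1]])
```

An exact string search like this fails as soon as there is a mismatch, a reverse-strand gene or an intron, which is where an alignment tool such as bl2seq is the better choice.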

0

Do you know if bl2seq has a different name in blast 2.2.24+, or if they removed it?

0

I'm yet to upgrade to 2.2.24, so I don't know for sure. As far as I know it is still in the package. If not, the older versions are still available to download.

4
11.0 years ago
Mary 11k

You want a reverse translator? Or maybe I'm not understanding. If you have the DNA record you could run BLAST 2 seq (bl2seq) perhaps?

But if a reverse translator is what you need, here's one: http://www.bioinformatics.org/sms2/rev_trans.html Just be sure to set the codon usage table appropriately. And, of course, watch out for splicing, which will be more or less of an issue depending on your species.
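
To illustrate why a reverse translator can only give a likely DNA sequence rather than the original one, here is a small Python sketch that counts how many different DNA sequences encode a given peptide. It assumes the standard genetic code (NCBI table 1), and `count_back_translations` is a hypothetical name chosen for this example.

```python
from collections import Counter
from math import prod

# Amino acids of the standard genetic code (NCBI table 1), one letter per codon
# in TCAG order; counting letters gives the number of codons per amino acid.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
DEGENERACY = Counter(AMINO_ACIDS)


def count_back_translations(peptide):
    """Number of distinct DNA sequences that translate to peptide.
    Residues outside the standard code contribute 0."""
    return prod(DEGENERACY[aa] for aa in peptide.upper())


if __name__ == "__main__":
    for pep in ("MW", "GK", "L"):
        print(pep, count_back_translations(pep))
```

The count grows multiplicatively with peptide length, so for a whole domain the codon usage table only picks the most probable candidate.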

There are also more complex back translation strategies like PATH.

But I have a feeling I'm not understanding the issue.

1

Your first suggestion of bl2seq is a good starting point. Essentially, this is a coordinate mapping problem (peptide -> nucleotide).
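
The mapping itself is simple arithmetic once the start of the coding sequence is known. Below is a minimal Python sketch, assuming an unspliced coding sequence on the forward strand and 1-based inclusive coordinates; `peptide_to_nucleotide` is a hypothetical helper name. The example values are inferred from the sequences in the question (the ATG appears to start at nucleotide 718 of the truncated DNA, and the domain covers residues 16 to 112 of the full protein), so treat them as an assumption.

```python
def peptide_to_nucleotide(cds_start, aa_start, aa_end):
    """Map 1-based inclusive residue positions to 1-based inclusive nucleotide
    coordinates, given the position of the first base of the start codon.
    Assumes no introns and a forward-strand coding sequence."""
    nt_start = cds_start + 3 * (aa_start - 1)
    nt_end = cds_start + 3 * aa_end - 1
    return nt_start, nt_end


if __name__ == "__main__":
    # Residues 16-112 of a CDS assumed to begin at nucleotide 718.
    print(peptide_to_nucleotide(718, 16, 112))  # (763, 1053)
```

The result agrees with the 763-1053 range reported in the BLAST output above. With spliced genes this arithmetic no longer holds, and an alignment against the genomic DNA is needed instead.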

0

Ah, that would make a good tag: coordinate-mapping. I think that's a frequent issue. Usually people I encounter want it from assembly to assembly, but it seems like a useful item. I'll add that.

3
11.0 years ago
Rm 8.1k

Wise2 from EMBL-EBI aligns a protein sequence to the corresponding DNA sequence, or to genomic DNA.

http://www.ebi.ac.uk/Tools/Wise2/index.html

In your case, run Wise2 with the partial protein sequence and the full DNA sequence to get the DNA region that aligns only to the partial protein sequence.