Many thanks for suggestions!
Today I've tried to make script for the conversion of the spliced transcripts to the protein sequence using above bio pythone code. Here I passed def function for the conversion of the RNA to protein (It works good).
This script works good for short sequences provided in first example but made wrong results for the long chain with many introns like in the below example. Please keep in mind that the "exons_seq" variable which consist of list of only sequences of every exons also include some empty elements which might be source of error.
from Bio import SeqIO
def translate_dna(sequence):
gencode = { 'ATA':'I',
'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T',
'ACT':'T', 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S',
'AGT':'S', 'AGA':'R', 'AGG':'R', 'CTA':'L', 'CTC':'L', 'CTG':'L',
'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 'CAC':'H',
'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R',
'CGT':'R', 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A',
'GCC':'A', 'GCG':'A', 'GCT':'A', 'GAC':'D', 'GAT':'D', 'GAA':'E',
'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 'TCA':'S',
'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L',
'TTG':'L', 'TAC':'Y', 'TAT':'Y', 'TAA':'', 'TAG':'', 'TGC':'C',
'TGT':'C', 'TGA':'', 'TGG':'W', }
#print sequence
proteinseq = ''
for n in range(0,len(sequence),3):
if gencode.has_key(sequence[n:n+3]) == True:
proteinseq += gencode[sequence[n:n+3]]
print proteinseq
#slicing using biopython- for testing
handle = open("data.txt", "r")
lines = list(SeqIO.parse(handle, "fasta"))
chain = lines[0]
lines = lines[1:]
last_start=0
exons = []
for line in lines :
start = chain.seq.find(line.seq)
if(start > 0) :
print("Intron %s has been detected in %i position" % (line.seq, start+1))
exons.append([chain.seq[last_start:start]])
last_start = start + len(line.seq)
exons.append([chain.seq[last_start:]])
otvet2=""
exons_seq = [' '.join([c.tostring() for c in lst]) for lst in exons]
xz="".join([otvet2+s for s in exons_seq])
translate_dna(xz)
here you can see real data.txt
>Rosalind_5942
ATGCGAGGTGTCGGTAACCCAGTCCGATGAAGGAGGAGCCGATAGATCTTCCCCGGGATT
ACATAAGGTTCTCTCCGGGACCATTCAAGTCCGCATTAGCGCGAGGCGGCTTACAACGTC
AAAGAATTGGTTACTTCGCTTTGGAAGCAATTCTGTTGCATGTGCCCCTGGTAATTAGGA
AAGTCTCTGAGGCCGGGCAATGCGACCAGCGCGATGATGGAAGATCATCCACCACTCTTG
GGATTTCAGAAGGATCAAGGTAACCGGGTTATGCGTTTATGTGTTGGTAATGCGTGTAAC
CGTCGGCCATGGCTCTAACATTGAAGAGGAATAGGGCTAGGCTGGCCACACTAGCTGTCG
AAAGTATTGTTGCTCTTGTCGGTCCCTGATAAGCCTCCATTCGATGGATAGTGCGACGTG
TTGGCAGCAACTACTCACCTAACTATCTACCTTGACGAGATAGATGTTTTTTGTTCTCCC
GCATCCTACAGGCGCACTTTGGTACGCAGTCTCAGCACTATTGTCCACAGCAGTGTAGTA
TATTGCCATGCAAAAGGTTCATGTCGATAGAGTCTATGTTAGGTCTGACAACCTGCCAAT
GTGGAGATTCTTTGCTTCTACGGCGATAAGTGCCAAATAGTTGGGCTAAACTGGCTTAGT
TTAGCCGAAGGATTGTCTGAGCTTTCTACTCGGTCAAAGGCGATGGGACCGTTCCCTTTG
ACGTCCGGTCCACTAGACAGAACCACGATGCCTACAACCCCTTGAGTACAATCAGCACAT
GATTCTGTCTTCTCCGCAGTTGGCTTTTAACAATCTTAGTAGTCCTGGAAGGCAATATTG
TATGGAGTCTTTAACGAGCGGGATGGCTCCTTGCTTCTTCTTGACGTCGAGTGTTACGTT
TGTTGTAGACTTGTTGCGAGTTGGGAATTCAATAGGGTACGTATATAG
>Rosalind_2822
TCCTACAGGCGCACTTTGGT
>Rosalind_8608
CTAAACTGGCTTAGTTTAGCCGAAGGAT
>Rosalind_5876
GAAGGCAATATTGTATGGAGTCTTT
>Rosalind_2322
TCCACCACTCTTGGGATTTCAGAAG
>Rosalind_9365
ACAACGTCAAAGAATTG
>Rosalind_9254
TGATTCTGTCTTCTC
>Rosalind_6250
TTCATGTCGATAGAGTCTATGTTAGGTCTGACAACCTGCCAATGTG
>Rosalind_4863
CGATGGGACCGTTCCCTTTGAC
>Rosalind_2573
TGGTAATTAGGAAAGTCTCTGAGGCCGGGCAATGC
>Rosalind_9081
TGTTGGCAGCAACTACT
>Rosalind_9138
TGGCTCTAACA
>Rosalind_8270
TTGCTTCTTCTTGACGTCGAGTGT
>Rosalind_8468
TGAAGGAGGAGCCGATAGATCTTCCCCGGGATTA
>Rosalind_9912
GTTGCTCTTGTCGGTCCCTGATAAGCCTCCAT
Firstly, could you delete this and just update your original post with it (or just make it a comment on my answer)? Secondly, when you say "made wrong results for the long chain with many introns like in the below", what exactly do you mean? It would help to see the output you obtain and the output you expect (yes, I could just run your example, but it's best to make everyone's life as easy as possible when asking for help).
Hi, this was the part of this problem http://rosalind.info/problems/splc/ It's works good with the sample output but when rosalind gave me real data the result (the protein sequence without exons) was wrong - unfortunately it could not possible to see right answer for each dataset
You mean the protein sequence after the introns were removed, rather than without exons. I updated my original answer with a version that would work with this and the previous dataset. Having said that, if you're trying to learn then you really should have seen what was amiss from the "intron %s has been detected in ..." output.
Many thanks again for suggestions! General algorithm is clean for me. I'll try to make such task without bio python