Translate DNA to Protein Using transeq
10.6 years ago
MAPK ★ 2.1k

I have a couple hundred DNA sequences to translate to protein. I used transeq from EMBOSS, which is quite simple, except that I was not able to get just the translated ORF running from the start amino acid (methionine) to the stop codon. In the example below, >comp2_seq1_2 contains the ORF I want to select. How do I set the transeq parameters so that I get only the following sequence from this frame (>comp2_seq1_2)?

MEIKDLADLYGDELKLTKLIRKSSA
RAQEIAKRQELDSSDGQIIDDHDQFYKDHKLLLLLFRILGVMPIERGKIGRITFSWKSIP
MIYAYVFYAVMTVIVVFVGIERVDILLNKSKKFDEYIYSIIFIIFLVPHFWIPFVQKDID
NFCTGYIIAHYRRLWLELSELLQSIGNAYARTYSTYSLFMITNITVATYGFISEIMEHGI
TFSFKEMGLIVASAYCMVLLYIYCDCSHKASDNIALRIQRSLIEIDLTTINLDTGKEIDM
FLTAIRLNPPTVSLQGYSDVDRKLITSSVSTIAIYLIVLLQFKISLLNMKSIE

I was able to get the preferred coding region translated with sixpack, but the problem with sixpack is that it only translates one sequence at a time (please correct me if that is not the case). Here is the DNA contig I translated in all six frames:

>comp2_seq1
GTAGGGTCGAGTGGCCAGCTCTGCCGATTTCAACAGGGCTAGGGGTAGGTTTATGTTTTTGTCGGTGCTAATGGTATAGC
TTTTGGGTTGAAAAATTATATTCATCATGGAAATAAAAGACCTGGCAGATTTATATGGCGACGAACTTAAATTAACAAAA
CTGATCAGAAAAAGCTCGGCACGTGCTCAGGAAATTGCTAAAAGACAAGAATTGGATTCTTCCGATGGACAAATCATCGA
TGACCATGATCAATTTTACAAAGATCACAAGTTGCTTCTTCTATTATTTAGAATACTGGGTGTGATGCCCATCGAACGTG
GAAAAATTGGAAGAATAACTTTTAGCTGGAAAAGCATTCCGATGATCTACGCATACGTCTTTTATGCTGTCATGACAGTT
ATAGTCGTCTTTGTGGGGATTGAAAGAGTCGACATATTGCTGAACAAGAGCAAAAAGTTTGACGAATATATCTACTCCAT
TATCTTCATTATTTTCTTGGTACCGCACTTCTGGATACCGTTCGTCCAGAAGGACATAGATAACTTTTGCACCGGATACATAATAGCC
CACTATAGAAGACTATGGCTAGAACTAAGCGAGCTCCTCCAGTCTATAGGAAATGCTTACGCAAGAACGTATTCTACGTA
TTCGCTGTTTATGATCACCAACATCACAGTTGCGACGTACGGCTTTATATCAGAAATCATGGAGCACGGGATAACGTTTT
CTTTCAAAGAAATGGGCCTTATTGTAGCCAGCGCGTATTGCATGGTGCTTCTGTACATCTACTGCGATTGCTCACATAAA
GCCTCAGATAATATAGCTCTGAGGATCCAGAGATCGCTAATAGAAATTGATCTAACTACGATTAATCTAGACACAGGAAA
AGAGATTGATATGTTTTTGACAGCAATTCGTCTAAATCCTCCAACAGTGTCTTTACAAGGCTATTCTGATGTTGATAGAA
AACTTATAACTTCAAGTGTTTCCACCATAGCGATCTACCTAATTGTCCTGCTACAATTCAAGATAAGTTTACTCAACATG
AAATCTATAGAATAAAGCTTAAATGATATATTTCTAGATTAAAATGCTAGATTATAGATTAAAATAAGTATGTAGGCACA
AGTTAAATGTTATTTTTGTTACAGGTTGATCTAATAAAGTTATCAACATAGCAATTCGAACGTTACAGCTAGCGCGGACA
CATGTCACATGGTTTTTGATTTACTCGATCTGTCTTCTATAAT

Any help would be appreciated. Thanks!

Here are the translated six frames:

>comp2_seq1_1
VGSSGQLCRFQQG*G*VYVFVGANGIAFGLKNYIHHGNKRPGRFIWRRT*INKTDQKKLG
TCSGNC*KTRIGFFRWTNHR*P*SILQRSQVASSII*NTGCDAHRTWKNWKNNF*LEKHS
DDLRIRLLCCHDSYSRLCGD*KSRHIAEQEQKV*RIYLLHYLHYFLGTALLDTVRPEGHR
*LLHRIHNSPL*KTMARTKRAPPVYRKCLRKNVFYVFAVYDHQHHSCDVRLYIRNHGARD
NVFFQRNGPYCSQRVLHGASVHLLRLLT*SLR*YSSEDPEIANRN*SNYD*SRHRKRD*Y
VFDSNSSKSSNSVFTRLF*C**KTYNFKCFHHSDLPNCPATIQDKFTQHEIYRIKLK*YI
SRLKC*IID*NKYVGTS*MLFLLQVDLIKLST*QFERYS*RGHMSHGF*FTRSVFYN
>comp2_seq1_2
*GRVASSADFNRARGRFMFLSVLMV*LLG*KIIFIMEIKDLADLYGDELKLTKLIRKSSA
RAQEIAKRQELDSSDGQIIDDHDQFYKDHKLLLLLFRILGVMPIERGKIGRITFSWKSIP
MIYAYVFYAVMTVIVVFVGIERVDILLNKSKKFDEYIYSIIFIIFLVPHFWIPFVQKDID
NFCTGYIIAHYRRLWLELSELLQSIGNAYARTYSTYSLFMITNITVATYGFISEIMEHGI
TFSFKEMGLIVASAYCMVLLYIYCDCSHKASDNIALRIQRSLIEIDLTTINLDTGKEIDM
FLTAIRLNPPTVSLQGYSDVDRKLITSSVSTIAIYLIVLLQFKISLLNMKSIE*SLNDIF
LD*NARL*IKISM*AQVKCYFCYRLI**SYQHSNSNVTASADTCHMVFDLLDLSSIX
>comp2_seq1_3
RVEWPALPISTGLGVGLCFCRC*WYSFWVEKLYSSWK*KTWQIYMATNLN*QN*SEKARH
VLRKLLKDKNWILPMDKSSMTMINFTKITSCFFYYLEYWV*CPSNVEKLEE*LLAGKAFR
*STHTSFMLS*QL*SSLWGLKESTYC*TRAKSLTNISTPLSSLFSWYRTSGYRSSRRT*I
TFAPDT**PTIEDYG*N*ASSSSL*EMLTQERILRIRCL*SPTSQLRRTALYQKSWSTG*
RFLSKKWALL*PARIAWCFCTSTAIAHIKPQII*L*GSRDR**KLI*LRLI*TQEKRLIC
F*QQFV*ILQQCLYKAILMLIENL*LQVFPP*RST*LSCYNSR*VYST*NL*NKA*MIYF
*IKMLDYRLK*VCRHKLNVIFVTG*SNKVINIAIRTLQLARTHVTWFLIYSICLL*X
>comp2_seq1_4
IIEDRSSKSKTM*HVSALAVTFELLC**LY*INL*QK*HLTCAYILILIYNLAF*SRNIS
FKLYSIDFMLSKLILNCSRTIR*IAMVETLEVISFLSTSE*PCKDTVGGFRRIAVKNISI
SFPVSRLIVVRSISISDLWILRAILSEALCEQSQ*MYRSTMQYALATIRPISLKENVIPC
SMISDIKPYVATVMLVIINSEYVEYVLA*AFPIDWRSSLSSSHSLL*WAIMYPVQKLSMS
FWTNGIQKCGTKKIMKIME*IYSSNFLLLFSNMSTLSIPTKTTITVMTA*KTYA*IIGML
FQLKVILPIFPRSMGITPSILNNRRSNL*SL*N*SWSSMICPSEESNSCLLAIS*ARAEL
FLISFVNLSSSPYKSARSFISMMNIIFQPKSYTISTDKNINLPLALLKSAELATRPY
>comp2_seq1_5
YRRQIE*IKNHVTCVRASCNVRIAMLITLLDQPVTKITFNLCLHTYFNL*SSILI*KYII
*ALFYRFHVE*TYLEL*QDN*VDRYGGNT*SYKFSINIRIAL*RHCWRI*TNCCQKHINL
FSCV*INRS*INFY*RSLDPQSYII*GFM*AIAVDVQKHHAIRAGYNKAHFFERKRYPVL
HDF*YKAVRRNCDVGDHKQRIRRIRSCVSISYRLEELA*F*P*SSIVGYYVSGAKVIYVL
LDERYPEVRYQENNEDNGVDIFVKLFALVQQYVDSFNPHKDDYNCHDSIKDVCVDHRNAF
PAKSYSSNFSTFDGHHTQYSK**KKQLVIFVKLIMVIDDLSIGRIQFLSFSNFLSTCRAF
SDQFC*FKFVAI*ICQVFYFHDEYNFSTQKLYH*HRQKHKPTPSPVEIGRAGHSTLX
>comp2_seq1_6
L*KTDRVNQKPCDMCPR*L*RSNCYVDNFIRSTCNKNNI*LVPTYLF*SII*HFNLEIYH
LSFIL*ISC*VNLS*IVAGQLGRSLWWKHLKL*VFYQHQNSLVKTLLEDLDELLSKTYQS
LFLCLD*S*LDQFLLAISGSSELYYLRLYVSNRSRCTEAPCNTRWLQ*GPFL*KKTLSRA
P*FLI*SRTSQL*CW*S*TANT*NTFLRKHFL*TGGARLVLAIVFYSGLLCIRCKSYLCP
SGRTVSRSAVPRK**R*WSRYIRQTFCSCSAICRLFQSPQRRL*LS*QHKRRMRRSSECF
SS*KLFFQFFHVRWASHPVF*IIEEATCDLCKIDHGHR*FVHRKNPILVF*QFPEHVPSF
F*SVLLI*VRRHINLPGLLFP**I*FFNPKAIPLAPTKT*TYP*PC*NRQSWPLDPT

I think we need to see the DNA sequence of comp2_seq1_2 to answer this?


Thanks for replying. I have revised the question with more details and the DNA sequence I used to translate.

10.6 years ago
Neilfws 49k

I don't think that transeq is the appropriate tool in this case.

You state that you are looking for "the best ORF." However, transeq has no concept of ORFs. It merely translates, by default in all frames.

You can supply additional options to transeq, such as the desired frame or a region defined by start and stop. But it will not "find" an ORF for you.
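As a rough sketch of what those options look like, the two helper functions below wrap transeq with its `-frame` and `-regions` qualifiers. The function names, file names and coordinates are placeholders for illustration, not values taken from this thread, and you still have to know the frame or the coordinates beforehand.

```shell
# Sketch only: function names, file names and coordinates are placeholders.

# Translate every sequence in a FASTA file in one forward frame (1, 2 or 3).
translate_frame() {
    # $1 = input nucleotide FASTA, $2 = output protein FASTA, $3 = frame
    transeq -sequence "$1" -outseq "$2" -frame "$3"
}

# Translate only the region START-END (1-based nucleotide coordinates).
# -trim strips trailing '*' and 'X' characters from the translation.
translate_region() {
    # $1 = input FASTA, $2 = output FASTA, $3 = start, $4 = end
    transeq -sequence "$1" -outseq "$2" -regions "$3-$4" -trim
}
```

For example, `translate_frame myfastafile.fa frame2.pep 2` would translate frame 2 of every sequence in the file. Note that the same region coordinates are applied to every input sequence, so `translate_region` is only practical for single-sequence files.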

sixpack is somewhat better in that it has more options. For example, to output only ORFs that begin with Met:

sixpack -mstart myfastafile.fa

and if you examine the output from that you will see your "desired" ORF:

>comp2_seq1_2_ORF2  Translation of comp2_seq1 in frame 2, ORF 2, threshold 1, 318aa
MEIKDLADLYGDELKLTKLIRKSSARAQEIAKRQELDSSDGQIIDDHDQFYKDHKLLLLL
FRILGVMPIERGKIGRITFSWKSIPMIYAYVFYAVMTVIVVFVGIERVDILLNKSKKFDE
YIYSIIFIIFLVPHFWIPFVQKDIDNFCTGYIIAHYRRLWLELSELLQSIGNAYARTYST
YSLFMITNITVATYGFISEIMEHGITFSFKEMGLIVASAYCMVLLYIYCDCSHKASDNIA
LRIQRSLIEIDLTTINLDTGKEIDMFLTAIRLNPPTVSLQGYSDVDRKLITSSVSTIAIY
LIVLLQFKISLLNMKSIE

but again, the program cannot know that this is the "best orf you want to select".

I think you are correct in that if given a file with multiple sequences, sixpack will only process the first. So you'll need to search the Web for "split fasta" and choose a method that works for you - there are multiple options. One is csplit, but that limits you to files (as opposed to piping to/from STDOUT):

csplit myfastafile.fa '%^>%' '/^>/' '{*}'
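If you do need to read from a pipe, one alternative is a small awk script. This is only a sketch: the function name `split_fasta` and the output naming scheme (`seq_0001.fa`, `seq_0002.fa`, ...) are my own choices, not anything EMBOSS requires.

```shell
# Split a multi-FASTA stream read from stdin into one file per record.
split_fasta() {
    # $1 = output file prefix (defaults to the hypothetical "seq_")
    awk -v prefix="${1:-seq_}" '
        /^>/ {                      # a header line starts a new record
            if (out) close(out)     # close the previous file to free its handle
            out = sprintf("%s%04d.fa", prefix, ++n)
        }
        out { print > out }         # write header and sequence lines to the current file
    '
}
```

Usage would be something like `split_fasta seq_ < myfastafile.fa`, or with any command piped into it. Lines that appear before the first header are skipped.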

EMBOSS provides a number of tools for sequence translation of various kinds; see the EMBOSS application groups "Nucleic:translation" and "Nucleic:gene finding". In this particular case, given the automation requirements, you probably want to use 'getorf', since it processes multiple sequences and provides the ORF sequences as its primary output.
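A minimal sketch of a getorf call for this use case is shown here. The wrapper name `find_orfs`, the file names and the 300-nucleotide minimum size are assumptions chosen for illustration; `-find 1` asks for translations of regions between a start codon and a stop codon.

```shell
# Sketch: extract Met-to-stop ORF translations from every sequence in a FASTA file.
find_orfs() {
    # $1 = input nucleotide FASTA (may contain many sequences)
    # $2 = output protein FASTA
    # $3 = minimum ORF size in nucleotides (300 is an arbitrary example value)
    getorf -sequence "$1" -outseq "$2" -find 1 -minsize "${3:-300}"
}
```

Calling `find_orfs myfastafile.fa orfs.pep` would write all qualifying ORFs from all input sequences to one file. getorf reports every ORF that passes the size filter, so picking the one you consider "best" (for example the longest, or the one supported by BLAST) remains a separate step.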


Good tip, had forgotten about getorf.


Thanks everyone!


Thanks a lot! The ORF choice was based on BLAST output. Once I have split the FASTA file into multiple single-sequence FASTA files, how can I submit the sequences for batch translation? If I have to do it individually, I will need to spend days on this. Thanks again for your suggestions!


If you're running EMBOSS locally on a Linux or similar machine, processing each sequence is easy using e.g. a bash loop.
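Something along these lines should work, assuming the split files end in `.fa` and sit in one directory. The function name and the output suffixes (`.sixpack`, `.orfs.pep`) are made up for this sketch.

```shell
# Sketch: run sixpack on every single-sequence FASTA file in a directory.
run_sixpack_on_all() {
    # $1 = directory containing the split files, named *.fa
    for fa in "$1"/*.fa; do
        [ -e "$fa" ] || continue      # skip if the glob matched nothing
        base="${fa%.fa}"              # strip the .fa suffix for output names
        # -mstart: only report ORFs that begin with Met
        # outputs use suffixes other than .fa so a re-run does not pick them up
        sixpack -sequence "$fa" -mstart \
                -outfile "$base.sixpack" -outseq "$base.orfs.pep"
    done
}
```

Run it as `run_sixpack_on_all /path/to/split_files`. Each input then gets its own report file and its own ORF sequence file next to it.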


I think if you know your ORF from the BLAST output, you can use Biopython's translate function:

from Bio.Seq import translate