Question: Too many characters in strings containing long protein sequences?
8 months ago by
gaiboyan230
gaiboyan230 wrote:

Using Biopython, I parsed a FASTA file to get a list of protein sequences. Everything works, except when a sequence is long (some are over 1000 characters). Most of the sequence gets replaced by "...". How can I obtain the entire sequence in my output?

Currently my output looks like this (I pasted the first 3 lines):

MDSTLTASEIRQRFIDFFKRNEHTYVHSSATIPLDDPTLLFANAGMNQFKPIFL...VKN
MAAYKLVLIRHGESTWNLENRFSCWYDADLSPAGHEEAKRGGQALRDAGYEFDI...AKK
MATLSLTVNSGDPPLGALLAVEHVKDDVSISVEEGKENILHVSENVIFTDVNSI...RSY

My code (shortened version):

from Bio import SeqIO

every = []   # repr() of every protein sequence
length = []  # length of every sequence

# going through each item in Y100.fasta
for seq_record in SeqIO.parse("Y100.fasta.", "fasta"):
    every.append(repr(seq_record.seq))
    length.append(len(seq_record))

print(every)
modified 8 months ago by Joe14k • written 8 months ago by gaiboyan230
8 months ago by
Joe14k
United Kingdom
Joe14k wrote:

The sequence doesn't actually get replaced with "...".

All that's happening is that when you display it, the method that governs how it's shown in a terminal truncates it. You do not need to call repr() here; it is intended mostly for debugging, and for a long Seq it returns a shortened representation. It's sufficient to just print the record.seq object, or to convert it with str(record.seq), which gives the full sequence as a plain string.
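As a minimal sketch of the difference, the snippet below parses a made-up record from an in-memory FASTA string (the record name and sequence are hypothetical stand-ins, not taken from Y100.fasta) and compares repr() with str() on the same Seq object. It assumes Biopython is installed.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical in-memory FASTA record standing in for one entry of a real file.
fake_fasta = StringIO(">demo_protein\n" + "MKV" * 400 + "\n")
record = SeqIO.read(fake_fasta, "fasta")

# repr() gives a shortened debugging view of a long Seq (it contains "...").
debug_view = repr(record.seq)

# str() gives the complete sequence as an ordinary Python string.
full_sequence = str(record.seq)

print(debug_view)
print(len(full_sequence))
```

In the loop from the question, appending str(seq_record.seq) instead of repr(seq_record.seq) should therefore store the full sequences.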

See the following example with the massive titin protein:

>>> from Bio import SeqIO
>>> titin = SeqIO.read('titin.fasta', 'fasta')
>>> titin
SeqRecord(seq=Seq('MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGV...TLT', SingleLetterAlphabet()), id='Titin', name='Titin', description='Titin', dbxrefs=[])
>>> print(titin)
ID: Titin
Name: Titin
Description: Titin
Number of features: 0
Seq('MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGV...TLT', SingleLetterAlphabet())
>>> print(titin.seq)
MTTQAPTFTQPLQSVVVLEGSTATFEAHISGFPVPEVSWFRDGQVISTSTLPGVQISFSDGRAKLTIPAVTKANSGRYSLKATNGSGQATSTAELLVKAETAPPNFVQRLQSMTVRQGSQVRLQVRVTGIPTPVVKFYRDGAEIQSSLDFQISQEGDLYSLLIAEAYPEDSGTYSVNATNSVGRATSTAELLVQGEEEVPAKKTKTIVSTAQISESRQTRIEKKIEAHFDARSIATVEMVIDGAAGQQLPHKTPPRIPPKPKSRSPTPPSIAAKAQLARQQSPSPIRHSPSPVRHVRAPTPSPVRSVSPAARISTSPIRSVRSPLLMRKTQASTVATGPEVPPPWKQEGYVASSSEAEMRETTLTTSTQIRTEERWEGRYGVQEQVTISGAAGAAASVSASASYAAEAVATGAKEVKQDADKSAAVATVVAAVDMARVREPVISAVEQTAQRTTTTAVHIQPAQEQVRKEAEKTAVTKVVVAADKAKEQELKSRTKEVITTKQEQMHVTHEQIRKETEKTFVPKVVISAAKAKEQETRISEEITKKQKQVTQEAIRQETEITAASMVVVATAKSTKLETVPGAQEETTTQQDQMHLSYEKIMKETRKTVVPKVIVATPKVKEQDLVSRGREGITTKREQVQITQEKMRKEAEKTALSTIAVATAKAKEQETILRTRETMATRQEQIQVTHGKVDVGKKAEAVATVVAAVDQARVREPREPGHLEESYAQQTTLEYGYKERISAAKVAEPPQRPASEPHVVPKAVKPRVIQAPSETHIKTTDQKGMHISSQIKKTTDLTTERLVHVDKRPRTASPHFTVSKISVPKTEHGYEASIAGSAIATLQKELSATSSAQKITKSVKAPTVKPSETRVRAEPTPLPQFPFADTPDTYKSEAGVEVKKEVGVSITGTTVREERFEVLHGREAKVTETARVPAPVEIPVTPPTLVSGLKNVTVIEGESVTLECHISGYPSPTVTWYREDYQIESSIDFQITFQSGIARLMIREAFAEDSGRFTCSAVNEAGTVSTSCYLAVQVSEEFEKETTAVTEKFTTEEKRFVESRDVVMTDTSLTEEQAGPGEPAAPYFITKPVVQKLVEGGSVVFGCQVGGNPKPHVYWKKSGVPLTTGYRYKVSYNKQTGECKLVISMTFADDAGEYTIVVRNKHGETSASASLLEEADYELLMKSQQEMLYQTQVTAFVQEPKVGETAPGFVYSEYEKEYEKEQALIRKKMAKDTVVVRTYVEDQEFHISSFEERLIKEIEYRIIKTTLEELLEEDGEEKMAVDISESEAVESGFDSRIKNYRILEGMGVTFHCKMSGYPLPKIAWYKDGKRIKHGERYQMDFLQDGRASLRIPVVLPEDEGIYTAFASNIKGNAICSGKLYVEPAAPLGAPTYIPTLEPVSRIRSLSPRSVSRSPIRMSPARMSPARMSPARMSPARMSPGRRLEETDESQLERLYKPVFVLKPVSFKCLEGQTARFDLKVVGRPMPETFWFHDGQQIVNDYTHKVVIKEDGTQSLIIVPATPSDSGEWTVVAQNRAGRSSISVILTVEAVEHQVKPMFVEKLKNVNIKEGSRLEMKVRATGNPNPDIVWLKNSDIIVPHKYPKIRIEGTKGEAALKIDSTVSQDSAWYTATAINKAGRDTTRCKVNVEVEFAEPEPERKLIIPRGTYRAKEIAAPELEPLHLRYGQEQWEEGDLYDKEKQQKPFFKKKLTSLRLKRFGPAHFECRLTPIGDPTMVVEWLHDGKPLEAANRLRMINEFGYCSLDYGVAYSRDSGIITCRATNKYGTDHTSATLIVKDEKSLVEESQLPEGRKGLQRIEELERMAHEGALTGVTTDQKEKQKPDIVLYPEPVRVLEGETARFRCRVTGYPQPKVNWYLNGQLIRKSKRFRVRYDGIHYLDIVDCKSYDTGEVKVTAENPEGVIEHKVKLEIQQREDFRSVLRRAPEPRPEFHVHEPGKLQFEVQKVDRPVDTTETKEVVKLKRAERITHEKVPEESEELRSK
FKRRTEEGYYEAITAVELKSRKKDESYEELLRKTKDELLHWTKELTEEEKKALAEEGKITIPTFKPDKIELSPSMEAPKIFERIQSQTVGQGSDAHFRVRVVGKPDPECEWYKNGVKIERSDRIYWYWPEDNVCELVIRDVTAEDSASIMVKAINIAGETSSHAFLLVQAKQLITFTQELQDVVAKEKDTMATFECETSEPFVKVKWYKDGMEVHEGDKYRMHSDRKVHFLSILTIDTSDAEDYSCVLVEDENVKTTAKLIVEGAVVEFVKELQDIEVPESYSGELECIVSPENIEGKWYHNDVELKSNGKYTITSRRGRQNLTVKDVTKEDQGEYSFVIDGKKTTCKLKMKPRPIAILQGLSDQKVCEGDIVQLEVKVSLESVEGVWMKDGQEVQPSDRVHIVIDKQSHMLLIEDMTKEDAGNYSFTIPALGLSTSGRVSVYSVDVITPLKDVNVIEGTKAVLECKVSVPDVTSVKWYLNDEQIKPDDRVQAIVKGTKQRLVINRTHASDEGPYKLIVGRVETNCNLSVEKIKIIRGLRDLTCTETQNVVFEVELSHSGIDVLWNFKDKEIKPSSKYKIEAHGKIYKLTVLNMMKDDEGKYTFYAGENMTSGKLTVAGGAISKPLTDQTVAESQEAVFECEVANPDSKGEWLRDGKHLPLTNNIRSESDGHKRRLIIAATKLDDIGEYTYKVATSKTSAKLKVEAVKIKKTLKNLTVTETQDAVFTVELTHPNVKGVQWIKNGVVLESNEKYAISVKGTIYSLRIKNCAIVDESVYGFRLGRLGASARLHVETVKIIKKPKDVTALENATVAFEVSVSHDTVPVKWFHKSVEIKPSDKHRLVSERKVHKLMLQNISPSDAGEYTAVVGQLECKAKLFVETLHITKTMKNIEVPETKTASFECEVSHFNVPSMWLKNGVEIEMSEKFKIVVQGKLHQLIIMNTSTEDSAEYTFVCGNDQVSATLTVTPIMITSMLKDINAEEKDTITFEVTVNYEGISYKWLKNGVEIKSTDKCQMRTKKLTHSLNIRNVHFGDAADYTFVAGKATSTATLYVEARHIEFRKHIKDIKVLEKKRAMFECEVSEPDITVQWMKDDQELQITDRIKIQKEKYVHRLLIPSTRMSDAGKYTVVAGGNVSTAKLFVEGRDVRIRSIKKEVQVIEKQRAVVEFEVNEDDVDAHWYKDGIEINFQVQERHKYVVERRIHRMFISETRQSDAGEYTFVAGRNRSSVTLYVNAPEPPQVLQELQPVTVQSGKPARFCAVISGRPQPKISWYKEEQLLSTGFKCKFLHDGQEYTLLLIEAFPEDAAVYTCEAKNDYGVATTSASLSVEVPEVVSPDQEMPVYPPAIITPLQDTVTSEGQPARFQCRVSGTDLKVSWYSKDKKIKPSRFFRMTQFEDTYQLEIAEAYPEDEGTYTFVASNAVGQVSSTANLSLEAPESILHERIEQEIEMEMKELFSEGESEHSERDTRDAFSDSEDIDHKSMAAKRYASRISSTSSWPEYFKPSFTQKLTFKYVLEGEPVVFTCRLIACPTPEMTWFHNNRPIPTGLRRIIKAESDLHHHSSSLEIKRVQDRDSGSYRLLAINSEGSAESTASLLVIQKGQDEKYLEFLKRAERTHENVEALVERGEDRIKVDLRFTGSPFNKKQDVEQKGMMRTIHFKTMSSAKKTDYMYDEEYLESKSDIRGWLNVGESFLDKETKVKLQRLREARKTLMEKKKLSLLDTSSEISSRTLRSEASDKDILFSREDMKIRSMSDLAESYKVDHSAESIVQNPHALSNQMDQNIESEELPTSFQTIVDEEIFQTEIRMSQEALVKESLPKDHLYGEILVNENTQARGQLEEIMANTTIGESSTYITNVCEKEEVYETPENVSQAITPHASESFGTLVNVEESEEIASERIKKDDLRELQLSASTRIDEFKTEQKEENMRFFENSFRKRPQRCPPSFLQEIESQEVYEGD...
modified 8 months ago • written 8 months ago by Joe14k

(Ironically, titin is so big that I had to artificially truncate the protein when pasting it into BioStars, as there's a limit of 5000 characters!)

Here's a screengrab just in case you don't believe me haha

[Image: Screenshot-2019-01-30-at-18-18-57]

modified 8 months ago • written 8 months ago by Joe14k

Ah interesting to see someone else working with titin, and thank you for the response.

I tried your code and my protein sequence still gets truncated. Is there a way to get the non-truncated sequence (I'm putting them in an Excel file)? My proteins are not nearly as long as titin (the longest is a little over 1000 residues).

written 8 months ago by gaiboyan230

Can you update your post with the full code you’re using? There is probably another error somewhere in it, as Biopython is a very mature package at this stage and it’s highly unlikely to be an issue with the codebase.

written 8 months ago by Joe14k
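
For the spreadsheet use case mentioned above, one possible approach is to write plain strings to a CSV file that Excel can open. The sketch below uses only the standard library; the function name write_sequences_csv, the column names, and the example rows are all hypothetical. With Biopython, the rows could instead be built as (rec.id, str(rec.seq)) for each record returned by SeqIO.parse.

```python
import csv
import io


def write_sequences_csv(rows, handle):
    """Write (id, sequence) pairs as CSV rows with a length column."""
    writer = csv.writer(handle)
    writer.writerow(["id", "length", "sequence"])
    for seq_id, sequence in rows:
        # sequence must already be a plain str, e.g. str(record.seq)
        writer.writerow([seq_id, len(sequence), sequence])


# Hypothetical example data; real rows would come from the parsed FASTA file.
rows = [
    ("protein_1", "MDSTLTASEIRQ" * 100),
    ("protein_2", "MAAYKLVLIRHG" * 90),
]

buffer = io.StringIO()  # swap for open("sequences.csv", "w", newline="") to write a file
write_sequences_csv(rows, buffer)
csv_text = buffer.getvalue()
```

Because the sequences are stored as ordinary strings before they are written, nothing in this path shortens them for display.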