Wrapping protein sequence output using Python
6.2 years ago
robjohn7000 ▴ 110

Hi,

Can someone please show me how to output protein sequence wrapped at 70 characters per line using the following code:

 

for record in SeqIO.parse(filename, "genbank"):
    for feature in record.features:
        if feature.type == "CDS":
            locus_tag = feature.qualifiers.get("locus_tag", ["NoGeneID"])[0]
            gene = feature.qualifiers.get("gene", ["NoGeneID"])[0]
            outfile.write("\t".join([locus_tag,gene])+"\n")

 

sequence gene python biopython • 2.1k views

Untested, but add this function somewhere at the top:

def split_every_70(s): return [s[i:i+70] for i in range(0,len(s),70)]

then replace outfile.write(.....) with:

outfile.write("\t".join([locus_tag, "\n".join(split_every_70(gene))]) + "\n")

Also, are you trying to make a fasta file or something? Because if you want to keep this a tab-delimited table then using \n in both your data and for your newline delimiter is a bad idea.

But maybe that's just how it goes in the magical world of bioinformatics data formats :>
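
If FASTA-style output is the goal, the slicing idea above could be combined with a header line along these lines. This is only a sketch: `fasta_record`, the `locus_0001` header, and the sequence are made up for illustration and are not part of the original code.

```python
def split_every(s, width=70):
    """Return consecutive chunks of s, each at most `width` characters long."""
    return [s[i:i + width] for i in range(0, len(s), width)]


def fasta_record(header, seq, width=70):
    """Format one FASTA-style record (hypothetical helper, not from the thread)."""
    return ">" + header + "\n" + "\n".join(split_every(seq, width)) + "\n"


# Made-up 150-residue sequence, for illustration only.
protein = "MKV" * 50
record = fasta_record("locus_0001", protein)
```

Writing `record` with `outfile.write(record)` keeps the newlines inside the sequence from colliding with a tab-delimited layout, because each record then owns its lines.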

6.2 years ago
James Ashmore ★ 3.2k

Python has a nifty module called textwrap which you can use to wrap a long string, for example:

import textwrap
dna_seq = 'GTAAGTCCGCTCGCGTAGCTAGCTAGCTGACTGACTGACTGATCGAT'
textwrap.fill(dna_seq, width=5)
'GTAAG\nTCCGC\nTCGCG\nTAGCT\nAGCTA\nGCTGA\nCTGAC\nTGACT\nGATCG\nAT'
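
For the 70-character case in the question, the same call might look roughly like the sketch below. The `wrapped_protein` helper and the sequence are made up for illustration, and the commented-out lookup assumes the CDS features in the GenBank file carry a `translation` qualifier, which the original code does not show.

```python
import textwrap


def wrapped_protein(protein, width=70):
    """Wrap a sequence string at `width` characters per line (hypothetical helper)."""
    # A sequence contains no whitespace, so textwrap breaks it at exactly `width`.
    return textwrap.fill(protein, width=width)


# Made-up 160-residue sequence, for illustration only.
protein = "ACDEFGHIKL" * 16
wrapped = wrapped_protein(protein)

# Inside the loop from the question this could become (assumption: the
# feature has a "translation" qualifier holding the protein sequence):
#     protein = feature.qualifiers.get("translation", [""])[0]
#     outfile.write(wrapped_protein(protein) + "\n")
```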

This is cool - and it's in the standard lib :) Thanks for sharing!


Pretty nice! Thanks James.


Great! Exactly what I wanted. It did the job. Many thanks to you James.

6.2 years ago
glihm ▴ 630

Hi there,

When I have to write a sequence with a line break every n characters, I use Python's re module. In the example below, "\n" is inserted after every 4 characters.

import re

seq = "ATCG" * 10
formatted_seq = re.sub("(.{4})", "\\1\n", seq, count=0, flags=re.DOTALL)
print(formatted_seq)

So, you can format your variable containing the sequence before writing it!
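
One thing to watch with the regex approach: when the sequence length is an exact multiple of the block size, as in the example, the result ends with an extra "\n". A small sketch of one way to handle that, assuming the sequence itself contains no newlines (`wrap_regex` is a made-up name):

```python
import re


def wrap_regex(seq, width):
    """Insert a newline after every full block of `width` characters."""
    wrapped = re.sub("(.{%d})" % width, "\\1\n", seq, flags=re.DOTALL)
    # Drop the trailing newline left behind when len(seq) is a multiple of width.
    return wrapped.rstrip("\n")


formatted_seq = wrap_regex("ATCG" * 10, 4)
```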


John and Wocka, thanks for your help and time.
