Question

Python: Slicing Sequences In Fasta File

10.7 years ago

Bioinformatics • 0

I want to slice the sequences in a FASTA file. I take the first three sequences (I must calculate the length of each one). For example, given the three sequences below, I want to divide each one into sub-sequences of the same length.

That is, the length of the first is 28, the second is 39, and the third is 46. I divide each sequence into pieces of 9 bases: 28/9 = 3 with a remainder of 1, so the last sub-sequence contains the single base 'G', and in this case I must pad it with the character '-'. Likewise 39/9 = 4 (handled the same way as the first sequence) and 46/9 = 5 (the same).

>gi|2765658|emb|Z78533.1|CIZ78533 C.irapeanum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTG

>gi|2765657|emb|Z78532.1|CCZ78532 C.californicum 5.8S rRNA gene and ITS1 and ITS2 DNA
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATC


>gi|2765646|emb|Z78521.1|CCZ78521 C.calceolus 5.8S rRNA gene and ITS1 and ITS2 DNA
GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATAT

Then I take the first three sub-sequences from each sequence:

CGTAACAAG GTTTCCGTA GGTGAACCT 
CGTAACAAG GTTTCCGTA GGTGAACCT
GTAGGTGAA CCTGCGGAA GGATCATTG

Then I apply some function to each sub-sequence:

function1('CGTAACAAG'), function1('GTTTCCGTA'), ...

The same thing with function2.

I want to apply this to all the sequences in the FASTA file, that is, taking three sequences at a time.

What can I do?

python sequence • 3.2k views

updated 10.7 years ago by Istvan Albert 100k • written 10.7 years ago by Bioinformatics • 0

score 0 · Answer 1 · 2013-08-04

From your example it is unclear what the role of the - character is; your output does not seem to show it. Splitting the string into fixed-size pieces could be done like so:

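step = 9
seq = "CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATC"

# collect the full-length pieces; a shorter leftover at the end is dropped
parts = []
for i in range(len(seq) // step):  # integer division, also valid in Python 3
    sub = seq[i * step: (i + 1) * step]
    parts.append(sub)

print(parts)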

would print

['CGTAACAAG', 'GTTTCCGTA', 'GGTGAACCT', 'GCGGAAGGA']
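
If the leftover bases should be kept and padded with - (as the question describes for the trailing 'G'), and the file should be walked three records at a time, a minimal sketch might look like the one below. It is written for Python 3 and uses only the standard library. The names split_padded, read_fasta and in_groups are made up for this illustration, the record headers in the demo are shortened placeholders, and the hand-rolled FASTA parsing is an assumption (a library such as Biopython could do that part instead).

```python
from io import StringIO
from itertools import islice


def split_padded(seq, step=9, pad="-"):
    """Split seq into pieces of length step; pad a short last piece with pad."""
    parts = [seq[i:i + step] for i in range(0, len(seq), step)]
    if parts and len(parts[-1]) < step:
        parts[-1] = parts[-1].ljust(step, pad)
    return parts


def read_fasta(handle):
    """Yield (header, sequence) pairs from an open FASTA handle."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines between records
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def in_groups(records, size=3):
    """Yield lists of up to size records at a time."""
    records = iter(records)
    while True:
        group = list(islice(records, size))
        if not group:
            return
        yield group


# Stand-in for a real file; an opened FASTA file could be passed instead.
demo = StringIO(""">seq1
CGTAACAAGGTTTCCGTAGGTGAACCTG
>seq2
CGTAACAAGGTTTCCGTAGGTGAACCTGCGGAAGGATC
>seq3
GTAGGTGAACCTGCGGAAGGATCATTGTTGAGACAGTAGAATATAT
""")

for group in in_groups(read_fasta(demo), size=3):
    for header, seq in group:
        first_three = split_padded(seq)[:3]  # first three sub-sequences
        print(header, first_three)
```

Here split_padded("CGTAACAAGGTTTCCGTAGGTGAACCTG") returns the three full pieces followed by 'G--------'. The function1 and function2 calls from the question would be applied to each item of first_three inside the inner loop.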