Question: Fasta Sliding Window
0
gravatar for st.ph.n
6.9 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

I'm trying to write a python script that uses a sliding window. Here is the code:

v = open("ex.fasta", "r")

def sliding_window(sequence, winSize, step):
        numOfChunks = ((len(sequence)-winSize)/step)+1
        for i in range(0,numOfChunks*step,step):
                yield sequence[i:i+winSize]

size = int(14786)
w = 500
while size > w:
        for line in v:
                if not line.startswith(">"):
                         myseq = line.rstrip()
                         myvect = sliding_window(myseq, 500, 500)
                         for r in myvect:
                                  print(r)

I want it to be able to produce chunks of the sequence in window sizes of 500, with a step size of 500, i.e. no overlap. However, the trouble I'm having is the lines for the fasta file are 76bp long for all lines except the last is ~40bp. Choosing anything <= 76 it will produce the desired outcome. Anything > 76 does not work. I've tried creating a string (using the whole sequence rather than a list, but it still does not work. Any help is appreciated.

python • 6.6k views
ADD COMMENTlink modified 6.9 years ago by a.zielezinski9.4k • written 6.9 years ago by st.ph.n2.5k

I really didn't get your question but you can concatenate multiple lines to make a complete sequence. See here:A: multiline fasta to single line fasta

ADD REPLYlink written 6.9 years ago by Ashutosh Pandey12k

Thanks Ashutosh, but i'm in the middle of learning python. I'd like to stay away from one-liners and perl, and keep it all to a single script.

ADD REPLYlink written 6.9 years ago by st.ph.n2.5k
4
gravatar for a.zielezinski
6.9 years ago by
a.zielezinski9.4k
a.zielezinski9.4k wrote:

As Philip and Neil suggested, use BioPython for reading FASTA records. As for memory usage, the parse function from SeqIO module is a generator that iterates over FASTA records and holds in memory one record at a time.

Hope this is what you wanted:

UPDATED 2:

Say, ex.fasta contains:

>seq
MAFAE
PAASS
LPNGD
CGR

Running this:

from Bio import SeqIO

def chunks(seq, win, step):
    seqlen = len(seq)
    for i in range(0,seqlen,step):
        j = seqlen if i+win>seqlen else i+win
        yield seq[i:j]
        if j==seqlen: break

for seq_record in SeqIO.parse("ex.fasta", "fasta"):
    seq = seq_record.seq
    for subseq in chunks(seq, 5, 5):
        print subseq

Outputs:

MAFAE
PAASS
LPNGD
CGR
ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by a.zielezinski9.4k

Thanks a.zielezinski. I'll take a closer look at your suggestion.

I modified the code a bit, using Biopython to read the flie. But it's cuting the file off and I"m not sure at what bp position, but it's not finishing the file. I think it's because the size of the file (in bp) is not divisible by 500. So, I tried to tell it to break when it hits the negative number it would hit (I realize that this is bad practice, but this is for testing purposes at the moment), but it's still not printing it.

from Bio import SeqIO

for record in SeqIO.parse("ex.fasta", "fasta"):
        v = record.seq

def sliding_window(sequence, winSize, step):
        numOfChunks = ((len(sequence)-winSize)/step)+1
        for i in range(0,numOfChunks*step,step):
                yield sequence[i:i+winSize]

size = 14820
m = 500

for line in v:
        myvect = sliding_window(v, m, m)
        for r in myvect:
                print(r)
        size -= m
        if size <= -180:
                break

Suggestions on this current piece of code is appreciated.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by st.ph.n2.5k

Your approach would work since I want step size equal to the window size so there is no overlap. What if I wanted say a step size of 1? That can be easily modified by the function I wrote above, by changing the input for the step size.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by st.ph.n2.5k

I just updated my answer. I included your function as it allows step to be smaller than winSize. Does this answer your question?

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by a.zielezinski9.4k

Yes. In order to change the step size. So, your code works exactly like the edit I posted. However, the issue I'm still having is the last window for the sequence is not printed b/c subtracting 500 from it puts it below 0 with size -= m. I sill need that last bit of sequence, and am unsure how to tell it to print.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by st.ph.n2.5k

Now I understand. I fixed it and updated my answer.

ADD REPLYlink written 6.9 years ago by a.zielezinski9.4k
1

Perfect! Dziękuję bardzo!

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by st.ph.n2.5k
1
gravatar for Philipp Bayer
6.9 years ago by
Philipp Bayer6.9k
Australia/Perth/UWA
Philipp Bayer6.9k wrote:

For the sequence as one string, BioPython is the laziest way.

from Bio import SeqIO
for seq in SeqIO.parse("ex.fasta", "fasta"):
    myseq = str(seq.seq) # is now one huge string of all your nucleotides for each sequence in ex.fasta

For the sliding window, have a look at collections.deque. Just intialize a deque with your string and pop 500 from the left; if it raises an IndexError, you're at the end. Will post something later.

Edit: Here's something ugly with collections.deque:

def slide(a_deque, size):
    done = False
    while True:
        counter = 0
        sub = []
        while counter < size:
            try:
                sub.append(a_deque.popleft())
            except IndexError:
                done = True
                break
            counter += 1
        print sub
        if done:
            break
from collections import deque
slide(deque(myseq), 500)

Each of the "sub"-lists should be of length 500 if size = 500, except for the last one. You can then do with "sub" whatever you like. Haven't really tested this.

Note: This method looses the speed advantages of a deque since it re-introduces a list. There are many better ways.

ADD COMMENTlink modified 6.9 years ago • written 6.9 years ago by Philipp Bayer6.9k
1

Aha, you posted this as I was writing mine :) Well, no harm in keeping both answers.

ADD REPLYlink written 6.9 years ago by Neilfws49k

I wouldn't recommend this 'slide' function to anyone:) The 'sliding_window' function from tue87843 is much simpler and faster.

ADD REPLYlink written 6.9 years ago by a.zielezinski9.4k
0
gravatar for Neilfws
6.9 years ago by
Neilfws49k
Sydney, Australia
Neilfws49k wrote:

There's no need to reinvent the FASTA parser. Biopython will do this for you. See the tutorial.

Example: let's say file myseq.fa contains:

>myseq1
acaatcatactcta
aagatcgatgatag

Then the sequence string is accessed like so:

>>> from Bio import SeqIO
>>> for record in SeqIO.parse("seq.fa", "fasta"):
...     print(record.seq)
... 
acaatcatactctaaagatcgatgatag
ADD COMMENTlink written 6.9 years ago by Neilfws49k

I've used Biopython for other things, and I know how to open files and do some other nifty things. Does using parse store it to memory? I would later like to use this script for a larger fasta file, say human sequence.

ADD REPLYlink written 6.9 years ago by st.ph.n2.5k

No it loads it up from the disk to memory for each entry in the fasta file. If you put the whole human genome as a single fasta entry it will load all of it into ram. Python isn't the leanest with memory as it is and biopython doesn't help, I suggest you avoid it and use numpy character arrays to reduce the memory footprint if you must work with very large sequence.

ADD REPLYlink modified 6.9 years ago • written 6.9 years ago by pld4.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2217 users visited in the last hour
_