Question: Python -- Compare Fasta Sequences Using 'Sliding Window' Approach
0
gravatar for st.ph.n
5.8 years ago by
st.ph.n2.5k
Philadelphia, PA
st.ph.n2.5k wrote:

Greetings. I am trying to compare two FASTA files (one with one record, and another with multiple), using a sliding window approach. The second FASTA is very large (88gb), the first's sequence is 14kb.

Here's an example of what I want to do in python:

first_fasta

> 1
A C T T

second_fasta

>2
C C A C A C T T C T

Using a sliding window approach, which would depend on the length of the first.

window 1 = A C T = 2, 5, 3 = 11
window 2 = AC, CT, TT = 2, 2, 1 = 5
window 3 = ACT, CTT = 1, 1 = 2
window 4 = ACTT = 1

The end results I will plot the sums of the window sizes, in a histogram, with frequency on the y axis and window size on the x. The end result hopefully showing a single count for the last window size, indicating the full first sequence is contained in the second file.

I've created two small test files, but can't seem to get it right. I am new to python and appreciate any help. Is the best approach to modify Biopython's nucleic acid dot plot procedure? (http://biopython.org/DIST/docs/tutorial/Tutorial.html) (section 18.2.3)

fasta python biopython • 2.7k views
ADD COMMENTlink modified 5.8 years ago by Peter5.8k • written 5.8 years ago by st.ph.n2.5k
1

what is the exact output you expect for the sequences you posted?

ADD REPLYlink written 5.8 years ago by brentp23k

For the sequences posted, I would need the output to be the sums of the counts for each window, which I can use for the histogram. However the second file would be a large multi seq. fasta file. I've been attempting to remove all the headers and treat the entire file as a string, but it is too large.

ADD REPLYlink written 5.8 years ago by st.ph.n2.5k

Seriously? The second file is an 88 gigabyte FASTA file? Wow. Are you sure this is a sensible thing to be searching with a sliding window like this?

ADD REPLYlink written 5.8 years ago by Peter5.8k

Do you have an alternative suggestion, Peter? I need to know what occurences of the first 14kb sequence occur in the second file, and to what length of the sequence and frequency. I thought a sliding window approach as explained above would give the most desired and accurate results.

ADD REPLYlink written 5.8 years ago by st.ph.n2.5k

My appologies, the FASTA file is 8gb.

ADD REPLYlink written 5.8 years ago by st.ph.n2.5k
0
gravatar for Peter
5.8 years ago by
Peter5.8k
Scotland, UK
Peter5.8k wrote:

The nucleotide dot plot example in the Biopython Tutorial was never intended for such massive datasets - it is relatively slow, and was intended as an illustrative example.

Have a look online for dedicated tools for this, perhaps something from the mummer suite? http://mummer.sourceforge.net/examples/

ADD COMMENTlink written 5.8 years ago by Peter5.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1269 users visited in the last hour