Python -- Compare Fasta Sequences Using 'Sliding Window' Approach
1
0
Entering edit mode
10.2 years ago
st.ph.n ★ 2.7k

Greetings. I am trying to compare two FASTA files (one with one record, and another with multiple), using a sliding window approach. The second FASTA is very large (88gb), the first's sequence is 14kb.

Here's an example of what I want to do in python:

first_fasta

> 1
A C T T

second_fasta

>2
C C A C A C T T C T

Using a sliding window approach, which would depend on the length of the first.

window 1 = A C T = 2, 5, 3 = 11
window 2 = AC, CT, TT = 2, 2, 1 = 5
window 3 = ACT, CTT = 1, 1 = 2
window 4 = ACTT = 1

The end results I will plot the sums of the window sizes, in a histogram, with frequency on the y axis and window size on the x. The end result hopefully showing a single count for the last window size, indicating the full first sequence is contained in the second file.

I've created two small test files, but can't seem to get it right. I am new to python and appreciate any help. Is the best approach to modify Biopython's nucleic acid dot plot procedure? (http://biopython.org/DIST/docs/tutorial/Tutorial.html) (section 18.2.3)

python fasta biopython • 4.4k views
ADD COMMENT
1
Entering edit mode

what is the exact output you expect for the sequences you posted?

ADD REPLY
0
Entering edit mode

For the sequences posted, I would need the output to be the sums of the counts for each window, which I can use for the histogram. However the second file would be a large multi seq. fasta file. I've been attempting to remove all the headers and treat the entire file as a string, but it is too large.

ADD REPLY
0
Entering edit mode

Seriously? The second file is an 88 gigabyte FASTA file? Wow. Are you sure this is a sensible thing to be searching with a sliding window like this?

ADD REPLY
0
Entering edit mode

Do you have an alternative suggestion, Peter? I need to know what occurences of the first 14kb sequence occur in the second file, and to what length of the sequence and frequency. I thought a sliding window approach as explained above would give the most desired and accurate results.

ADD REPLY
0
Entering edit mode

My appologies, the FASTA file is 8gb.

ADD REPLY
0
Entering edit mode
10.2 years ago
Peter 6.0k

The nucleotide dot plot example in the Biopython Tutorial was never intended for such massive datasets - it is relatively slow, and was intended as an illustrative example.

Have a look online for dedicated tools for this, perhaps something from the mummer suite? http://mummer.sourceforge.net/examples/

ADD COMMENT

Login before adding your answer.

Traffic: 2998 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6