Greetings. I am trying to compare two FASTA files (one with a single record, the other with multiple records) using a sliding window approach. The second FASTA file is very large (88 GB); the sequence in the first is 14 kb.
Here's an example of what I want to do in Python:
first_fasta
>1
A C T T
second_fasta
>2
C C A C A C T T C T
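The sliding window approach would use every window size from 1 up to the length of the first sequence, and count how often each distinct window from the first sequence occurs in the second:
window size 1 = A, C, T = 2, 5, 3 = 10
window size 2 = AC, CT, TT = 2, 2, 1 = 5
window size 3 = ACT, CTT = 1, 1 = 2
window size 4 = ACTT = 1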
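The counting in this toy example can be sketched in a few lines of plain Python. This is only a minimal illustration, assuming both sequences fit in memory as strings with the spaces removed; the function name `window_sums` is made up for the example. Each distinct window of the first sequence is counted once per (possibly overlapping) position where it occurs in the second. For window size 1 the counts 2, 5 and 3 add up to 10.

```python
def window_sums(query, target):
    """For each window size k, sum the occurrences in `target` of the
    distinct length-k windows taken from `query` (overlaps are counted)."""
    sums = {}
    for k in range(1, len(query) + 1):
        # Distinct windows of size k from the first (query) sequence.
        query_windows = {query[i:i + k] for i in range(len(query) - k + 1)}
        # Slide a window of size k over the target and count the hits.
        sums[k] = sum(
            1
            for j in range(len(target) - k + 1)
            if target[j:j + k] in query_windows
        )
    return sums


query = "ACTT"          # first_fasta, record 1
target = "CCACACTTCT"   # second_fasta, record 2
for size, total in window_sums(query, target).items():
    print(size, total)
```

The resulting dictionary maps window size to the summed count, which is the data needed for the plot. Trying every window size like this is fine for a test case, but it would likely be far too slow for a 14 kb query against a multi-gigabyte file.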
As the end result, I will plot the sum for each window size in a histogram, with frequency on the y axis and window size on the x axis. Hopefully the plot will show a single count for the largest window size, indicating that the full first sequence is contained in the second file.
I've created two small test files, but can't seem to get it right. I am new to Python and would appreciate any help. Is the best approach to modify Biopython's nucleic acid dot plot procedure (section 18.2.3 of http://biopython.org/DIST/docs/tutorial/Tutorial.html)?
What is the exact output you expect for the sequences you posted?
For the sequences posted, I would need the output to be the sum of the counts for each window size, which I can then use for the histogram. However, the second file would be a large multi-sequence FASTA file. I've been attempting to remove all the headers and treat the entire file as a single string, but it is too large.
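One way to avoid holding the whole file as a single string is to read it one record at a time and add up the counts per record. The sketch below assumes each individual record fits in memory, and the helper names `read_fasta` and `count_windows_in_file` are hypothetical. Biopython's `SeqIO.parse(handle, "fasta")` could play the role of `read_fasta`. Counting per record also means no window spans the artificial junction that appears when headers are stripped and records are concatenated.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs, one record at a time."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            # Drop spaces inside sequence lines and normalise the case.
            chunks.append(line.replace(" ", "").upper())
    if header is not None:
        yield header, "".join(chunks)


def count_windows_in_file(handle, query, k):
    """Sum, over all records, the positions whose length-k window
    also occurs somewhere in `query`."""
    query_windows = {query[i:i + k] for i in range(len(query) - k + 1)}
    total = 0
    for _header, seq in read_fasta(handle):
        total += sum(
            1
            for j in range(len(seq) - k + 1)
            if seq[j:j + k] in query_windows
        )
    return total


# Small in-memory stand-in for the large file (made-up records).
demo = io.StringIO(">2\nCCACAC\nTTCT\n>3\nACTTACTT\n")
print(count_windows_in_file(demo, "ACTT", 4))
```

For a real file, `open("second.fasta")` (a placeholder path) would replace the `io.StringIO` object.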
Seriously? The second file is an 88 gigabyte FASTA file? Wow. Are you sure this is a sensible thing to be searching with a sliding window like this?
Do you have an alternative suggestion, Peter? I need to know which parts of the first 14 kb sequence occur in the second file, at what lengths, and how frequently. I thought the sliding window approach explained above would give the most accurate results.
My apologies, the FASTA file is 8 GB.