Greetings. I am trying to compare two FASTA files (one with a single record, the other with multiple records) using a sliding window approach. The second FASTA file is very large (88 GB); the sequence in the first is 14 kb long.
Here's an example of what I want to do in Python:
    >1
    ACTT
    >2
    CCACACTTCT
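The window size would run from 1 up to the length of the first sequence. For each size, I count how often each window of the first sequence occurs in the second, then sum the counts: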
    window 1 = A, C, T = 2, 5, 3 = 10
    window 2 = AC, CT, TT = 2, 2, 1 = 5
    window 3 = ACT, CTT = 1, 1 = 2
    window 4 = ACTT = 1
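To make the counting concrete, here is a minimal sketch of what I have in mind, using only the standard library. The function names `read_fasta` and `window_sums` are my own placeholders, not Biopython calls, and I am assuming that each distinct window of the first sequence is counted once and that overlapping matches in the second file all count. It reads the second file one record at a time, but it builds every window of the first sequence in memory, so it is only meant for the toy example and would not scale to a 14 kb query as written.

```python
from io import StringIO


def read_fasta(handle):
    """Yield (header, sequence) pairs one record at a time from an open file."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:].strip(), []
        else:
            chunks.append(line.upper())
    if header is not None:
        yield header, "".join(chunks)


def window_sums(query, target_records):
    """Map each window size k to the total number of positions in the
    target sequences where some distinct k-length window of query occurs."""
    # Distinct windows of the query for every size (toy-sized inputs only).
    query_windows = {
        k: {query[i:i + k] for i in range(len(query) - k + 1)}
        for k in range(1, len(query) + 1)
    }
    sums = {k: 0 for k in query_windows}
    for _header, seq in target_records:
        for k, windows in query_windows.items():
            # Slide a window of size k along the target, one base at a time.
            for i in range(len(seq) - k + 1):
                if seq[i:i + k] in windows:
                    sums[k] += 1
    return sums


if __name__ == "__main__":
    # The second test record from the example above, split over two lines.
    target = StringIO(">2\nCCACA\nCTTCT\n")
    print(window_sums("ACTT", read_fasta(target)))
```

For real files, `StringIO` would be replaced by `open("second.fasta")` (a hypothetical file name), so only one record of the large file is held in memory at a time.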
As the end result, I will plot the sum for each window size as a histogram, with frequency on the y axis and window size on the x axis. Hopefully it will show a single count for the largest window size, indicating that the full first sequence is contained in the second file.
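As a rough idea of the output I am after, the sketch below draws the per-window sums as a plain-text bar chart. `text_histogram` is a hypothetical helper name, the numbers in the usage lines are made up for illustration rather than real results, and the same dictionary could presumably be passed to matplotlib's `plt.bar` for a proper figure.

```python
def text_histogram(sums, width=40):
    """Return one line per window size, with a bar scaled to the largest sum."""
    peak = max(sums.values(), default=0)
    lines = []
    for k in sorted(sums):
        # Scale each bar relative to the largest sum; no bars if all are zero.
        bar = "#" * (round(width * sums[k] / peak) if peak else 0)
        lines.append(f"{k:>6} | {bar} {sums[k]}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Made-up sums keyed by window size, purely for illustration.
    example_sums = {1: 8, 2: 4, 3: 2, 4: 1}
    print(text_histogram(example_sums, width=16))
```

A single `#` on the last row would be the signal I am hoping for: exactly one match at the full length of the first sequence.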
I've created two small test files, but can't seem to get it right. I am new to Python and would appreciate any help. Is the best approach to modify Biopython's nucleic acid dot plot procedure? (http://biopython.org/DIST/docs/tutorial/Tutorial.html) (section 18.2.3)