I am attempting to compare two genomes from the same individual and find the small differences between them. My idea is to write a script that finds small windows (~15-20 bp) that are present in one genome (let's call it abnormal) but not in the other (normal). The program will break each genome into ~10-20 bp windows using a sliding window. The windows go into two databases, one for windows from a normal cell genome and one for windows from an abnormal cell genome. Each database is keyed by the window sequence itself, and each key links to every position in the genome where that window is found.
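As a minimal sketch of the windowing step (Python here for illustration rather than Perl; the function name `index_windows` and the in-memory dict standing in for the database are assumptions, and a real genome would not fit in a structure like this):

```python
from collections import defaultdict


def index_windows(sequence, k):
    """Map each k-bp window to the list of 0-based positions where it starts."""
    index = defaultdict(list)
    # Slide a window of width k one base at a time across the sequence.
    for start in range(len(sequence) - k + 1):
        index[sequence[start:start + k]].append(start)
    return index


# Toy sequence, not real data.
toy_index = index_windows("ACGTACGT", 4)
print(dict(toy_index))
```

In this toy case the window `ACGT` starts at positions 0 and 4, and every other window occurs once.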
The idea is to do the following:
- For each window in the abnormal database, see if it is also present in the normal database. If it is, delete it.
- Return all windows that remain in the abnormal database, prioritizing those with the highest number of occurrences.
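The two steps above can be sketched on toy data as follows. This is only an illustration in Python with plain dicts standing in for the two databases; the name `abnormal_only` and the example windows are made up, and building a filtered copy is used here in place of deleting rows in place.

```python
def abnormal_only(normal_index, abnormal_index):
    """Return (window, positions) pairs found only in the abnormal index,
    ordered from most to fewest occurrences."""
    # Step 1: drop every abnormal window that also appears in the normal index.
    remaining = {
        window: positions
        for window, positions in abnormal_index.items()
        if window not in normal_index
    }
    # Step 2: rank what is left by how many times each window occurs.
    return sorted(remaining.items(), key=lambda item: len(item[1]), reverse=True)


# Hypothetical toy indexes: window -> list of start positions.
normal = {"ACGT": [0, 4], "CGTA": [1]}
abnormal = {"ACGT": [0], "GGGA": [3, 9, 15], "TTTC": [7]}

result = abnormal_only(normal, abnormal)
print(result)
```

Here `ACGT` is discarded because it is in both indexes, and `GGGA` is ranked ahead of `TTTC` because it occurs three times.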
I have done this for small test data. For real human genomes, there will be about 6 billion windows per database. Right now I can think of two ways of handling this:
- Install the OS and MySQL on a 1 TB SSD and use a script to populate the databases directly. Use a normal RAID for the mass storage needed for the original genomes. Use database queries to compare the genomes.
- Install the OS on a 1 TB SSD, with around 700 GB for swap. Use the RAID for mass storage of the genomes and for a MySQL database holding the results of the comparison. Instead of comparing databases within MySQL, implement simple hash tables in a Perl script and let the OS go to swap as the structure builds. Do the comparisons within the Perl script and dump the results to the MySQL database on the RAID.
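To get a feel for the size of the in-memory option, here is a rough Python sketch. It assumes each window is packed at 2 bits per base (a 20 bp window then fits in 40 bits, so in one 64-bit integer), that each occurrence stores one 8-byte key and one 8-byte position, and that windows contain only A, C, G and T. All of these are assumptions for the estimate, and the figure ignores any hash table overhead, which in Perl or Python would be considerably larger.

```python
# 2-bit code per base; windows containing N or other symbols are not handled.
CODES = {"A": 0, "C": 1, "G": 2, "T": 3}


def pack_window(window):
    """Pack a window of A/C/G/T into a single integer, 2 bits per base."""
    value = 0
    for base in window:
        value = (value << 2) | CODES[base]
    return value


# Back-of-the-envelope size for one genome under the assumptions above.
windows_per_genome = 6_000_000_000   # figure from the question
bytes_per_entry = 8 + 8              # packed key + position (assumed layout)
raw_bytes = windows_per_genome * bytes_per_entry

print(pack_window("ACGT"))
print(raw_bytes / 1e9, "GB before any hash table overhead")
```

Under these assumptions the raw data alone comes to roughly 96 GB per genome, so the second option would depend heavily on swap.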
What would be the best way to handle this? Is there an existing solution out there that does exactly what I'm trying to do?
Thank you in advance!