Hello. I have a BED file of about 90 MB, and I need to find the overlaps between it and multiple other BED files (about 800 MB in total), each containing sequences, using Python. I have enough processing power, but I need to simplify this process. I suspect an interval tree is a good choice and found this: https://pypi.python.org/pypi/intervaltree_bio but I could not get any further.
I have data for about 60 cell lines, each with a BED file of about 6-10 MB. I have a directory containing subdirectories named after the .bed and .pk (peak) files, and each of these subdirectories has one .bed file.
Could anyone give me specific advice on how to do this task? Thank you very much.
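To make the directory layout concrete, here is a minimal sketch of how the per-cell-line files could be collected, assuming one top-level directory whose subdirectories each hold exactly one .bed file. The function name `collect_bed_files` and the `root` argument are made up for illustration.

```python
from pathlib import Path


def collect_bed_files(root):
    """Map each subdirectory name under `root` to the single .bed file inside it."""
    bed_files = {}
    for subdir in sorted(Path(root).iterdir()):
        if not subdir.is_dir():
            continue
        beds = sorted(subdir.glob("*.bed"))
        # Assumption: exactly one .bed file per subdirectory; skip anything else.
        if len(beds) == 1:
            bed_files[subdir.name] = beds[0]
    return bed_files
```

Subdirectories with zero or several .bed files are skipped silently here; raising an error instead may be preferable if the layout is supposed to be strict.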
Example entries from the main .bed file:
chr20 30053341 30053368 DEFB124 70.6955419 +
chr20 30053397 30053424 DEFB124 63.90851928 +
Example entries from a cell line .pk file:
chr1 713835 714424 chr1.1 1000 . 0.1621 10.6 -1 253
chr1 752775 753050 chr1.2 567 . 0.0365 2.09 -1 124
Example entries from a cell line .bed file:
chr1 91425 91575 id-4576 9
chr1 714005 714155 id-35705 186.000000
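
For illustration, a minimal sketch of what the overlap search could look like using only the standard library. This is not the intervaltree_bio API: it swaps in a per-chromosome sorted list plus `bisect` in place of an interval tree. The function names (`parse_bed_lines`, `build_index`, `find_overlaps`) are made up for this sketch, fields are assumed to be whitespace-separated, and coordinates are assumed to be 0-based half-open as in the BED format. The two indexed lines are the .pk examples above, and the query is the second cell line .bed example.

```python
import bisect
from collections import defaultdict


def parse_bed_lines(lines):
    """Yield (chrom, start, end, rest) from whitespace-separated BED-like lines."""
    for line in lines:
        fields = line.split()
        # Skip blank lines, comments, and common header lines.
        if len(fields) < 3 or fields[0] in ("track", "browser") or fields[0].startswith("#"):
            continue
        yield fields[0], int(fields[1]), int(fields[2]), fields[3:]


def build_index(intervals):
    """Group intervals by chromosome, sort by start, and record the longest length."""
    by_chrom = defaultdict(list)
    for chrom, start, end, rest in intervals:
        by_chrom[chrom].append((start, end, rest))
    index = {}
    for chrom, items in by_chrom.items():
        items.sort(key=lambda item: (item[0], item[1]))
        starts = [item[0] for item in items]
        max_len = max(end - start for start, end, _ in items)
        index[chrom] = (starts, items, max_len)
    return index


def find_overlaps(index, chrom, start, end):
    """Return indexed intervals that overlap the half-open range [start, end)."""
    if chrom not in index:
        return []
    starts, items, max_len = index[chrom]
    # An overlapping interval must start before `end`, and because no interval
    # is longer than max_len, it must also start after `start - max_len`.
    lo = bisect.bisect_right(starts, start - max_len)
    hi = bisect.bisect_left(starts, end)
    return [item for item in items[lo:hi] if item[1] > start]


peak_lines = [
    "chr1 713835 714424 chr1.1 1000 . 0.1621 10.6 -1 253",
    "chr1 752775 753050 chr1.2 567 . 0.0365 2.09 -1 124",
]
index = build_index(parse_bed_lines(peak_lines))
hits = find_overlaps(index, "chr1", 714005, 714155)
print([rest[0] for _, _, rest in hits])  # prints ['chr1.1']
```

With this layout, each cell line file would be indexed once and every line of the main .bed file used as a query. The main-file examples above are on chr20, so they return no hits against these chr1 peaks.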