I guess this is more of an algorithm question. I have a group of large bedgraph files (see below; each contains every single position of the genome), each about 20-30 GB in size. I need to look through these files and find the lines whose locations match those in another, much smaller file (< 100 kb; e.g. test.txt, see below). How can I do this efficiently?
I read posts such as https://stackoverflow.com/questions/6219141/searching-for-a-string-in-a-large-text-file-profiling-various-methods-in-pytho, but they address the general case.
I think in my case I should take advantage of the fact that my files are sorted: each time I use a line in test.txt to locate a line (say line X) in the bedgraph, everything above line X is no longer relevant for later queries. I have a rough draft of how to write this in Python (see the sketch after the example below), but I would also like to hear your suggestions.
I don't mind calling other languages from my Python code. If existing software already does something similar, please let me know as well. Thank you very much.
#largeFile.bedgraph (0 system)
chrM    0    1    5183
chrM    1    2    5299
chrM    2    3    5344
chrM    3    4    5439
chrM    4    5    5525
chrM    5    6    5579
...
chr19   3    4    6
chr20   4    5    35

#test.txt (1 system)
chrM    3
chrM    4
chr20   5
I would like to output:
chrM    3    4    5439
chrM    4    5    5525
chr20   4    5    35
because these are the records that match test.txt.
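For reference, here is a rough sketch of my current draft: a single merge-style pass that never re-reads a bedgraph line, so the big file is streamed once no matter how many queries there are. It assumes both files list chromosomes in the same order, that queries within a chromosome are sorted by position, and that every query chromosome actually occurs in the bedgraph; the handling of the 0-based/1-based offset is a guess, so it may need adjusting for your data.

import sys


def match_sorted(bedgraph_path, query_path, out=sys.stdout):
    """Stream a sorted bedgraph once, matching sorted (chrom, pos) queries.

    Bedgraph lines are treated as half-open intervals [start, end).
    """
    with open(bedgraph_path) as bg, open(query_path) as queries:
        fields = None  # current bedgraph record, split into columns
        for line in queries:
            chrom, pos = line.split()
            # NOTE: test.txt is 1-based and the bedgraph 0-based; depending
            # on the intended convention you may need `int(pos) - 1` here.
            pos = int(pos)
            while True:
                if fields is None:
                    raw = bg.readline()
                    if not raw:
                        return  # bedgraph exhausted
                    fields = raw.split()
                c, start, end = fields[0], int(fields[1]), int(fields[2])
                if c == chrom and start <= pos < end:
                    # Keep this record: the next query may hit it too.
                    out.write("\t".join(fields) + "\n")
                    break
                if c == chrom and pos < start:
                    break  # position not covered; move to the next query
                # Record is behind the current query; discard it for good.
                fields = None


if __name__ == "__main__":
    match_sorted(sys.argv[1], sys.argv[2])

Saved as match.py (a name I made up), it would run as: python match.py largeFile.bedgraph test.txt > hits.txt. Since anything above the current bedgraph line is discarded permanently, memory stays constant and the 20-30 GB file is read exactly once.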