This is a bit conceptual, but say I wanted to compare human and mouse, looking at human chr1 and chr2 against mouse chr1 and chr2. Plotting the matches between the two genomes would look like this:


                         human
                 chr1        chr2
               +-----------+----------+
               |  XXX      |          |
          chr1 |    X      |          |
               |         X |    X     |
mouse          |           |          |
               |           |     X    |
          chr2 |      X    |          |
               |           |   X      |
               |           |          |


If I have data in something like a BEDPE file, naively, I could go over all the matches in the BEDPE file and see:

  • If it matches human chr1 or human chr2, then see if it also matches mouse chr1 or chr2, and if so, emit it

Or in two steps:

  • Query for lines that match human chr1, then keep those that also match mouse chr1 or chr2

  • Query for lines that match human chr2, then keep those that also match mouse chr1 or chr2

This makes it clearer that a query on a human coordinate also needs to carry the full set of mouse ranges of interest.

In this scenario I also don't need to query in the mouse "direction"; going from the human direction is sufficient.

This seems easy enough, but I am trying to consider more efficient options too, maybe ones where I don't have to load the whole file into memory.
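For reference, the naive single pass can already stream the file line by line, so nothing needs to be held in memory. A minimal sketch, assuming tab-separated BEDPE with the human chromosome in column 1 and the mouse chromosome in column 4 (the `filter_pairs` name and its comma-separated arguments are made up for illustration):

```shell
# Hypothetical helper: keep BEDPE lines whose human chromosome (column 1)
# and mouse chromosome (column 4) are both in the requested sets.
# Assumes tab-separated input on stdin.
filter_pairs() {
  awk -F'\t' -v human="$1" -v mouse="$2" '
    BEGIN {
      # Turn the comma-separated lists into lookup tables.
      n = split(human, h, ","); for (i = 1; i <= n; i++) want_h[h[i]] = 1
      m = split(mouse, q, ","); for (i = 1; i <= m; i++) want_m[q[i]] = 1
    }
    # Print a line only when both ends fall on a requested chromosome.
    ($1 in want_h) && ($4 in want_m)
  '
}
```

It would be called as `filter_pairs "chr1,chr2" "chr1,chr2" < input.bedpe > final`. This still reads every line once, which is the cost an index is meant to avoid.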

One idea I had involves tabix indexing. Instead of a single file, I sort the data twice and build two tabix-indexed files, one keyed on the human coordinates and one keyed on the mouse coordinates:

sort -k1,1 -k2,2n input.bedpe | bgzip > input.human.bedpe.gz
tabix -0 -s1 -b2 -e3 input.human.bedpe.gz
sort -k4,4 -k5,5n input.bedpe | bgzip > input.mouse.bedpe.gz
tabix -0 -s4 -b5 -e6 input.mouse.bedpe.gz

Then to query, I actually do query it in both directions and intersect the results:

tabix input.human.bedpe.gz chr1 chr2 > human_results
tabix input.mouse.bedpe.gz chr1 chr2 > mouse_results
comm -12 <(sort human_results) <(sort mouse_results) > final

I think this final set of lines would contain my desired output.

This does not seem very efficient, though, because the initial output before the intersection includes things like human chr1 matching mouse chr10, which I don't care about. I could instead filter while I'm streaming the output, so it is more like this:

tabix input.human.bedpe.gz chr1 chr2 | awk '$4 == "chr1" || $4 == "chr2"' > final

This seems like a reasonable query format. It also doesn't seem like I have to query the file in both directions?
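As a sanity check on that last point, here is a toy comparison of the two approaches. It is only a sketch: the rows are made up, and plain `awk` selections stand in for the tabix lookups so it runs without an index.

```shell
# Made-up BEDPE rows: human chrom/start/end, then mouse chrom/start/end.
tmp=$(mktemp -d)
printf 'chr1\t10\t20\tchr1\t5\t15\nchr1\t30\t40\tchr10\t1\t9\nchr3\t1\t2\tchr2\t3\t4\nchr2\t50\t60\tchr2\t7\t8\n' > "$tmp/sample.bedpe"

# Two-direction approach: select by human side, select by mouse side, intersect.
awk -F'\t' '$1 == "chr1" || $1 == "chr2"' "$tmp/sample.bedpe" | sort > "$tmp/human_results"
awk -F'\t' '$4 == "chr1" || $4 == "chr2"' "$tmp/sample.bedpe" | sort > "$tmp/mouse_results"
comm -12 "$tmp/human_results" "$tmp/mouse_results" > "$tmp/two_way"

# One-direction approach: select by human side, then filter on the mouse column.
awk -F'\t' '$1 == "chr1" || $1 == "chr2"' "$tmp/sample.bedpe" \
  | awk -F'\t' '$4 == "chr1" || $4 == "chr2"' | sort > "$tmp/one_way"

# Report whether both approaches selected exactly the same lines.
if cmp -s "$tmp/two_way" "$tmp/one_way"; then echo "same result"; else echo "different"; fi
```

On this sample both routes keep the same two rows, since a line survives the intersection exactly when it passes both the human and the mouse condition.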

Does this seem like a reasonable system? If I had a proper database system, would there be an even better way to do this? Is there any literature, or are there keywords to look for, on topics like this?

