Strategy for 2D interval query?
4.0 years ago
cmdcolin ★ 3.8k

Hi there

This is a bit conceptual, but say I wanted to compare human and mouse, looking at human chr1 and chr2 against mouse chr1 and chr2. This would look like so:

```
                      human

                 chr1        chr2
               +-----------+----------+
               |  XXX      |          |
          chr1 |    X      |          |
               |         X |    X     |
mouse          |           |          |
               +-----------+----------+
               |           |     X    |
          chr2 |      X    |          |
               |           |   X      |
               |           |          |
               +-----------+----------+
```

If I have data in something like a BEDPE file, naively, I could go over all the matches in the BEDPE file and see:

  • If it matches human chr1 or human chr2, then see if it also matches mouse chr1 or chr2, and if so, emit it

Or in two steps

  • Query for lines that match human chr1, and if so, see if it matches mouse chr1 or chr2

  • Query for lines that match human chr2, and if so, see if it matches mouse chr1 or chr2

So that makes it clearer that "a query for a human coordinate also needs the full range of mouse coordinates of interest"

In this scenario I also don't need to query in the mouse "direction", going from the human direction is sufficient

This seems easy enough, but I am trying to consider more efficient options too, maybe where I don't have to load the whole file into memory
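For reference, the naive pass does not have to hold the file in memory if it is done as a stream. A minimal sketch, assuming a tab-separated BEDPE with the human chromosome in column 1 and the mouse chromosome in column 4 (`filter_bedpe` is a hypothetical helper name, not an existing tool):

```shell
# Hypothetical helper: keep BEDPE rows whose human chromosome (column 1)
# and mouse chromosome (column 4) are both in the requested sets.
# $1: comma-separated human chromosomes, $2: comma-separated mouse chromosomes
filter_bedpe() {
  awk -F'\t' -v human="$1" -v mouse="$2" '
    BEGIN {
      # Turn the comma-separated lists into lookup tables
      nh = split(human, h, ","); for (i = 1; i <= nh; i++) want_h[h[i]] = 1
      nm = split(mouse, m, ","); for (i = 1; i <= nm; i++) want_m[m[i]] = 1
    }
    # Print a row only when both sides are of interest
    ($1 in want_h) && ($4 in want_m)
  '
}

# Usage (reads stdin, writes matching rows to stdout):
#   filter_bedpe chr1,chr2 chr1,chr2 < input.bedpe > final
```

This still reads every line once, so it is linear in the file size, but memory use stays constant.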

One idea I had was something like this involving tabix indexing. Instead of a single file, I sort it twice and make two tabix files.

# tabix needs bgzip-compressed input; -s is the chromosome column, -b/-e are the start/end columns
# -0 marks the 0-based half-open coordinates that BEDPE uses
sort -k1,1 -k2,2n input.bedpe | bgzip > input.human.bedpe.gz
tabix -0 -s1 -b2 -e3 input.human.bedpe.gz
sort -k4,4 -k5,5n input.bedpe | bgzip > input.mouse.bedpe.gz
tabix -0 -s4 -b5 -e6 input.mouse.bedpe.gz

Then to query, I actually query it in both directions

# tabix accepts several regions in one call
tabix input.human.bedpe.gz chr1 chr2 > human_results
tabix input.mouse.bedpe.gz chr1 chr2 > mouse_results
# keep only the lines that appear in both result sets
grep -Fxf mouse_results human_results > final

This final set of lines would contain my desired output, I think.

This does not seem very efficient, though, because the initial output (before the intersection) includes things I don't care about, like human chr1 matching mouse chr10. I could also filter while I'm outputting, so it is more like this

# query the human side, then keep rows whose mouse chromosome (column 4) is chr1 or chr2
tabix input.human.bedpe.gz chr1 chr2 | awk -F'\t' '$4 == "chr1" || $4 == "chr2"' > final

This seems like a reasonable query format. It also doesn't seem like I have to query the file in both directions?
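To make the one-direction idea concrete for actual intervals rather than whole chromosomes, here is a rough sketch: tabix narrows the human axis using the index, and a streaming filter checks the mouse axis. It assumes a human-sorted, bgzip-compressed, tabix-indexed BEDPE (the file name `human.bedpe.gz` in the usage comment is an assumption), mouse coordinates in columns 4-6 as 0-based half-open intervals, and `mouse_overlaps` is a hypothetical helper name.

```shell
# Hypothetical helper: from tab-separated BEDPE rows on stdin, keep those whose
# mouse interval (columns 4-6, 0-based half-open) overlaps chrom:[start, end).
# $1: mouse chromosome, $2: start, $3: end
mouse_overlaps() {
  awk -F'\t' -v c="$1" -v s="$2" -v e="$3" '
    # Two half-open intervals overlap when each one starts before the other ends
    $4 == c && ($5 + 0) < (e + 0) && ($6 + 0) > (s + 0)
  '
}

# Usage: the index handles the human axis, the filter handles the mouse axis:
#   tabix human.bedpe.gz chr1:1000000-2000000 | mouse_overlaps chr2 0 5000000 > final
```

The cost is proportional to the number of rows hit on the human axis, so rows like human chr1 to mouse chr10 are still read and then discarded, but nothing is queried twice and no intersection step is needed.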

Does this seem like a reasonable system? If I had a proper database system, would there be an even better way to do this? Is there any literature, or are there keywords to search for, on topics like this?

bedpe comparative-genomics conceptual • 697 views