Question

comparing two tracks over intervals on the genome

0

Entering edit mode

8.6 years ago

moxu ▴ 510

Suppose I have two groups: treatment & control. Each group has a value assigned to each interval on the genome. Let me explain with the following example:

File A contains data for the treatment group and has the following coordinate sorted format:

chr1 100 110 3
chr1 200 212 8
chr1 220 240 5
chr1 500 600 1000
...

File B contains data for the control group:

chr1 100 108 7
chr1 150 160 14
chr1 210 230 88
chr1 700 850 3
...

And I would like to get results like the following: chr1 100 110 3 7

chr1 150 160 0 14
chr1 200 240 13 88
chr1 500 600 1000 0
chr1 700 850 0 3
...

The 3rd entry is a merge of two tracks on A sided by an entry on B.

Is there a software product to do so?

Thank you!

next-gen alignment software error • 1.9k views

ADD COMMENT • link updated 8.6 years ago by Alex Reynolds 36k • written 8.6 years ago by moxu ▴ 510

score 0 · Answer 1 · 2016-12-15

0

Entering edit mode

8.6 years ago

moxu ▴ 510

Well, thought "bedtools intersect -c" would work, but in effect it does not -- it serves more like a SQL "left join".

ADD COMMENT • link 8.6 years ago by moxu ▴ 510

score 0 · Answer 2 · 2016-12-16

0

Entering edit mode

8.6 years ago

Vitis ★ 2.6k

I think this may be able to do it: standardize the windows on the left side, instead of using jumping windows, pick a window size and lay them out across the entire genome (such as chr1 1 10, chr1 11 20, chr1 21, 30...), leaving some windows with zero values. Then it might be easier to merge your data as you like.

ADD COMMENT • link 8.6 years ago by Vitis ★ 2.6k

0

Entering edit mode

What a novel idea! Thanks for your input. However, it seems too complicated. What I am planning to do is to do "bedtools intersect -c -a A -b B" & "bedtools intersect -c -a B -b A", and then combine the results somehow.

ADD REPLY • link 8.6 years ago by moxu ▴ 510

0

Entering edit mode

Well, the solution based on bedops below is perfectly fine for this type of problem, I usually create this type of data with seamless chromosome windows for data standardization and documentation purpose. Somewhere down the road, you may repetitively need these operations to intersect new data. The standardized windows would make your life much easier in the future.

ADD REPLY • link 8.6 years ago by Vitis ★ 2.6k

score 0 · Answer 3 · 2016-12-17

Via BEDOPS:

Merge sets A and B with bedops:

$ bedops --merge A B > AB.merge

Map set A to the merged set and sum the ID values with bedmap and awk:

$ bedmap --echo-map-id --multidelim '\t' AB.merge A | awk '{if(NF==0){print 0;}else{s=0;for(i=1;i<=NF;i++){s+=$i;}print s;}}' > A.merge.sum

Repeat with set B:

$ bedmap --echo-map-id --multidelim '\t' AB.merge B | awk '{if(NF==0){print 0;}else{s=0;for(i=1;i<=NF;i++){s+=$i;}print s;}}' > B.merge.sum

Paste the AB merged set to the A and B merge-sums with paste:

$ paste AB.merge A.merge.sum B.merge.sum
chr1    100 110 3       7
chr1    150 160 0       14
chr1    200 240 13      88
chr1    500 600 1000    0
chr1    700 850 0       3