Any way to merge peak overlap hit table more efficiently ?
3
0
Entering edit mode
4.6 years ago

Hi: I ran into bug when merging different peak overlap hit table, couldn't locate the issue efficiently. However, I used three bed files for finding overlap, so I implement function that return different index table. The point is, pattern of each index table is different from one to another. I come up solution to solve this issues, certainly it works for further meta process that I am gonna proceed, but when validating the output of my implementation with existing software tools, and final output for first bed files is perfectly matched while second, third bed files' result remain some difference (where each GRanges object will yield multiple output through several filtering process, and three bed files must be processed by parallel). How can I efficiently merge hit index table as I expected ? Any way to debug the issue?

Here is reproducible example:

idxTb_1 <- list(
foo = IntegerList(1,2,3,4,5,6,7,8,9,10),
bar = IntegerList(1,2,3,4,5,6,integer(0),7,integer(0),8),
cat = IntegerList(1,2,3,4,5,6,integer(0),7,8,10)
)

idxTb_2 <- list(
bar = IntegerList(1,2,3,4,5,6,7,8,9,10,11),
foo = IntegerList(1,2,3,4,5,6,8,10,11,integer(0),13),
cat = IntegerList(1,2,3,4,5,6,7,10,13,14, integer(0))
)

idxTb_3 <- list(
cat = IntegerList(1,2,3,4,5,6,7,8,9,10),
foo = IntegerList(1,2,3,4,5,6,8,9,integer(0),10),
bar = IntegerList(1,2,3,4,5,6,7,integer(0),integer(0),8)
)


So I come up with this solution:

t1 <- as.matrix(DataFrame(idxTb_1))
t2 <- as.matrix(DataFrame(idxTb_2[names(idxTb_1)]))
t3 <- as.matrix(DataFrame(idxTb_3[names(idxTb_1)]))

idxTb_final <- unique(DataFrame(rbind(t1, t2, t3)))


My approach is not desired, but this at least gives me perfect match result for only first bed files, so I need to figure out the where is issue come from. I suspect issue might come from merging index table. Can anyone give me idea how to address this issues ? what's the efficient way for merging index hit table? Any idea ?

R peak iranges sequencing ChIP-Seq • 1.3k views
3
Entering edit mode
4.6 years ago

If you want efficiency, you could try BEDOPS:

$bedops --merge A.bed B.bed C.bed > answer.bed  Once you have merged intervals, if you need a list of IDs from intervals that overlap with the merged set, you can use bedmap: $ bedmap --echo-map-id-uniq answer.bed <(bedops --everything A.bed B.bed C.bed) > ids.txt

0
Entering edit mode

Dear Alex:

Thanks for your prompt hit. Well, BEDOPS has done good job for detecting overlap regions. But, here is the things, if one big genomic interval in first bed file overlap with several small interval in second bed, so in my specification, I only keep one overlapped interval if multiple intersection happens, while I need roll around this process where small interval against big interval respectively, and I have done this all in R. To ease the computation easier, I intend to merge index table, but it did not yield perfect job for me. Would it possible some script of BEDOPS call in R ? Could you give me some idea to possibly solve my issues? Thanks :)

1
Entering edit mode

You can call system() in R to run commands. See: https://stat.ethz.ch/R-manual/R-devel/library/base/html/system.html

But perhaps I am not understanding what it is you are trying to do. Can you explain in a different way what you are trying to do?

For example, let us say you have three BED files that define intervals for peaks. Are you trying to get IDs of those peaks, where peaks from all three files overlap? Or are you trying to do something else?

Perhaps sketch or draw out what you are trying to do with some example peaks.

0
Entering edit mode

Dear Alex:

I sketch up the issue on the drafts here, which might explain the background problem easier. While here is the function I implemented and it works well, even I got correct result when validating with set of bed files. I set up several filtering to validate and save weakly enriched peaks through the evidence of overlaped peaks.

apparently, I need to have better idea to improve the code. Could you give me quick solution based on the above scripts and sketched issue I presented ? Thank you very much :)

3
Entering edit mode
4.6 years ago

I'm not really an R or Granges expert. But I'll describe how you can do the set operations you want to do with BEDOPS tools, which should run quickly.

(I have split my answer into two posts, because of word limits on a single post.)

Let's say the three rows in this figure are BED files called A.bed, B.bed and C.bed. As shorthand, we'll also call these sets A, B and C, containing their respective intervals. Set A has interval A1, A2, A3 and A4. And so on.

Further, let's assume these follow UCSC specification, and the p-value of each element in the BED file for A, B and C is in the file's fifth column, the so-called "score" or "signal" column.

It doesn't matter what you put in the fourth column (ID column), but the score needs to be in the fifth column.

It looks like you want to find the answer to the following questions:

1. What is the smallest p-value interval that overlaps between elements in A and elements in A, B and C?
2. What is the smallest p-value interval that overlaps between elements in B and elements in A, B and C?
3. What is the smallest p-value interval that overlaps between elements in C and elements in A, B and C?

You can use bedmap in BEDOPS to answer this question. This tool maps or makes associations between intervals between two sets based on their overlap. Then you can run operations on those mapped elements.

In this case, you can do score operations with bedmap that give you the lowest scoring element between all mapped overlaps, using the --min-element operation.

• What is the smallest p-value of intervals that overlap between elements in A and elements in all sets A, B and C?

First, we take the union of sets A, B and C; we'll call this unionABC:

$bedops --everything A.bed B.bed C.bed > unionABC.bed  Second, we run bedmap to map set A against unionABC. We add the --min-element operand to get the lowest scoring element that overlaps each of intervals A1, A2, A3 and A4, respectively: $ bedmap --echo --min-element A.bed unionABC.bed > answer_A_vs_unionABC.bed


Let's look at the file answer_A_vs_unionABC.bed line by line. There are four lines, one for each element in A: A1 through A4.

For A1, the elements that overlap A1 from sets A, B and C are: {A1, B1, B2, B3, C1, C2}. We include A1, because an element always overlaps itself. The lowest-scoring element among all of {A1, B1, B2, B3, C1, C2} is A1, with a p-value of 1e-08.

Likewise, for A2, the elements that overlap A2 from sets A, B and C are: {A2, B4, B5, C3, C4}. The lowest-scoring element from this set is A2, with a p-value of 1e-06.

For A3, the elements that overlap A3 from sets A, B and C are: {A3, B6}. The lowest-scoring element is A3: 1e-09.

Finally, for A4, the overlaps are {A4, B8, C6}. The lowest-scoring element is C6: 1e-08.

We can repeat this for the second question:

• What is the smallest p-value of intervals that overlap between elements in B and elements in A, B and C?

We reuse unionABC.bed, but instead of comparing A against this file, we compare B against the union:

$bedmap --echo --min-element B.bed unionABC.bed > answer_B_vs_unionABC.bed  For element B1, its overlaps are {A1, B1, C1}. The lowest-scoring element is A1: 1e-08. B2 overlaps {A1, B2, C2}. Again, the lowest-scoring element is A1: 1e-08. B3 overlaps {A1, B3, C2}. Again, the lowest-scoring element is A1: 1e-08. B4 overlaps {A2, B4, C3}. The lowest-scoring element is A2: 1e-06. And so on. ADD COMMENT 3 Entering edit mode 4.6 years ago It's the same process for answering the third question: $ bedmap --echo --min-element C.bed unionABC.bed > answer_C_vs_unionABC.bed


C1 maps to {A1, B1, C1}. The lowest-scoring element is A1: 1e-08.

Etc.

Each line of the answer_* files has two elements in it. The element from the reference set (A, B or C) and the minimum-scoring element from the union set of ABC. These two elements are separated by a pipe (|) character.

You can pass these files to awk to filter them for p-values of interest ("weakly-enriched"), using the pipe character to split the two elements and test if the score column in the second element is between your p-value threshold:

$awk '\ function parse_en(eng_num) { \ split(eng_num,en,/[eEdD]/); return en[1]*10**en[2]; \ } \ { \ n=split($0, elems, "|"); \
m=split(elems[2], minScoreElem, "\t"); \
pVal=parse_en(minScoreElem[5]); \
if ((pVal >= 0.000001) && (pVal <= 0.01)) { \
print \$0; \
} \


(Awk function to parse scientific notation from here.)

Or you could use a Perl or Python script to do the same — you could even read the file into R and process it there with string parsing functions, though I wouldn't recommend it.

Once you have filtered results for each set, you could figure out what set operations or statistical test you would want to run next to count elements from these results, and then determine if an element is weakly-interacting (or not).

BEDOPS runs fast because it requires that BED files are sorted. Other apps are starting to use sorting to get performance improvements, but I don't know if Granges now uses sorting to make its operations run quickly. In any case, to answer this particular question, BEDOPS bedmap and awk should run very quickly for you.