Question

Determine the significance of peak overlaps (bedtools)

3.1 years ago

kstangline ▴ 80

Hello,

I'm trying to determine the significance of eCLIP (similar to ChIP-seq, but RNA) peak overlaps.

I have used bedtools intersect to find how many overlaps there are between two bed files.

bedtools intersect -u -a file1.bed -b file2.bed -wa | wc -l

I thought about using bedtools fisher with the following command.

bedtools fisher -a file1.bed -b file2.bed -g genomeFile.bed

However, I'm getting an inflated number of overlaps compared to bedtools intersect. If I remove the -u option from bedtools intersect, I get the same number of overlaps. I see no option in the bedtools fisher documentation to replicate -u, which does the following:

"Write original A entry once if any overlaps found in B. In other words, just report the fact at least one overlap was found in B. Restricted by -f and -r."

Am I doing something wrong here? Should I shuffle the bed file?

bedtools peak ChIP-seq eCLIP • 1.2k views

updated 3.1 years ago by Istvan Albert 100k • written 3.1 years ago by kstangline ▴ 80

score 0 · Answer 1 · 2021-03-30

To do the test you need to account for possible multiple overlaps. The rationale of the statistical test is to tell you how likely it is to observe the overlap by chance alone. For that, it needs you to count all the overlaps that are possible.

As an extreme example, imagine that file a contains a single long interval that covers the entire genome and therefore covers every interval in b. Is your desired outcome to claim that this interval overlaps just a single feature in b, and that it should be counted as 1?

Working out these extreme scenarios will lead you to a better understanding of how to formulate your questions.
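The extreme scenario above can be worked through on toy data. The sketch below does not call bedtools: it uses a naive all-pairs overlap in awk as a stand-in for bedtools intersect, and the coordinates, the temporary files, and the overlap_pairs helper are all made up for illustration. It shows why one long interval in a yields several a-b pairs but only one a entry when counted the way -u counts.

```shell
# Toy data (hypothetical coordinates): one long A interval spanning three B intervals.
a_bed=$(mktemp)
b_bed=$(mktemp)
printf 'chr1\t0\t1000\n' > "$a_bed"
printf 'chr1\t100\t200\nchr1\t300\t400\nchr1\t500\t600\n' > "$b_bed"

# Naive all-pairs overlap, standing in for `bedtools intersect -wa -wb`.
# Half-open BED intervals overlap when a_start < b_end and b_start < a_end.
overlap_pairs() {
  awk -F'\t' -v OFS='\t' '
    NR == FNR { chrom[NR] = $1; start[NR] = $2 + 0; stop[NR] = $3 + 0; n = NR; next }
    {
      for (i = 1; i <= n; i++)
        if (chrom[i] == $1 && start[i] < $3 + 0 && $2 + 0 < stop[i])
          print chrom[i], start[i], stop[i], $1, $2, $3
    }
  ' "$1" "$2"
}

# Every A-B pair is one line (what intersect reports without -u).
pairs=$(overlap_pairs "$a_bed" "$b_bed" | wc -l | tr -d ' ')
# Each A entry is counted once if it has any hit (what -u reports).
unique_a=$(overlap_pairs "$a_bed" "$b_bed" | cut -f1-3 | sort -u | wc -l | tr -d ' ')

echo "A-B pairs: $pairs"                               # prints "A-B pairs: 3"
echo "A entries with at least one hit: $unique_a"      # prints "A entries with at least one hit: 1"
```

The gap between the two numbers is the same gap described in the question between bedtools intersect with and without -u.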

Finally, if you want to replicate a behavior, you might need to make a new BED file that has the same effect. For example, run bedtools intersect without -u, take the first hits, and make another b file, let's call it b', that contains only the intervals produced by the -u flag. Now run your fisher test on that. Still, make sure the result makes sense.
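One possible reading of that suggestion, sketched in shell. This is an interpretation, not a confirmed recipe: it assumes both inputs are plain 3-column BED files, so that in `bedtools intersect -wa -wb` output columns 1-3 are the a entry and columns 4-6 are the b entry. The helper name first_hits and the file names b_prime.bed and genome.txt are hypothetical.

```shell
# first_hits: read `bedtools intersect -wa -wb` output on stdin and keep,
# for each distinct A interval (columns 1-3), only the first B interval
# (columns 4-6) it was paired with. Assumes 3-column BED inputs.
first_hits() {
  awk -F'\t' -v OFS='\t' '!seen[$1 FS $2 FS $3]++ { print $4, $5, $6 }'
}

# Assumed usage (requires bedtools; file names are placeholders):
#   bedtools intersect -a file1.bed -b file2.bed -wa -wb \
#     | first_hits | sort -k1,1 -k2,2n -u > b_prime.bed
#   # bedtools fisher expects position-sorted input and a genome (chromosome sizes) file
#   bedtools fisher -a file1.bed -b b_prime.bed -g genome.txt
```

Whether a test on b' answers the biological question is a separate matter: dropping the extra b intervals changes what is being counted, so the resulting p-value should be checked against a case where the expected answer is known.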