Calculating fold-enrichment of ChIP-seq peaks intersecting with promoters (vs. genome average)
6
18
Entering edit mode
7.0 years ago
chemcehn ▴ 210

Hello,

I need to calculate, for my BED file of ChIP-seq peaks, the fold-enrichment of genomic regions containing or intersecting promoters (versus the genome average). How can I do this?

I have tried GENOMATIX, which has this option, but it seems to calculate fold-enrichment incorrectly (it does not take into account the different lengths of my genomic regions). I have also tried GREAT and CHIP-ENRICH, but what they both calculate is something like the distribution of distances from each peak's midpoint to the nearest gene, which is not what I need.

Do you know any software to do this task?

PS. I do not need to calculate read enrichment. I need to calculate the enrichment of the peaks in my BED file that intersect with promoters, versus the number of peaks of this size that would intersect with promoters by chance, based on the genome-average probability of encountering a promoter.

Like this: XX% of my peaks intersect with promoters. By chance, one would expect YY% of peaks of this length to intersect with promoters; therefore, promoters are enriched XX/YY-fold compared with chance. So, I am looking for a solution to this problem. I am pretty sure it is already implemented in some software, please advise where!
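As a back-of-the-envelope illustration of that XX/YY arithmetic (all numbers and names below are hypothetical, not from any specific tool): a random interval of length L overlaps a promoter exactly when its start falls within the promoter extended upstream by L - 1 bp, so YY can be approximated by the genome fraction covered by the extended promoters.

```python
def fold_enrichment(n_overlapping, n_peaks,
                    promoter_bp, n_promoters, genome_bp, peak_len):
    """Sketch of the XX/YY fold-enrichment arithmetic.

    YY is approximated as the genome fraction covered by promoters
    each extended upstream by peak_len - 1 bp; overlaps between
    extended promoters are ignored, which slightly overestimates YY.
    """
    observed = n_overlapping / n_peaks                                    # XX
    expected = (promoter_bp + n_promoters * (peak_len - 1)) / genome_bp   # YY
    return observed / expected

# Hypothetical example: 400 of 1000 peaks (500 bp each) hit promoters;
# 30,000 promoters covering 60 Mb of a 3 Gb genome.
fe = fold_enrichment(400, 1000, 60_000_000, 30_000, 3_000_000_000, 500)
```

This is only the expectation under a uniform-genome null; the permutation-based tools discussed below replace that crude approximation with an empirical null distribution.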

Thanks

sequencing next-gen ChIP-Seq genome gene • 14k views
16
Entering edit mode
7.0 years ago
Ryan Dale 5.0k

There are several existing options:

Also check out The dilemma of choosing the ideal permutation strategy while estimating statistical significance of genome-wide enrichment which gives a lot of good insight and has a lot of references to read up on.

For what it's worth, I've tried many of these (haven't tried GAT or regioneR) and have settled on IntervalStats. Since it's an asymmetric metric, it has the potential to give more biological insight. And hopefully it avoids bias from permutation strategy choice.

(edit: added GAT, and clarified that I haven't tried it yet)

(edit 2: added bedtools fisher as an interval-based alternative to the bp-based bedtools jaccard, as per @brentp's comment on another answer)

1
Entering edit mode

Nice suggestions, Ryan. @brentp and I have been discussing the IntervalStats approach from Chikina and Troyanskaya, as I have been interested in efficient ways to calculate the denominator of their P-value that is computed for each interval. Because I am lazy, how efficient is the existing implementation? Have you tried to implement this in pybedtools?

1
Entering edit mode

I haven't tried to implement it in pybedtools yet. It's been a while since I've run it, but I remember that one pairwise comparison took about as long as a 1000-iteration permutation test in pybedtools. I haven't looked closely at the actual implementation though, so I have no idea if/how much it can be improved.

1
Entering edit mode

Also, GAT has been suggested as another approach that uses simulation:

0
Entering edit mode

Thanks, didn't know about GAT. I added it to the list in the answer, and I'll have to try it out.

0
Entering edit mode

Thanks for plugging poverlap. It needs some love, but it is pretty flexible and can handle a lot of different null models. I wrote it while we were working on the paper you linked to.

0
Entering edit mode

@Ryan, Thank you very much for such an extensive list! I will try bedtools jaccard first.

@Aaron, Is there a way to call promoters for a given genome through bedtools, or do I need to create a file with promoters myself?

5
Entering edit mode
7.0 years ago

EDIT: After the OP edited his/her question my answer is no longer relevant; anyway, in case someone is interested, here it is:

I have a script localEnrichmentBed.py which might be useful for you. It calculates the enrichment (as log2 fold change) and significance (p value from Fisher test) in target regions, promoters in your case, relative to the regions flanking the targets. So it doesn't use the genomic background but a local background.

I don't know what you want to do downstream, but using the whole genome as background (effectively normalizing for library size) is going to give biased results, since read coverage can be quite uneven.

This is from the program's help:

localEnrichmentBed.py -h
usage: localEnrichmentBed.py [-h] --target TARGET --bam BAM --genome GENOME
[--slop SLOP] [--blacklist BLACKLIST]
[--tmpdir TMPDIR] [--keeptmp] [--verbose]
[--version]

DESCRIPTION
Compute the read enrichment in target intervals relative to local background.

Typical use case: A ChIP-Seq experiment on a sample returns a number of regions
of enrichment. We want to know how enriched these regions are in a *different*
sample. Note that enrichment is quantified relative to the local background
not relative to an input control.


OUTPUT:
bed file with header and columns:
1. chrom
2. start
3. end
4. targetID
5. flank_cnt
6. target_cnt
7. flank_len
8. target_len
9. log10_pval
10. log2fc

EXAMPLE
localEnrichmentBed.py -b rhh047.bam -t rhh047.macs_peaks.bed -g genome.fa.fai -bl blacklist.bed > out.bed

## Using pipes:
samtools view -u rhh047.bam chr18 \
| localEnrichmentBed.py -b - -t rhh047.macs_peaks.bed -g genome.fa.fai > out.bed

Useful tip: Get genome file from bam file:

samtools view -H rhh047.bam \
| grep -P "@SQ\tSN:" \
| sed 's/@SQ\tSN://' \
| sed 's/\tLN:/\t/' > genome.txt

REQUIRES:
- bedtools suite
- numpy, scipy

NOTES:
For PE reads, the second read in pair is excluded (by samtools view -F 128)
since coverageBed double counts pairs.

optional arguments:
-h, --help            show this help message and exit
--target TARGET, -t TARGET
Target bed file where enrichment is to be computed.
Use - to read from stdin.

--bam BAM, -b BAM     Bam file of the library for which enrichment is to be
computed. Use - to read from stdin.

--genome GENOME, -g GENOME
A genome file giving the length of the chromosomes.
A tab separated file with columns <chrom> <chrom length>.
NB: It can be created from the header of the bam file (see tip above).

--slop SLOP, -S SLOP  Option passed to slopBed to define the flanking region (aka background).
If int each target will be extended left and right this many bases.
If float each target is extended left and right this many times its size.
E.g. 5.0 (default) extends each target regions 5 times its length left and right.

--blacklist BLACKLIST, -bl BLACKLIST
An optional bed file of regions to ignore to compute
the local background. These might be unmappable regions with 0-counts which would
inflate the target enrichment.

--tmpdir TMPDIR       Temp dir to use for intermediate files. If not set
python will get one. A subdir will be created here.

--keeptmp             If set, the tmp dir is not deleted at the end of the
job (useful for debugging).

--verbose, -V         Print to stderr the commands that are executed.

--version             show program's version number and exit


4
Entering edit mode
6.6 years ago
bernatgel ★ 3.2k

Although this is an old thread, since it includes a thorough list of tools, I'll add another one we recently released.

If you are using R, you can use regioneR, which will perform the process sketched by @TriS in another answer. It uses a permutation test approach, so it is not the fastest option, but it does not need to rely on modeling, and since it implements different randomization strategies and accepts arbitrary genome masks, the test can be tailored to your specific needs. In addition to testing for the overlap between regions, it can test for the distance, the % of overlap, or actually any custom function you feed to it.

For example, if you have your peaks and promoters in data.frame's you can do something like:

library(regioneR)
pt <- overlapPermTest(A=peaks, B=promoters)
plot(pt)


or, if you want to evaluate the size of your peaks, you can define a function returning the mean height of the peaks overlapping the promoters and use it in your test:

meanHeight <- function(A, B, ...) {
  overlapping.A <- subsetByOverlaps(A, B)
  return(mean(overlapping.A$height))
}
pt <- permTest(A=peaks, B=promoters, evaluate.function=meanHeight, randomize.function=randomizeRegions, ntimes=100)
plot(pt)


0
Entering edit mode

bernatgel,

Since you mention regioneR: is there a discussion forum where users can contact the regioneR developers? I like the permutation strategy, but it is extremely slow. I have been running the randomizeRegions() function for more than 10 min [not finished yet] with no result coming up. The GRanges object has >70,000 regions. If just one iteration of region randomization takes this much time, I don't think it is feasible to do even 10 permutations, and that is not enough to have any statistical power... Or is it possible that the function is simply stuck without me even realizing it? As a test I ran randomizeRegions() on a GRanges object of 10 regions, and it yielded a result relatively quickly.

0
Entering edit mode

You can find the email address of the package maintainer (in this case, me) on the package page at http://www.bioconductor.org/packages/regioneR. You could also write a message in Bioconductor's support forum (https://support.bioconductor.org/) with a "regioneR" tag on it. Regarding your question: we had a performance problem with the randomization process, and it could take quite long to randomize large region sets. With the latest version in Bioconductor devel (to be released in October 2015) we have improved the performance of the randomization process 10 to 100x, depending on the situation. To speed things up you could also try circularRandomizeRegions, which is MUCH faster, or set non.overlapping=FALSE in randomizeRegions, which also speeds up the computation quite a bit. If you have any other problem, you can write an email and we'll try to help you use regioneR.

2
Entering edit mode
7.0 years ago
TriS ★ 4.5k

A pretty precise way of doing it is to bootstrap random genomic regions and get the % that overlap with TSS regions (edit: it's pretty similar to the approach Ryan wrote; if you have fun coding, go for it). To do this you:

- take the size of your set i.e. 100 regions with your xx% intersection

- randomly sample from the genome groups of 100 regions of the same sizes, with replacement (bedtools allows you to do this)

- calculate the % of those that overlap and save value in array

- repeat 1,000 or 10,000 times

- plot the distribution of these (expected) values vs. the xx% intersection from your initial regions (a histogram for the expected values and one vertical line for the observed value)

- calculate the empirical p-value as: ((number of elements in simulated data >= xx%) + 1) / ((number of samplings) + 1) (the +1 avoids an empirical p-value of exactly zero)

to get your "fold enrichment", divide xx% by the average of the simulated data

or you could just use the hypergeometric distribution, but then you won't have the distribution of the simulated data
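The last two steps above can be sketched in a few lines of Python (a sketch only; the function name is made up, and the simulated fractions would come from your bedtools resampling):

```python
import numpy as np

def empirical_enrichment(observed_frac, simulated_fracs):
    """Empirical p-value and fold enrichment from a permutation null.

    observed_frac   : fraction of real peaks overlapping promoters (xx%)
    simulated_fracs : overlap fractions from the random region sets
    """
    sim = np.asarray(simulated_fracs, dtype=float)
    # +1 in numerator and denominator so the p-value is never exactly 0
    pval = (np.count_nonzero(sim >= observed_frac) + 1) / (sim.size + 1)
    # fold enrichment: observed fraction over the mean of the null
    fold = observed_frac / sim.mean()
    return pval, fold
```

With 10,000 samplings the smallest reportable p-value is 1/10,001, which is one reason to prefer more iterations when the enrichment is strong.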

0
Entering edit mode
7.0 years ago
chemcehn ▴ 210

I did not mean to calculate the read enrichment. I meant to calculate the enrichment of promoters in my reads. Like this:

XX% of my reads intersect with promoters. By chance, one would expect YY% of reads of this length to intersect with promoters; therefore, promoters are enriched XX/YY-fold compared with chance. I am looking for a solution to this problem.

0
Entering edit mode
7.0 years ago
chemcehn ▴ 210

I have tried bedtools jaccard; it is great, except that it does not do exactly what I need. It calculates the enrichment of base pairs shared between my BED file and the BED file with promoters.

However, what I need is the enrichment of regions in my BED file intersecting (having at least one bp of overlap) with promoters. Could you please advise on this? Thanks!
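For the interval-level count itself, `bedtools intersect -u -a peaks.bed -b promoters.bed | wc -l` reports each peak at most once regardless of how many promoters it touches. The same count can be sketched in plain Python (a naive O(n*m) illustration with hypothetical inputs; use bedtools or an interval tree for real data):

```python
from collections import defaultdict

def count_overlapping_peaks(peaks, promoters):
    """Count peaks that overlap >= 1 bp of any promoter.

    peaks, promoters: iterables of (chrom, start, end) tuples with
    half-open coordinates, as in BED files.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in promoters:
        by_chrom[chrom].append((start, end))
    count = 0
    for chrom, start, end in peaks:
        # half-open intervals overlap iff start < p_end and end > p_start
        if any(start < pe and end > ps for ps, pe in by_chrom[chrom]):
            count += 1
    return count
```

Dividing this count by the total number of peaks gives the observed overlap fraction (the xx% in the permutation recipe above).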

1
Entering edit mode

Try bedtools fisher; as of the latest release it is interval-based.