Question: Whole-genome percentile values in bigWig files
gravatar for rachel
9 months ago by
rachel30 wrote:

Hi all,

I'm looking to compute the k-th percentile value of some bigWig files (ChIP-seq) across the whole genome. For example, I'd like to know the 99.99th percentile value of a bigWig file across the whole genome.

I'm aware of wiggletools histogram (unfortunately does not provide exact percentile values, which I need) and the BEDOPS answer suggested in this question. However, the BEDOPS answer only computes percentile values on a per-chromosome basis, whereas I need whole-genome percentile values. I have tried to modify it to compute percentile values on a whole-genome basis to no avail.

If anyone has any ideas, I would appreciate them!



ADD COMMENTlink modified 9 months ago by Alex Reynolds30k • written 9 months ago by rachel30
gravatar for Alex Reynolds
9 months ago by
Alex Reynolds30k
Seattle, WA USA
Alex Reynolds30k wrote:

Unless I'm missing something about the question, here are a couple possibilities:

Perhaps use a pseudochromosome name for all records:

$ bigWigToBedGraph
$ awk -vFS="\t" -vOFS="\t" '{ $5 = $4; $4 = $1; $1 = "chrPseudo"; print $0; }' | sort-bed - > signal.pseudoChr.bed
$ echo -e "chrPseudo\t0\t248956422" > chrPseudo.extents.bed
$ bedmap --echo --kth 0.9999 chrPseudo.extents.bed signal.pseudoChr.bed > answer.bed

I'm assuming that the ChIP-seq input comes from hg38, which is why I create an "extents" file that extends over the longest chromosome in the real hg38 assembly, i.e., chr1. Whatever intervals come out of that awk statement, they will never extend past the longest, real chromosome. If you're not using hg38, you could change 248956422 to match the longest chromosome in your assembly.

That's a pretty convoluted approach.

Another option might be to just strip all the signal from into its own text file, and run that file through R.

$ bigWigToBedGraph
$ cut -f4 > signal.txt
$ R
> d <- read.table("signal.txt")
> quantile(d$V1, c(0.9999))
...some answer...

Running ?quantile in R should give more detail about that command.

Even though I'd love for everyone to use BEDOPS, using R or even Python numpy.quantile would probably be faster than rejigging a bigWig into a resorted BED file with a fake chromosome name.

Hope this helps!

ADD COMMENTlink modified 9 months ago • written 9 months ago by Alex Reynolds30k

I just wanted to say that your first option worked beautifully! (I already had everything set up to use bedops so it was easy to adapt to). Thank you so much for responding to my question :)

ADD REPLYlink written 8 months ago by rachel30
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 867 users visited in the last hour