Question

Whole-genome percentile values in bigWig files

2

Entering edit mode

5.6 years ago

rachel ▴ 30

Hi all,

I'm looking to compute the k-th percentile value of some bigWig files (ChIP-seq) across the whole genome. For example, I'd like to know the 99.99th percentile value of a bigWig file across the whole genome.

I'm aware of wiggletools histogram (unfortunately does not provide exact percentile values, which I need) and the BEDOPS answer suggested in this question. However, the BEDOPS answer only computes percentile values on a per-chromosome basis, whereas I need whole-genome percentile values. I have tried to modify it to compute percentile values on a whole-genome basis to no avail.

If anyone has any ideas, I would appreciate them!

Thanks,

Rachel

ChIP-Seq bigwig wiggle statistics • 1.5k views

ADD COMMENT • link updated 5.6 years ago by Alex Reynolds 36k • written 5.6 years ago by rachel ▴ 30

score 2 · Accepted Answer · 2019-12-09

Unless I'm missing something about the question, here are a couple possibilities:

Perhaps use a pseudochromosome name for all records:

$ bigWigToBedGraph signal.bw signal.bg
$ awk -vFS="\t" -vOFS="\t" '{ $5 = $4; $4 = $1; $1 = "chrPseudo"; print $0; }' signal.bg | sort-bed - > signal.pseudoChr.bed
$ echo -e "chrPseudo\t0\t248956422" > chrPseudo.extents.bed
$ bedmap --echo --kth 0.9999 chrPseudo.extents.bed signal.pseudoChr.bed > answer.bed

I'm assuming that the ChIP-seq input comes from hg38, which is why I create an "extents" file that extends over the longest chromosome in the real hg38 assembly, i.e., chr1. Whatever intervals come out of that awk statement, they will never extend past the longest, real chromosome. If you're not using hg38, you could change 248956422 to match the longest chromosome in your assembly.

That's a pretty convoluted approach.

Another option might be to just strip all the signal from signal.bg into its own text file, and run that file through R.

$ bigWigToBedGraph signal.bw signal.bg
$ cut -f4 signal.bg > signal.txt
$ R
> d <- read.table("signal.txt")
> quantile(d$V1, c(0.9999))
...some answer...

Running ?quantile in R should give more detail about that command.

Even though I'd love for everyone to use BEDOPS, using R or even Python numpy.quantile would probably be faster than rejigging a bigWig into a resorted BED file with a fake chromosome name.

Hope this helps!