Question: Whole-genome percentile values in bigWig files
2
gravatar for rachel
6 weeks ago by
rachel30
rachel30 wrote:

Hi all,

I'm looking to compute the k-th percentile value of some bigWig files (ChIP-seq) across the whole genome. For example, I'd like to know the 99.99th percentile value of a bigWig file across the whole genome.

I'm aware of wiggletools histogram (unfortunately does not provide exact percentile values, which I need) and the BEDOPS answer suggested in this question. However, the BEDOPS answer only computes percentile values on a per-chromosome basis, whereas I need whole-genome percentile values. I have tried to modify it to compute percentile values on a whole-genome basis to no avail.

If anyone has any ideas, I would appreciate them!

Thanks,

Rachel

ADD COMMENTlink modified 6 weeks ago by Alex Reynolds29k • written 6 weeks ago by rachel30
2
gravatar for Alex Reynolds
6 weeks ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Unless I'm missing something about the question, here are a couple possibilities:

Perhaps use a pseudochromosome name for all records:

$ bigWigToBedGraph signal.bw signal.bg
$ awk -vFS="\t" -vOFS="\t" '{ $5 = $4; $4 = $1; $1 = "chrPseudo"; print $0; }' signal.bg | sort-bed - > signal.pseudoChr.bed
$ echo -e "chrPseudo\t0\t248956422" > chrPseudo.extents.bed
$ bedmap --echo --kth 0.9999 chrPseudo.extents.bed signal.pseudoChr.bed > answer.bed

I'm assuming that the ChIP-seq input comes from hg38, which is why I create an "extents" file that extends over the longest chromosome in the real hg38 assembly, i.e., chr1. Whatever intervals come out of that awk statement, they will never extend past the longest, real chromosome. If you're not using hg38, you could change 248956422 to match the longest chromosome in your assembly.

That's a pretty convoluted approach.

Another option might be to just strip all the signal from signal.bg into its own text file, and run that file through R.

$ bigWigToBedGraph signal.bw signal.bg
$ cut -f4 signal.bg > signal.txt
$ R
> d <- read.table("signal.txt")
> quantile(d$V1, c(0.9999))
...some answer...

Running ?quantile in R should give more detail about that command.

Even though I'd love for everyone to use BEDOPS, using R or even Python numpy.quantile would probably be faster than rejigging a bigWig into a resorted BED file with a fake chromosome name.

Hope this helps!

ADD COMMENTlink modified 6 weeks ago • written 6 weeks ago by Alex Reynolds29k
1

I just wanted to say that your first option worked beautifully! (I already had everything set up to use bedops so it was easy to adapt to). Thank you so much for responding to my question :)

ADD REPLYlink written 6 days ago by rachel30
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1595 users visited in the last hour