Question

wig2bed and counting the reads in the peaks

1

Entering edit mode

18 months ago

Bogdan ★ 1.4k

Dear all,

I would appreciate having your suggestions about ab analysis that I am performing.

I do have several wig files (A, B, C, D..) , and a list of peaks (P).

The goal is to be able to count the reads in the peaks listed in P in such a way that I can use the results for an analysis of differential binding (with DiffBind, edgeR, DESeq2, etc).

I have used the following two commands :

<1> wig2bed (bedops package) in order to transform the wig files into bed.

wig2bed -x < "${FILE%.bigwig}.wig"  > "${FILE%.bigwig}.bed"

<2> coverageBed --counts (bedtools package)

coverageBed -counts -a PEAKS -b "${FILE%.bigwig}.bed"

The question is the following : what is the proper way to compute the counts in the peaks (P), given the fact that the command wig2bed produces a bed file that has 5 fields (chr, start, end, ID, score), as it shown below. More specifically, how shall I convert the scores (column 5) into peak intensity (counts) ?

1       794250  794350  id-301  1.533330
1       794350  794400  id-302  3.066660
1       794400  794450  id-303  1.533330
1       794450  795800  id-304  0.000000
1       795800  795900  id-305  1.533330
1       795900  795950  id-306  0.000000
1       795950  796000  id-307  3.066660
1       796000  796350  id-308  1.533330
1       796350  796500  id-309  0.000000
1       796500  796550  id-310  1.533330
1       796550  796650  id-311  0.000000
1       796650  796700  id-312  1.533330
1       796700  796750  id-313  4.599990

Thank you.

Bogdan

bedtools ChIP-seq deeptools bedops • 720 views

ADD COMMENT • link updated 10 months ago by Ram 43k • written 18 months ago by Bogdan ★ 1.4k

score 2 · Answer 1 · 2022-10-20

That information is lost. Look at this example:

|               ---- |
|  -------------     |
| -------------      |
|   -------------    |
|      ------------  |
|01233344444444332210|

The dashed horizontal lines are reads and then numbers indicate coverage within that peak, so basically a bedGraph/Wig-like format. The correct count for that peak would be 5 ("correct" in terms of how tools like featureCounts would return it) as it intersects with 5 reads, but there is no way you could know that it is 5 by just looking at the coverage values. Obviously the max of the coverage would be 4 but not 5, the average would not be 5 either, nor would the median be 5. Depending on read length and read distribution within the peak and soft-clipping of reads by the aligner, any estimates based on coverage could wildly be off. I personally would not start from track files for a differential analysis, far too much uncertainty. In the end you anyway need raw data for publication or to ensure reproducibilty of the entire analysis from fastq to figures.

If you google you may find this but this is still only an estimate, the read length in this approach is assumed to be known and constant and depending on trimming of data and soft-clipping of the aligner I personally would assume that results will be approximate at best, but I have not read the linked approach in all detail, nor tested it so take it with a grain of salt.