Merge several bigWig files (with different intervals) into single GRanges object
2
1
Entering edit mode
16 months ago
Roman Hillje ▴ 60

Hi everybody,

I've been banging my head against the wall for a few hours and finally decided to ask for some help here. What I need to do is relatively straight-forward:

I have a number of bigWig files, which were generated by deepTools and belong to two experiments groups, that I would like to plot using Gviz (Bioconductor).

In order to properly use the grouping feature of Gviz, it seems that I need a single GRanges object for each group with a column for each replicate. Since the intervals in the different samples/replicates are different, they can't simply be merged, or at least I haven't figured out how to do it.

I tried to make a very simple example with two GRanges objects:

df_1 <- data.frame(
chr = 'chr1',
start = c(1,200),
end = c(100,300),
strand = '*',
score = c(10,5)
)

gr_1 <- makeGRangesFromDataFrame(df_1, keep.extra.columns = TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#      seqnames    ranges strand |     score
#         <Rle> <IRanges>  <Rle> | <numeric>
#  [1]     chr1     1-100      * |        10
#  [2]     chr1   200-300      * |         5
#  -------
#  seqinfo: 1 sequence from an unspecified genome; no seqlengths

df_2 <- data.frame(
chr = 'chr1',
start = c(50,150),
end = c(150,250),
strand = '*',
score = c(2,4)
)

gr_2 <- makeGRangesFromDataFrame(df_2, keep.extra.columns = TRUE)
# GRanges object with 2 ranges and 1 metadata column:
#       seqnames    ranges strand |     score
#          <Rle> <IRanges>  <Rle> | <numeric>
#   [1]     chr1    50-150      * |         2
#   [2]     chr1   150-250      * |         4
#   -------
#   seqinfo: 1 sequence from an unspecified genome; no seqlengths

The output I would like to have is this:

chr    start end score_1 score_2
chr1       1  50      10       0
chr1      50 100      10       2
chr1     100 150       0       2
chr1     150 200       0       4
chr1     200 250       5       4
chr1     250 300       5       0

In words, it is the minimum number of rows that still correctly reflects the "score" for both samples.

Is there a way to do this?

I would appreciate any kind of help or suggestions.

Best, Roman

ChIP-Seq R Gviz GenomicFeatures rtracklayer • 689 views
1
Entering edit mode
16 months ago

You could do something efficiently via bedmap on the command-line:

$bedmap --echo --echo-map-score --delim '\t' <(bedops --partition A.bed B.bed) <(bedops --everything A.bed B.bed) > answer.bed Here, the reference set (from which intervals are derived) is the disjoint partition of sets A and B. The map set (from which scores are derived) is the multiset union of sets A and B. To get sorted BED files (to make sets A, B, etc.), you could convert bigWig to Wiggle, and then convert Wiggle to BED:$ bigWigToWig in.bw out.wig
$wig2bed < out.wig > out.bed ADD COMMENT 0 Entering edit mode Thanks for the suggestion, I'll look into that in a bit more detail tomorrow. Perhaps I should mention, however, that I'll be working on at least 16 profiles (in groups of 4), and ideally 108 profiles in total. It's no problem at all to plot them as separate tracks (actually I'm impressed by how fast it's done), but I can't figure out how to group them together. I'll should probably also reach out to the Gviz developer(s) directly to see if they have any suggestion. ADD REPLY 0 Entering edit mode 16 months ago enho ▴ 20 You can make a Granges object with the tile width that you want, then by using GenomicRanges::findOverlaps() find the score value for each one. First you need to define your sequence info for genomic ranges object: seqi = Seqinfo("chr1",1000) # Seqinfo object with 1 sequence from an unspecified genome: # seqnames seqlengths isCircular genome # chr1 1000 NA <NA> More practically when using genome data, you can get your sequence info this way: ## for hg19 seqi_19 = SeqinfoForUCSCGenome("hg19") ## for hg38 seqi_38 = SeqinfoForUCSCGenome("hg38") # Seqinfo object with 595 sequences from hg38 genome: # seqnames seqlengths isCircular genome # chr1 248956422 <NA> hg38 # chr2 242193529 <NA> hg38 # chr3 198295559 <NA> hg38 # chr4 190214555 <NA> hg38 # chr5 181538259 <NA> hg38 # ... ... ... ... # chrUn_KI270753v1 62944 <NA> hg38 # chrUn_KI270754v1 40191 <NA> hg38 # chrUn_KI270755v1 36723 <NA> hg38 # chrUn_KI270756v1 79590 <NA> hg38 # chrUn_KI270757v1 71251 <NA> hg38 After that, you just need to make a new Granges object: gr_comb = tileGenome(seqi,tilewidth = 50000,cut.last.tile.in.chrom = TRUE) # GRanges object with 20 ranges and 0 metadata columns: # seqnames ranges strand # <Rle> <IRanges> <Rle> # [1] chr1 1-50 * # [2] chr1 51-100 * # [3] chr1 101-150 * # [4] chr1 151-200 * # [5] chr1 201-250 * # ... ... ... ... # [16] chr1 751-800 * # [17] chr1 801-850 * # [18] chr1 851-900 * # [19] chr1 901-950 * # [20] chr1 951-1000 * # ------- # seqinfo: 1 sequence from an unspecified genome And then just find the overlap (Which tells you which regions from first Granges overlap with which regions from the second Granges): overlap.1 = findOverlaps(gr_comb,gr_1) # Hits object with 5 hits and 0 metadata columns: # queryHits subjectHits # <integer> <integer> # [1] 1 1 # [2] 2 1 # [3] 4 2 # [4] 5 2 # [5] 6 2 # ------- # queryLength: 20 / subjectLength: 2 overlap.2 = findOverlaps(gr_comb,gr_2) # Hits object with 6 hits and 0 metadata columns: # queryHits subjectHits # <integer> <integer> # [1] 1 1 # [2] 2 1 # [3] 3 1 # [4] 3 2 # [5] 4 2 # [6] 5 2 # ------- # queryLength: 20 / subjectLength: 2 Finally, you just need to fill out the scores in the new Granges object: gr_comb$score_1 = NA_real_ ## or you can use 0
gr_comb$score_2 = NA_real_ ## or you can use 0 gr_comb[queryHits(overlap.1)]$score_1 = gr_1[subjectHits(overlap.1)]$score gr_comb[queryHits(overlap.2)]$score_2 = gr_2[subjectHits(overlap.2)]\$score
gr_comb
# GRanges object with 20 ranges and 2 metadata columns:
#        seqnames    ranges strand |   score_1   score_2
#           <Rle> <IRanges>  <Rle> | <numeric> <numeric>
#    [1]     chr1      1-50      * |        10         2
#    [2]     chr1    51-100      * |        10         2
#    [3]     chr1   101-150      * |      <NA>         4
#    [4]     chr1   151-200      * |         5         4
#    [5]     chr1   201-250      * |         5         4
#    ...      ...       ...    ... .       ...       ...
#   [16]     chr1   751-800      * |      <NA>      <NA>
#   [17]     chr1   801-850      * |      <NA>      <NA>
#   [18]     chr1   851-900      * |      <NA>      <NA>
#   [19]     chr1   901-950      * |      <NA>      <NA>
#   [20]     chr1  951-1000      * |      <NA>      <NA>
#   -------
#   seqinfo: 1 sequence from an unspecified genome

NOTE: In your example for the desired results, you have defined two values for regions like chr1:50, chr1:100, ... This method gives you results for every 50bp, which I guess is what you meant.