Question

Merging and intersecting genomic ranges for an indefinite number of files

0

Entering edit mode

16 months ago

sorrymouse ▴ 120

I have data that looks like this:

Chr Start   End Strand  log.fc  PeakID  Annotation  Detailed Annotation Distance to TSS Gene Name   Gene Description    Gene Type
chr3R   28540016    28540264    C   1.646677294 S2_cell_5ug_1   5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 184 Ppn Papilin proteinCcoding
chrX    3692952 3693300 +   1.641962533 S2_cell_5ug_2   5 UTR (NM_166984, exon 2 of 8)  5 UTR (NM_166984, exon 2 of 8)  3763    Mnt Mnt proteinCcoding
chr2L   9177079 9177677 +   1.607698006 S2_cell_5ug_3   5 UTR (NM_001201819, exon 2 of 14)  5 UTR (NM_001201819, exon 2 of 14)  1020    tai taiman  proteinCcoding
chr2R   23870524    23870872    +   1.5819139   S2_cell_5ug_4   5 UTR (NM_079109, exon 1 of 3)  5 UTR (NM_079109, exon 1 of 3)  425 ken ken and barbie  proteinCcoding
chr3L   539884  540582  C   1.569726505 S2_cell_5ug_5   5 UTR (NM_138193)   5 UTR (NM_138193)   381 klar    klarsicht   proteinCcoding

So it is tab separated but does have spaces within the columns.

I would like to be able to compare these ranges to other files of the same type and structure, but without being limited by the number of files. These are the steps I would like to take:

Make an index column from the first file that contains all the genomic ranges
Check against another file if the genomic ranges overlap
If it does append the data
If it doesn't, add the range to the 'index' column
Leave empty columns for the first set of data and append the second set.

So if I had a second file like so:

Chr Start   End Strand  log.fc  PeakID  Annotation  Detailed Annotation Distance to TSS Gene Name   Gene Description    Gene Type
chr3R   28540016    28540314    -   0.171281417 m6ace_S2_peaks_2039 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 43872   Ppn Papilin protein-coding
chr2L   9177129 9177677 +   0.399838989 m6ace_S2_peaks_53   5 UTR (NM_001201819, exon 2 of 14)  5 UTR (NM_001201819, exon 2 of 14)  34242   tai taiman  protein-coding
chr2R   23870474    23870922    +   0.238601528 m6ace_S2_peaks_875  5 UTR (NM_079109, exon 1 of 3)  5 UTR (NM_079109, exon 1 of 3)  37785   ken ken and barbie  protein-coding
chr3L   440634  441032  -   0.256231658 m6ace_S2_peaks_678  5 UTR (NM_001103992, exon 1 of 4)   5 UTR (NM_001103992, exon 1 of 4)   38067   klar    klarsicht   protein-coding
chr3L   391975  392173  -   0.110795274 m6ace_S2_peaks_3280 3UTR (NR_124740)    3UTR (NR_124740)    38065   trh trachealess protein-coding

Here is an example of how it would look:

Index_Chr   Index_Start Index_End   Chr Start   End Strand  log.fc  PeakID  Annotation  Detailed Annotation Distance to TSS Gene Name   Gene Description    Gene Type   Chr Start   End Strand  log.fc  PeakID  Annotation  Detailed Annotation Distance to TSS Gene Name   Gene Description    Gene Type
chr3R   28540016    28540264    chr3R   28540016    28540264    C   1.646677294 S2_cell_5ug_1   5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 184 Ppn Papilin proteinCcoding  chr3R   28540016    28540314    -   0.171281417 m6ace_S2_peaks_2039 5 UTR (NM_176575, exon 1 of 18) 5 UTR (NM_176575, exon 1 of 18) 43872   Ppn Papilin protein-coding
chrX    3692952 3693300 chrX    3692952 3693300 +   1.641962533 S2_cell_5ug_2   5 UTR (NM_166984, exon 2 of 8)  5 UTR (NM_166984, exon 2 of 8)  3763    Mnt Mnt proteinCcoding                                              
chr2L   9177079 9177677 chr2L   9177079 9177677 +   1.607698006 S2_cell_5ug_3   5 UTR (NM_001201819, exon 2 of 14)  5 UTR (NM_001201819, exon 2 of 14)  1020    tai taiman  proteinCcoding  chr2L   9177129 9177677 +   0.399838989 m6ace_S2_peaks_53   5 UTR (NM_001201819, exon 2 of 14)  5 UTR (NM_001201819, exon 2 of 14)  34242   tai taiman  protein-coding
chr2R   23870524    23870872    chr2R   23870524    23870872    +   1.5819139   S2_cell_5ug_4   5 UTR (NM_079109, exon 1 of 3)  5 UTR (NM_079109, exon 1 of 3)  425 ken ken and barbie  proteinCcoding  chr2R   23870474    23870922    +   0.238601528 m6ace_S2_peaks_875  5 UTR (NM_079109, exon 1 of 3)  5 UTR (NM_079109, exon 1 of 3)  37785   ken ken and barbie  protein-coding
chr3L   539884  540582  chr3L   539884  540582  C   1.569726505 S2_cell_5ug_5   5 UTR (NM_138193)   5 UTR (NM_138193)   381 klar    klarsicht   proteinCcoding  chr3L   440634  441032  -   0.256231658 m6ace_S2_peaks_678  5 UTR (NM_001103992, exon 1 of 4)   5 UTR (NM_001103992, exon 1 of 4)   38067   klar    klarsicht   protein-coding
chr3L   391975  392173                                                  chr3L   391975  392173  -   0.110795274 m6ace_S2_peaks_3280 3UTR (NR_124740)    3UTR (NR_124740)    38065   trh trachealess protein-coding

Again, the most important thing would be being able to do this again and again with a third, fourth, fifth file..

I have trying to figure this in GenomicRanges in R but honestly I'm in over my head. If anyone can help it would be much appreciated.

miclip genomicranges • 938 views

ADD COMMENT • link updated 16 months ago by seidel 11k • written 16 months ago by sorrymouse ▴ 120

score 0 · Answer 1 · 2022-12-14

0

Entering edit mode

16 months ago

Alex Reynolds 35k

Most BEDOPS bedops operations will let you specify as many files as your operating system allows (ulimit -n typically defaults to 1021 available files, which can be adjusted), e.g., to merge N sorted files:

bedops --merge A.bed B.bed ... N.bed > answer.bed

Or to take the multiset union (different from a merge):

bedops --everything A.bed B.bed ... N.bed > answer.bed

With this pattern, you can build more complex mapping operations using process substitutions, e.g., consider the map file in a bedmap operation:

bedmap --echo --echo-map reference.bed map.bed > answer.bed

This can be directly replaced with the standard output from a bedops operation. For example:

bedmap --echo --echo-map reference.bed <(bedops --everything A.bed B.bed ... N.bed) > answer.bed

ADD COMMENT • link 16 months ago by Alex Reynolds 35k

0

Entering edit mode

These don't actually do what I want - the first is similar to GRanges reduce (I think) but does not output overlaps in the different datasets and the second does not indicate when ranges are overlapping it just combines all the data.

ADD REPLY • link 16 months ago by sorrymouse ▴ 120

0

Entering edit mode

It might help to go with simpler examples and to look at the options to bedops and bedmap, which can help with calculating or aggregating specific values from typical BED columns.

ADD REPLY • link 16 months ago by Alex Reynolds 35k

score 0 · Answer 2 · 2022-12-14

0

Entering edit mode

16 months ago

Matthias Zepper 4.5k

COITrees are a highly efficient data structure to accommodate a large (yet not indefinite) number of integer interval sets and run operations on them.

ADD COMMENT • link 16 months ago by Matthias Zepper 4.5k

score 0 · Answer 3 · 2022-12-14

Another option is to use PyRanges in Python, iterating over a list of files, e.g., something like:

import pyranges as pr

fns = ['A.bed', 'B.bed', ..., 'N.bed']
sets = [pr.read_bed(x) for x in fns]
merged = pr.concat(sets).merge(strand=False)

You can use foo.apply(fnct) to apply a function fnct on rows of a PyRanges object foo, which can contain custom Pandas and PyRanges operations to process rows against data in one or more Pandas dataframes or PyRanges objects.

score 0 · Answer 4 · 2022-12-14

It's not clear to me what you want to do, but it seems like GRanges has all the things I can imagine you would need, and would allow oyu to stay in one environment and test all your assumptions about how things are working. You can read in your files and makeGRangesFromDataFrame() while keeping all your extra columns. You can do any sort of manipulation you would like - intersect, reduce, identify non-overlapping regions (i.e. merge or intersect using any pattern you like). You can add columns to your GRanges objects arbitrarily, so you can keep track of segment IDs or create indexes as you wish. You can create new structures, and you can loop over any sets of structures using whatever logic you want to employ (thus apply to essentially an infinite number of files. As described, the notions of adding indexes, appending or leaving column spaces, suggests to me there's probably something more simple and fundamental that you just have to work out (IMHO).