Question: Correlation Plot for bed files
christoph.neu wrote, 5 weeks ago:

Hey,

I am looking for a tool that calculates a correlation coefficient for a set of BED files. Preferably it would also provide a visualization, similar to what the combination of multiBamSummary and plotCorrelation from deepTools does for BAM files.

e.g. between all BED files from this Roadmap Epigenomics folder:

wget ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/roadmapepigenomics/by_sample/H1_cell_line/H3K4me3/*.bed.gz

I could use

bedtools jaccard

but I thought there might already be a tool that handles the looping and visualization.
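For reference, the looping itself is straightforward. Below is a minimal, dependency-free sketch of the statistic that bedtools jaccard reports, restricted to a single chromosome with intervals as (start, end) tuples; the real tool parses BED files, requires sorted input, and handles multiple chromosomes, so treat this only as an illustration of the computation.

```python
# Sketch of the Jaccard statistic over genomic intervals:
# J = |intersection| / |union|, measured in base pairs over merged intervals.

def merge(intervals):
    """Merge overlapping or adjacent intervals into a sorted, disjoint list."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def total_len(intervals):
    """Total number of base pairs covered by a disjoint interval list."""
    return sum(end - start for start, end in intervals)

def intersection_len(a, b):
    """Total overlap between two merged interval lists (two-pointer sweep)."""
    i = j = overlap = 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            overlap += hi - lo
        # advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return overlap

def jaccard(a, b):
    """Jaccard similarity of two interval sets, in [0, 1]."""
    a, b = merge(a), merge(b)
    inter = intersection_len(a, b)
    union = total_len(a) + total_len(b) - inter
    return inter / union if union else 0.0
```

Looping this (or bedtools jaccard itself) over all pairs of files fills an N×N matrix that any heatmap function can then plot, e.g. `jaccard([(0, 10), (20, 30)], [(5, 15)])` gives 0.2 (5 bp shared out of 25 bp covered).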

Tags: visualization, bed

Can you provide example data?

— ATpoint
A. Domingues (Dresden, Germany) wrote, 5 weeks ago:

I think Intervene might do the job. Look at the options of the intervene pairwise module: https://intervene.readthedocs.io/en/latest/examples.html#pairwise-module-examples

Intervene provides an easy and automated interface for effective intersection and visualization of genomic region sets or lists of items, thus facilitating their analysis and interpretation. Intervene contains three modules:

- venn to compute Venn diagrams of up to 6 sets
- upset to compute UpSet plots of multiple sets
- pairwise to compute and visualize intersections of genomic sets as a clustered heatmap

https://intervene.readthedocs.io/

Intervene’s pairwise module provides several metrics to assess intersections, including number of overlaps, fraction of overlap, Jaccard statistics, Fisher’s exact test, and distribution of relative distances. Moreover, the user can choose from different styles of heat maps and clustering approaches. https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-017-1708-7

It is also available via a website so you can play with it before installing: https://asntech.shinyapps.io/intervene/
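A hypothetical driver for the pairwise module might look like the sketch below. The flag names (`-i`, `--compute`, `--output`) follow the Intervene documentation, but verify them against `intervene pairwise --help` for your installed version; the actual run line is left commented out.

```python
# Build (and optionally run) an `intervene pairwise` command over all
# .bed.gz files in the current directory. Flag names are taken from the
# Intervene docs -- check `intervene pairwise --help` on your system.
import glob
import subprocess

beds = sorted(glob.glob("*.bed.gz"))
cmd = ["intervene", "pairwise",
       "-i", *beds,
       "--compute", "jaccard",          # other metrics: count, frac, fisher, reldist
       "--output", "intervene_results"]  # output directory (assumed name)

print(" ".join(cmd))
# subprocess.run(cmd, check=True)  # uncomment once intervene is installed
```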

I am not in any way associated with this tool, I just find it very useful.


works great, thanks a lot

— christoph.neu
Alex Reynolds (Seattle, WA, USA) wrote, 5 weeks ago:

I have a set of wrapper scripts around a Pearson r correlation Rscript, which is called on N(N-1) pairwise combinations of files, for each chromosome (or all chromosomes), with jobs scheduled on a Slurm-based cluster.

The main script is called pearson.py:

https://resources.altius.org/~areynolds/public/cor/src/pearson.py

Here's the makefile showing how it is used:

https://resources.altius.org/~areynolds/public/cor/src/makefile

The --by-chr option does pairwise comparisons on a per-chromosome basis. This can be replaced with --all to run a distance or similarity metric on whole-genome signal.

It goes without saying that the signal in pairs of your input files should be binned identically, so that correlation is run on vectors of equal length.
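To make that requirement concrete, here is a small illustration (not part of the scripts above): two hypothetical sets of scored regions are projected onto the same fixed-width bins, which yields two equal-length vectors that can then be correlated. The interval data and bin width are made up for the example.

```python
# Project per-interval scores onto a shared fixed-width binning, then
# compute Pearson's r on the two equal-length vectors.
from math import sqrt

BIN = 100          # bin width in bp (toy value)
CHROM_LEN = 1000   # toy chromosome length

def binned_signal(records, chrom_len=CHROM_LEN, bin_size=BIN):
    """records: (start, end, score) tuples -> per-bin scores on a fixed grid."""
    bins = [0.0] * (chrom_len // bin_size)
    for start, end, score in records:
        for b in range(start // bin_size,
                       min((end - 1) // bin_size + 1, len(bins))):
            # credit the score proportionally to the overlap with each bin
            lo, hi = b * bin_size, (b + 1) * bin_size
            overlap = min(end, hi) - max(start, lo)
            bins[b] += score * overlap / bin_size
    return bins

def pearson(x, y):
    """Pearson's r of two equal-length vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

a = binned_signal([(0, 200, 1.0), (500, 700, 2.0)])
b = binned_signal([(0, 200, 2.0), (500, 700, 4.0)])
print(round(pearson(a, b), 3))  # same shape, scaled 2x -> r == 1.0
```

If the two files were binned on different grids, the vectors would differ in length and the correlation would be undefined, which is exactly the dimension mismatch to avoid.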

If you don't want Pearson r, you could edit line 117 of https://resources.altius.org/~areynolds/public/cor/src/chromCor2 to replace cor() with another distance or similarity metric function call.

The call to pearson.py relies on a metadata table file, containing label and path information in the second and fourth columns.

Here's an example of what that file looks like:

https://resources.altius.org/~areynolds/public/cor/data/set1.table.txt

The second column should contain a unique label for the file specified in the fourth column.
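Reading such a table is a one-liner per row; the sketch below assumes tab-delimited columns (check the linked set1.table.txt for the exact delimiter) and uses made-up labels and paths.

```python
# Parse a metadata table with the label in column 2 and the file path in
# column 4 (1-indexed), as described above. Tab delimiter is an assumption.
import csv
import io

# Stand-in for the real table file, with hypothetical entries.
table = io.StringIO(
    "1\tsampleA\tx\t/data/sampleA.starch\n"
    "2\tsampleB\tx\t/data/sampleB.starch\n"
)

label_to_path = {}
for row in csv.reader(table, delimiter="\t"):
    label, path = row[1], row[3]   # columns 2 and 4, zero-indexed
    label_to_path[label] = path

print(label_to_path)
```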

We use Starch files to save disk space, but you can specify gzipped files in the fourth column.

If you use gzip files, you must:

  1. Modify lines 54-80 in pearson.py to no longer test if the input files are genuine Starch archives.

  2. Modify the call to chromCor2 in pearson_slurm.sh, replacing -s with the -z option, in order to specify gzipped input, instead of Starch input.

Once you run things and have score results, a summary file called scores.mtx is created in the ../results directory.

For visualization, I have a script called mtx2json.py (https://resources.altius.org/~areynolds/public/cor/src/mtx2json.py) that converts a scores.mtx file to a scores.json file.

You can call this by hand, e.g.:

$ ./mtx2json.py < /path/to/scores.mtx > /path/to/scores.json

This JSON file can be brought into my checkerboard tool: https://tools.altiusinstitute.org/checkerboard/

In the checkerboard tool, you paste in a JSON file of pairwise comparison scores, import it, select the desired categories ("labels" from the second column of the metadata table), and render the selection to an SVG-formatted figure.

You could also take the scores.mtx file and use that directly with any number of custom visualization scripts. The mtx2json.py step is for convenience.


Thanks for the answer. However, I don't have a Slurm cluster, and running chromCor2 directly on Starch or BED files gives me a dimension error for rds.

— christoph.neu

Signal vectors need to be of equal length for correlation to work. Maybe the other solution can preprocess files to make them fit, but if you run into a similar error there, take a look at the pair of files causing the problem.

— Alex Reynolds
Powered by Biostar version 2.3.0