Testing similarity between functional genomic annotations
2
2
Entering edit mode
24 months ago
rodd ▴ 150

Hi folks!

I have 1000+ binary vectors of the same length that consist of 0/1 positions corresponding to variants in the human genome and their functional annotation (i.e. transcription factor binding sites, protein-coding regions, methylation sites, etc.). Thus, each vector differs in the order, frequency, and distribution of 0s and 1s. [These correspond to functional genomic annotations of the human genome based on the ENCODE data, the Roadmap epigenomics data, and other studies, downloaded with the Garfield package].

Example vectors (with a length of 40 bases):

Transcription factor 1 binding sites: 1111111111101000000000000000000000000000
Protein coding region:                0000000000000000000000000000001111111111
DNA methylation sites:                1111111111111000000100000000000000000000
H3K36me3 sites:                       0000011100000111100000001011100000000000


My question is: is there a way to calculate the similarity between these vectors? Would this be a simple correlation? Or maybe there is a tool that allows me to do this? I've been stuck on this for a while, and I thought that maybe you guys could share your thoughts, or redirect me to a relevant post?

Many thanks!

P.S.: Garfield calculates the enrichment of these annotations in GWAS results, but not the enrichment of these annotations in other annotations.

1
Entering edit mode
24 months ago

There are plenty of similarity measures for binary vectors. Check for example the R package proxy but there are also others such as Cohen's kappa. Which measure to use depends on what you're going to be doing with it and how you want to treat each bit of information.

0
Entering edit mode
22 months ago
rodd ▴ 150

I am calculating the similarity between long binary vectors (80 million in length) in R by importing my annotations as matrices (with each column corresponding to an annotation, and each row corresponding to the base pair position in the genome [0/1 if the SNP belongs to an annotation or not]). I am then running the cor(mydatamatrix) function, which results in a matrix containing the Pearson's correlation values across every pair of genomic annotations/columns. To obtain P-values of these correlations, I used:

library(qdapTools)
out <- v_outer(mydatamatrix, function(x, y) cor.test(x, y)[["p.value"]])


... and then adjusted the P-values using the Bonferroni method.

I could not use rcorr() from the Hmisc package, or corr.test() from the psych package, as I bumped into errors saying something like "This function is incompatible with large vectors". The package proxy may have been useful for this, but I did not use it because I believe it does not provide P-values.

This approach is far from perfect (I could not run all annotations at once because each genomic annotation category was too big and R would run out of memory, even in a cluster), but I am running around 300 annotations of the same category (with 256GB RAM). I know I could have trimmed the bases for linkage disequilibrium, but my approach worked quite nicely, with similar functional annotations (varying just in terms of cell lines studied) clustering together quite significantly, so I am sharing it here in case someone find it useful in the future.