Question

how to check if DE genes are enriched in custom genomic regions derived from epigenetic data

8 months ago

Al • 0

I haven't found any relevant posts, so here is my question (it's more a question on statistics I believe):

I have 2 samples to compare (control vs condition), and RNA-seq data that I analysed with DESeq2 to find differentially expressed genes (DEGs). I also have some epigenetic data, from which I derived characteristic groups of chromatin (group a, group b, group c). In other words, I have BED files of genomic regions of different sizes that fall into these 3 groups. The full genome is divided among these 3 groups, but they differ in size (both the individual regions and the fraction of the genome per group vary).

Now my question is: how do I check whether there are more DEGs in group a than in the other groups? I really don't know how to approach this from a statistical point of view. There are multiple factors I am confused about:

groups (a, b, c) are not equal in size and are not evenly distributed across the genome, so simply checking whether there are more DE genes in one group than in another wouldn't be accurate, because it doesn't take this into account;
it is possible that there are in general more genes in one group than in another, and I don't know how to take this into account either.

I consider 1000 DEGs for the analysis, and it is the human genome. I look at both protein-coding and non-coding genes. I would probably also be interested in seeing whether there are different patterns for up-regulated and down-regulated genes.

Could anyone help me understand how I should approach this question? Should I perform some sort of statistical test here, and if so, what kind? Could anyone point me to some resources (books, tutorials, courses etc.) that could give me an idea of how to approach this kind of question? Which area of statistics should I dive into?

I don't know if this could be considered multiomics data integration, but if so I would be glad for recommendations of materials on this subject too.

Many thanks for all the replies and help, hope my question was clear enough!

statistics multiomics epigenetics • 747 views

score 2 · Answer 1 · 2023-08-13

Al,

First head to the MSigDB website and take a look at the annotations that belong to set C1: https://www.gsea-msigdb.org/gsea/msigdb/human/collections.jsp#C1. Note that while C1 contains gene sets defined by position, the others are quite different (e.g. C2 is "curated", C8 is "cell type signatures", and so forth). The key here is to realize that what you are describing is just another way of defining a gene set, and as such it is subject to the same types of methods others have used for gene set enrichment analysis, or GSEA.

Step 1, define gene sets: it looks like you've gotten this far. You first need to construct your gene sets of interest by dividing all genes into an in-group and an out-group. This is very similar to what was done in C1, except that instead of lists of genes demarcated by cytobands, you are defining groups of genes based on genomic positions corresponding to epigenetic states from your data. As you have aptly put it, the number (and identity) of genes not belonging to this group is important too; more on that below.
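As a rough illustration of step 1, here is a minimal sketch of turning BED-style regions into gene sets. It assumes each gene is assigned to a group by the position of its transcription start site (TSS), which is only one possible convention and not something stated in the question; overlap with the gene body would be another. The function names (`build_index`, `group_of`) and the toy coordinates are made up for the example.

```python
from bisect import bisect_right
from collections import defaultdict

def build_index(regions):
    """Index BED-style (chrom, start, end, group) records by chromosome.

    Intervals are half-open [start, end) and assumed non-overlapping,
    which matches a genome partitioned into groups a, b and c.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end, group in regions:
        by_chrom[chrom].append((start, end, group))
    index = {}
    for chrom, intervals in by_chrom.items():
        intervals.sort()
        index[chrom] = ([s for s, _, _ in intervals], intervals)
    return index

def group_of(index, chrom, pos):
    """Return the group whose interval contains pos, or None if there is none."""
    if chrom not in index:
        return None
    starts, intervals = index[chrom]
    i = bisect_right(starts, pos) - 1  # last interval starting at or before pos
    if i >= 0:
        start, end, group = intervals[i]
        if start <= pos < end:
            return group
    return None

# Toy data (hypothetical): three regions and four genes given as (chrom, TSS).
regions = [
    ("chr1", 0, 1000, "a"),
    ("chr1", 1000, 5000, "b"),
    ("chr1", 5000, 9000, "c"),
]
genes = {
    "G1": ("chr1", 500),
    "G2": ("chr1", 1000),
    "G3": ("chr1", 8999),
    "G4": ("chr2", 10),  # chromosome with no annotated regions
}

index = build_index(regions)
gene_sets = defaultdict(set)
for gene, (chrom, tss) in genes.items():
    group = group_of(index, chrom, tss)
    if group is not None:
        gene_sets[group].add(gene)

for group in sorted(gene_sets):
    print(group, sorted(gene_sets[group]))
```

The resulting `gene_sets` play the same role as a positional collection like C1: each group is a gene set, and every gene outside it forms the out-group.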

Step 2, select a statistical methodology: you are describing enrichment testing, among the simplest instantiations of which is the hypergeometric test. This is a good place to start reading. Over time, people adapted and tailored these tests further; for instance, the bioinformatics/biostatistical literature details weighted hypergeometric tests, directed hypergeometric tests, etc. Given your treatment of the subject above, these may be of some interest to you. Ultimately, the field converged on a small number of relatively robust gene set enrichment tests, for instance GSEA, among others.
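To make the hypergeometric idea concrete, here is a small standard-library sketch of the one-sided (over-representation) test. It is meant for reading, not as a replacement for a vetted package. The counts are invented, and the choice of "universe" (here, all genes that were tested for differential expression and assigned to a group) is an assumption you would need to make for your own data. Using the number of genes per group as the background is what accounts for groups that differ in size or gene density.

```python
from math import comb

def hypergeom_enrichment_p(n_universe, n_in_set, n_deg, n_deg_in_set):
    """One-sided hypergeometric p-value, P(X >= n_deg_in_set).

    X is the number of gene-set members among n_deg genes drawn without
    replacement from a universe of n_universe genes, n_in_set of which
    belong to the gene set.
    """
    upper = min(n_in_set, n_deg)
    total = comb(n_universe, n_deg)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_deg - k)
        for k in range(n_deg_in_set, upper + 1)
    )
    return tail / total

# Hypothetical counts: 20000 tested genes, 1000 of them DEGs.
n_universe = 20000
n_deg = 1000
genes_per_group = {"a": 4000, "b": 10000, "c": 6000}  # genes assigned to each group
degs_per_group = {"a": 260, "b": 480, "c": 260}       # DEGs falling in each group

for group in sorted(genes_per_group):
    p = hypergeom_enrichment_p(
        n_universe, genes_per_group[group], n_deg, degs_per_group[group]
    )
    expected = n_deg * genes_per_group[group] / n_universe
    print(f"group {group}: observed={degs_per_group[group]} "
          f"expected={expected:.0f} p={p:.3g}")
```

The same function can be run separately on up-regulated and down-regulated DEGs by changing `n_deg` and the per-group counts. With three groups (and two directions) tested, the p-values would normally be corrected for multiple testing.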

Step 3, select an implementation of the method chosen in step 2, or implement your own: I'm going on record here as saying you should not implement something like this on your own, as there is just no reason to. This is a well-studied problem, and well-established software packages already implement the tests you'd likely choose. As a training exercise it could be worthwhile, but if your goal is just to get the science done correctly, I'd recommend choosing a well-vetted implementation rather than re-inventing the wheel. Others have stronger ideas than I do on what the best implementations are; for instance, ATpoint has written very thorough posts on this in the past.

Further, more sophisticated analyses are certainly possible, based on network theory (e.g. ARACNe) or on other ideas from integrated omics. For the latter, although this link pertains specifically to single-cell data, it should give you some ideas nonetheless.

Does that help?