Forum: Data integration between gene expression - RNAseq and DNA methylation - Illumina 450k
gravatar for carlos_marchi
4.7 years ago by
Brazil, São Paulo
carlos_marchi60 wrote:

I'm beginner in the ​​bioinformatics area and I have some doubts on how to work with integration of biological data involving RNA-Seq and DNA methylation (Illumina 450k).

I have done some research on the internet and found several articles related to data integration process based on concatenation, but I've been difficulty in reproducing their experiments in order to learn how to manipulate the data.

I would like to integrate the RNA-seq data with DNA methylation. About the integration process, I imagine that is at GeneRef ID. But observing the methylation data, each sample containing multiple probes of the methylation levels for the genes. Therefore, there are cases where there are more probes for the same gene. Below is an example of DNA Methylation:

Heat: 6005486023_R04C02, IlmnID, CHR UCSC_RefGene_Name, UCSC_RefGene_Group, Relation_to_UCSC_CpG_Island

Data: 0.075187176583887, cg00000029,16, RBL2, TSS1500, N_Shore

I wonder what the treatment should be done to know the level of methylation of a gene having various probes. This right calculate the average of these probes? What is the right technique to get the methylation level for each gene?

I would like to generate a co-expression network just tumor data. Is this make sense?

Can anyone help me, please.

Thank you very much.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by carlos_marchi60
gravatar for carlos_marchi
4.7 years ago by
Brazil, São Paulo
carlos_marchi60 wrote:

Hello vchris_ngs,

I'm a little confused yet.

Looking for endorsement of illumina 450k have the following header:

IlmnID CHR UCSC_RefGene_Name UCSC_RefGene_Group Relation_to_UCSC_CpG_Island

Looking at the data, there are some probes in which columns values are absent. I'm not sure what that means. For example, the column '' UCSC_RefGene_Group '' not being filled denotes intergenic region, as for the column "Relation_to_UCSC_CpG_Island" missing values ​​mean opensea.

It also has missing values ​​for the column "UCSC_RefGene_Name" in this case I do not know what can be and if that is correct.

Here are some example of missing data:

cg00035864 Y TTTY18 TSS1500 cg00213748 Y S_Shore cg00479827 Y NCRNA00185 Body

Another doubt is regarding the list of ref genes which genes are separated by semi-colons. When this happens, there is a list of the genes region. This means for each gene there is a corresponding region of the gene? Example:

cg00272582 Y ZFY; RBMY2FP; RBMY1F, TSS1500; TSS200; TSS1500, Island

Would be as follows:


This means that for each identified gene has the same DNA methylation level and that this level are methylation regions of each gene: TSS200 and TSS1500 and all those containing CpG islands?

Finally, I thought of creating a matrix where there are 3 columns: id gene, expression level and level of methylation. So, apply some statistical method in this matrix so that then can generate a network. Remembering, that will generate a network of tumor status of patients.

gene id | gene expression value | methylation level

ZFY | 12 | 0.2 RBMY2FP | -10 | 0.8 RBMY1F | 5 | 0.1

Note that is apply log base 2 in the gene expression values and methylation level values are value-B

This would be correct? Or should I include more columns to respect the regions and the state of methylation of the genes? If so, How can i get levels of methylation for each gene?

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by carlos_marchi60

Sorry am a bit confused about the data you are trying to show. Are you using TCGA data or your own custom data? If TCGA data for both 450k and RNAseq take a look at this tool in link which shows how you can create matrices for RNAseq across samples and probes for 450k with intensity and also gene names. Then you can use them farther in R to create common probes from gene features across both platforms.

  1. If you are using your custom data of your lab then Did you analyze your 450k data and create a matrix from your samples with intesity values across samplss with row.names as probes? If so show them here and map them to various features. Show a head of the matrix with a proper formatting for me to read it.
  2. Can you show me your gene expressiom matrix with gene id across all samples?

In any case if you have both matrix then you should be able to merge them based on gene id once you annotate your probes from 450k to gene ids based on features you want to see. Either CpG sites /shores /islands/ , promoters, etc.

Please reformat your query and also add things I asked then I can help more.

For TSS200 and TSS1500 are distance from TSS and see which distances refers to as promoters, islands and shores.

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by ivivek_ngs5.1k


All data were extracted from TCGA, Tumor condition, breast cancer disease.

mRNA and miRNA expression data are the correct pattern. The level of these data are 3. They are in a matrix containing the values ​​of gene expression (mRNA and miRNA).

In the case of methylation data, patient samples were obtained from the level 1. Because of this, I needed apply some data analysis using pipeline from Illumina 450K platform. Thus, each sample containing multiple probes with the methylation levels of the genes. We obtain the level 1, because it has more information of CpG islands, Chromosome, gene region.

The problem is that I do not know how to work with the methylation data, as each patient sample contains several probes that may be associated with a single gene. How can I make the integration of these data?

I was thinking of generating an array for each methylation which has the expression levels of each gene. Another possibility is to create an matrix with the methylation levels of genes in the region of the gene: the promoter and body. In this case we would have a matrix containing the gene methylation levels in the promoter and the body region.

After each set of data having its own representation I would make the process of integration, resulting in a single matrix. This merge would be based on the id gene. After this, it would verify a statistical method to be applied in this matrix. Then, a network co-expression would be generated and topological analysis would be applied in this network trying to find genes associated with disease. The resulting in a gene prioritization process.

That future would walk for personalized medicine.

I based myself in the following articles:

So, I'd like to apply the concatenation and/or transformation-based integration or early and/or intermediated integration process.

ADD REPLYlink written 4.7 years ago by carlos_marchi60

I am not able to understand the motivation of why you are selecting the raw data here? You can directly work with data of 450k from TCGA, here is how you do it, below code from the tool TCGA2STAT Import methylation expression

DNA methylation profiles were obtained either via HumanMethylation450 BeadChip. The platform probes for 450k contains ~450,000 CpG sites. The package allows users to import either methylation data when available, via type="27K" (default) or type="450K" here we use 450K. For example:

or equivalently

methyl.ov <- getTCGA(disease="OV", data.type="Methylation")

Get 450K methylation profiles for ovarian cancer patients

methyl45.ov <- getTCGA(disease="OV", data.type="Methylation", type="450K")

Note that the matrix of the methylation profiles returned is NOT aggregated at the gene level. Each row in the data matrix (dat) is a probe from the methylation assay, which represents a CpG site. Since genes often contain more than one CpG site and each CpG site can differ significantly in the methylation level, gene level aggregation is less desirable. Hence, our package returns the probe-level data.

Look at the data

List of 4
 $ dat       : num [1:27578, 1:612] 0.7994 0.339 0.0281 0.6012 NA ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:27578] "cg00000292" "cg00002426" "cg00003994" "cg00005847" ...
  .. ..$ : chr [1:612] "TCGA-01-0628-11A-01D-0383-05" "TCGA-01-0630-11A-01D-0383-05" "TCGA-01-0631-11A-01D-0383-05" "TCGA-01-0633-11A-01D-0383-05" ...
 $ cpgs      :'data.frame':  27578 obs. of  3 variables:
  ..$ Gene_Symbol       : chr [1:27578] "ATP2A1" "SLMAP" "MEOX2" "HOXD3" ...
  ..$ Chromosome        : chr [1:27578] "16" "3" "7" "2" ...
  ..$ Genomic_Coordinate: int [1:27578] 28890100 57743543 15725862 177029073 148822837 93862594 93813777 11980953 89290921 0 ...
 $ clinical  : NULL
 $ merged.dat: NULL

               TCGA-01-0628-11A-01D-0383-05 TCGA-01-0630-11A-01D-0383-05 TCGA-01-0631-11A-01D-0383-05
    cg00000292                   0.79940858                   0.62039417                   0.91148841
    cg00002426                   0.33900444                   0.18030460                   0.29385049
    cg00003994                   0.02811930                   0.03607298                   0.06159123
    cg00005847                   0.60116497                   0.64955777                   0.47394908
    cg00006414                           NA                           NA                           NA
    cg00007981                   0.01881682                   0.01803597                   0.02904295

             Gene_Symbol Chromosome Genomic_Coordinate
cg00000292        ATP2A1         16           28890100
cg00002426         SLMAP          3           57743543
cg00003994         MEOX2          7           15725862
cg00005847         HOXD3          2          177029073
cg00006414 ZNF425;ZNF398          7          148822837
cg00007981         PANX1         11           93862594

The above example is for OV data you can do it for Breast. The last matrix shows probes to genes and you can tax this dataframe and merge probes with RNA-Seq data at gene level and build a big matrix with probes, gene id, chr, pos. gene exp (values for samples)

There is no point for generating and reworking unless you want to build some pipeline of your own or you want to use some custom methylation analysis pipeline. If you are not aware of how to analyze methylation data in that case you have to first learn what and how the data is generate and what tools are used for its analysis and then see which tools allows you to build such matrices with probes id mapped to gene id with normalized intensity values.

So as of now you should be able to work with the normalized beta values from the TCGA directly (level 3)

Hope it is clear

ADD REPLYlink modified 4.7 years ago • written 4.7 years ago by ivivek_ngs5.1k


That's clear for me now! I will try to do it. I think it will work for us.

So, about the motivations to get data is because the system that I've implemented in which the user provide the data to analyse. Then, the TCGA data has used just testing the system.

Thank you very much for all patience in explaining the concepts.

ADD REPLYlink written 4.7 years ago by carlos_marchi60

Just to add, if my answers and comments are useful for you then accept them as answers and also put upvotes so that people can use it for future reference.

ADD REPLYlink written 4.7 years ago by ivivek_ngs5.1k
gravatar for ivivek_ngs
4.7 years ago by
Seattle,WA, USA
ivivek_ngs5.1k wrote:

You can do it , take a look at this links below where you will find similar discussion. There is also a mention of a package in R that does that.

What is a typical workflow to correlate methylation and expression data?

There are two tools mentioned in the above link that can be used to exploit for your usage

  1. MethylMix


These should be able to clear your doubts about the integration methods.

So there are different strategies that can be implemented. DNA methylation probes are single bases association giving intensity values which are normalized to give beta values and based on threshold scores the spots are termed as hyper or hypo-methylated. Now you can identity these spots for CpG sites, CpG islands, CpG shores, gene body, promoters , etc and try to identify the corresponding genes from the annotation and then correlate these genes with the gene expression data or differentially expressed genes. You can also try to see if you have differentially methylated sites (at any level, or lets say shorten them to Promoter regions , retrieve gene names) and try to see if these genes are differentially expressed or not. This gives you an idea of hyper of hypo methylated spots that correlate or anti correlate with gene expression fulfilling your standard hypothesis.

About the network association it is a different story all together which will be farther downstream to you integration based on the hits and if that gets a considerable amount of DEGs(differentially expressed genes with both up and down) and specific pathway corroborating your hypothesis then you can show them in a network mapper or if your methylated genes have different levels of gene expression and are not significant DEGs you might want to infer co-expression taking those hits of genes with their expression values and throw them to package WGCNA for inferring co-expression network. (this approach might be very ambitious but you can give it a try depending upon the number of hits I am guessing about it is plausibility).

I hope this will help for a starting point and may be I am able to help with some pointers.

ADD COMMENTlink modified 4.7 years ago • written 4.7 years ago by ivivek_ngs5.1k


Thank you very much for the information. I have differentially expressed genes from Transcriptome analysis. And right now I'm working on Illumina methylation Data using EPIC (850k) data. I have the beta values.

So, now as you said based on cutoff (beta value 0.2) I have some probes like cg18478105, cg09835024 etc... From the Annotation I also have their symbols.

Now I want compare the expression values from Transcriptome analysis with methylation levels of each gene. Can I use MethylMix or COHCAP for that? As my methylation data is from EPIC platform I didn't find any proper tool. When I looked into COHCAP it is only for 27k and 450k.

Can you please help me in this? Any advice is greatly appreciated. Thank you in Advance

ADD REPLYlink written 4.6 years ago by Vasu530

I do not think it will be a problem for 850k data, I have not worked with 850k but usually they are just a deeper resolution of what you see now. If you have used 450k similarly you can work with 850k as well. Otherwise you can always take a look at rnbeads . Here you go with the analysis doc . You can try out all three and compare the results and see which gives you more suitable result based on your biological hypothesis. At the end of the day you can integrate probes with gene expression. You have to narrow them down from sites to islands, shores, promoters and this will then we narrowed down to gene symbols. Now these gene symbols will be a surrogate showing up hyper or hypo methylation across you category of interest or even differential methylated genes (coming from either of the one CpG islands, CpG shores or promoters) for your category of samples taken into consideration. Then you can look for these genes for their expression from transcriptomic data if you have a summarized matrix of fpkm or tpm expression values at gene level for all your samples. You can simply plot them to see how they behave or change. I think this should be clear if not then write a new question with clear motivation of what you wanted to perform and where you are facing the current challenge.

ADD REPLYlink written 4.6 years ago by ivivek_ngs5.1k


Could you please reply to my comment. Thank you.

ADD REPLYlink written 4.6 years ago by Vasu530
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1144 users visited in the last hour