Interpreting probes in Illumina 450k array
3.3 years ago

I am analysing the 450K DNA methylation data from TCGA (GDC). I am new to this analysis and have a basic question. Looking at the rowData of the SummarizedExperiment obtained from TCGAbiolinks, i.e., the CpG probe data.frame, there are a few things that I find confusing. First, a single CpG probe is mapped to the same gene multiple times in the Gene_Symbol column. I interpret this as being due to different exons of the gene. But then, what should I take as the position of the CpG site relative to the TSS, given that the CpG maps to the same gene several times, each time with a different position?

Second, there are many CpG probes that map to more than one gene or other element. Would it be preferable to remove such CpG sites? Counting the CpG sites that map to more than one gene or other mRNA yielded more than 100K such probes.
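For reference, the kind of count I mean can be sketched as below. This is only a minimal illustration in Python, not the code I actually ran: it assumes the Gene_Symbol field is a ';'-separated string with one entry per annotated transcript and that '.' marks a missing annotation. The probe IDs and gene names are made up. The point is to separate probes where the same gene is simply repeated from probes that hit more than one distinct gene.

```python
def distinct_genes(gene_symbol_field):
    """Return the set of distinct gene symbols in a ';'-separated field.

    Assumption: entries are separated by ';' and '.' means "no annotation".
    """
    return {g for g in gene_symbol_field.split(";") if g and g != "."}


# Toy annotation: probe ID -> Gene_Symbol field (all values are hypothetical).
annotation = {
    "cg_a": "GENE1;GENE1;GENE1",  # one gene, repeated once per transcript
    "cg_b": "GENE1;GENE2",        # two distinct genes
    "cg_c": ".",                  # no gene annotated
}

# Probes whose field contains more than one *distinct* gene symbol.
multi_gene = sorted(
    probe for probe, field in annotation.items()
    if len(distinct_genes(field)) > 1
)
print(multi_gene)
```

Counting on distinct symbols rather than on the raw number of entries avoids flagging probes such as cg_a, where one gene is listed several times.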

Thanks in advance for any help in this regard.

DNA methylation Illumina 450K CpG • 1.8k views
3.3 years ago

Hello Noor Pratap,

Regarding multiple probes targeting the same gene: it is well documented that each gene can have multiple promoters, each of which can be affected by methylation at CpG islands. Whilst not all promoters contain CpG islands, the majority have been found to harbour these.

Regarding the probes that target more than 1 gene, these are arguably the most interesting and they most likely relate to:

• bidirectional promoters, i.e., promoter regions from which transcription is initiated in both directions along the DNA
• transcripts that overlap each other, such as antisense transcripts, non-coding RNAs, or even protein coding transcripts that overlap and share promoters.

Transcriptional regulation is, of course, not as simple as 1 promoter : 1 TSS : 1 gene.

Trust that this helps.

Kevin


Hi Kevin, thanks for the reply. I understand both points. However, the papers I came across used a gene-level methylation value for integrating expression and methylation data. One approach listed was to compute the average of all CpGs within 1500 bp of the TSS. However, since a single CpG maps to the same gene multiple times, each with a different start site, this is not as straightforward as I thought it would be (perhaps the read aligns to different exons of the same gene, implying the sequence is preserved across them). Anyway, I shall try to figure it out. Thanks for the help.
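To make the averaging approach concrete, here is a rough sketch in Python of one way it could be handled. It is not taken from any of the papers. It assumes each probe carries a ';'-separated gene field and a parallel ';'-separated field of signed distances to the TSS (one entry per transcript), that "within 1500 bp" means absolute distance on either side, and that a probe counts towards a gene if any of that gene's transcripts puts it inside the window. The function names and toy values are hypothetical.

```python
from collections import defaultdict
from statistics import mean


def genes_near_tss(gene_field, tss_field, window=1500):
    """Genes for which at least one transcript's TSS lies within `window` bp.

    Assumption: both fields are ';'-separated and aligned entry by entry,
    with '.' marking a missing value.
    """
    near = set()
    for gene, pos in zip(gene_field.split(";"), tss_field.split(";")):
        if gene in ("", ".") or pos in ("", "."):
            continue
        if abs(int(pos)) <= window:
            near.add(gene)
    return near


def gene_level_beta(probes, window=1500):
    """Average beta per gene over probes within `window` bp of any of its TSSs."""
    per_gene = defaultdict(list)
    for beta, gene_field, tss_field in probes:
        if beta is None:  # skip probes without a measured beta value
            continue
        for gene in genes_near_tss(gene_field, tss_field, window):
            per_gene[gene].append(beta)
    return {gene: mean(values) for gene, values in per_gene.items()}


# Toy probes: (beta, gene field, distance-to-TSS field); values are made up.
probes = [
    (0.5, "GENE1;GENE1", "-200;5000"),  # near one GENE1 transcript only
    (1.0, "GENE1;GENE2", "1200;-300"),  # near both genes
    (0.25, "GENE2", "9000"),            # too far from the TSS, ignored
]
result = gene_level_beta(probes)
print(result)
```

Each probe contributes at most once per gene, however many transcripts of that gene it maps to, so repeated entries do not distort the average.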


Thank you for your answer, Kevin. I have two questions. First, should I take the mean value for multiple probes targeting the same gene? Second, should I use the same beta value for multiple genes targeted by one probe ID?


Thank you for your answer, Kevin. I have a question: should I take the mean value for multiple probes targeting the same gene?

I am not sure, because methylation at different parts of the gene can have different effects. It may need to be decided on a case-by-case basis. By averaging the values, you may lose some important signal. You could check the probes in the UCSC Genome Browser to see exactly where they are targeting.

And the second question: should I use the same beta value for multiple genes targeted by one probe ID?

Oh yes, the beta value should be the same. I would keep these as single entities, though. So, the record could be:

chr  pos   beta gene
chr1 12345 0.9  gene1;gene2
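A minimal sketch of building such a record in Python, assuming the gene annotation arrives as a ';'-separated string that may repeat the same gene and may use '.' for a missing value. The function name and example values are hypothetical.

```python
def probe_record(chrom, pos, beta, gene_field):
    """Return one probe-level record with the distinct genes joined by ';'.

    Assumption: `gene_field` is ';'-separated and '.' means "no annotation".
    The first-seen order of the genes is preserved.
    """
    genes = []
    for gene in gene_field.split(";"):
        if gene and gene != "." and gene not in genes:
            genes.append(gene)
    return (chrom, pos, beta, ";".join(genes))


# One probe annotated to gene1 twice and gene2 once (made-up values).
rec = probe_record("chr1", 12345, 0.9, "gene1;gene1;gene2")
print("\t".join(str(field) for field in rec))
```

The probe stays a single row with a single beta value, and the gene column simply lists every distinct gene it maps to.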


Generally, in methylation studies, in my opinion, the data should always be kept at the level of probes, not genes.