Question: Analysis of DNA Methylation data from TCGA
6 days ago, adriankoh002 wrote:

Hi all,

I am new to bioinformatics and am working on a school ML project that uses DNA methylation data from TCGA to distinguish cancerous tissues from normal/control tissues, as well as to distinguish between different types of cancer (e.g. BRCA vs. COAD).

I have downloaded the "Illumina Human Methylation 450" level 3 data set and would like to know how I can tell whether a sample is cancerous or normal/healthy. This is not indicated in the files, and I need the information in order to create a training data set.

The headers included in the data-sets which I have downloaded are as follows:

1) Composite Element REF 2) Beta_Value 3) Chromosome 4) Start 5) End 6) Gene Symbol 7) Gene Type 8) Transcript ID 9) Position to TSS 10) CGI Coordinate 11) Feature Type

This seems to be quite different from the headers present in another similar study, which has a header that states the class of the tumor (normal or cancerous).

Thus, can I check whether the "class" header is present in the level 2 or level 1 data instead? Or is the class distinguished via the assigned TCGA barcode?

Moreover, I am currently analyzing the .txt data in Excel. Are there any other methods available instead? Any help will be greatly appreciated, as I am currently quite lost and worried about the lack of progress.



6 days ago, Kevin Blighe (Republic of Ireland) wrote:


Your file names should look something like this: jhu-usc.edu_UCEC.HumanMethylation27.3.lvl-3.TCGA-BK-A0CB-11A-33D-A10Q-05.gdc_hg38.txt. The part TCGA-BK-A0CB-11A-33D-A10Q-05 is the full TCGA barcode, and we can tell that this is a normal sample from the 11A (the fourth field, which holds a two-digit sample type code followed by a vial letter). Conversely, jhu-usc.edu_UCEC.HumanMethylation450.12.lvl-3.TCGA-AX-A2HF-01A-11D-A17F-05.gdc_hg38.txt is a tumour sample, due to the 01A.

The encoding goes like this:

  • tumor types range from 01 - 09
  • normal types from 10 - 19
  • control samples from 20 - 29

It is thus possible to obtain a listing of all files in R and then determine the tissue type by using a regular expression (regex) to pattern match on the TCGA barcode.
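The answer suggests doing this in R; as a rough illustration of the same idea, here is a minimal sketch in Python instead. The function names `sample_class` and `classify_directory`, and the assumption that all files sit in one directory with a `.txt` extension, are hypothetical and not part of the original answer.

```python
import re
from pathlib import Path

# Captures the two-digit sample-type code from a TCGA barcode,
# e.g. "11" in TCGA-BK-A0CB-11A-33D-A10Q-05.
BARCODE_RE = re.compile(r"TCGA-\w{2}-\w{4}-(\d{2})[A-Z]")


def sample_class(filename):
    """Return 'tumour', 'normal' or 'control', or None if no barcode is found."""
    match = BARCODE_RE.search(filename)
    if match is None:
        return None
    code = int(match.group(1))
    if 1 <= code <= 9:
        return "tumour"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return None


def classify_directory(directory):
    """Map each .txt file name in a (hypothetical) download directory to its class."""
    return {p.name: sample_class(p.name) for p in Path(directory).glob("*.txt")}


# The two file names quoted in the answer above.
names = [
    "jhu-usc.edu_UCEC.HumanMethylation27.3.lvl-3.TCGA-BK-A0CB-11A-33D-A10Q-05.gdc_hg38.txt",
    "jhu-usc.edu_UCEC.HumanMethylation450.12.lvl-3.TCGA-AX-A2HF-01A-11D-A17F-05.gdc_hg38.txt",
]
for name in names:
    print(sample_class(name), name)
```

The resulting labels can serve as the class column of a training data set.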

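Once you have identified the tumour and normal samples, a possible analysis is a Wilcoxon signed-rank test (if the tumour and normal samples are matched pairs from the same patients; otherwise the unpaired Wilcoxon rank-sum test, also known as the Mann-Whitney U test) that compares the Beta (β) values in tumour versus normal and derives a p-value for each probe (CpG site), which can then be related to genes via the Gene Symbol column. The difference in mean β (i.e. mean(tumour) - mean(normal)) should also be obtained.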
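To make the comparison concrete, here is a minimal Python sketch for a single probe, using only the standard library. Note that it implements the unpaired Wilcoxon rank-sum (Mann-Whitney U) test with a normal approximation and no continuity correction, not the signed-rank test; for matched tumour-normal pairs a signed-rank test such as `scipy.stats.wilcoxon` would be the appropriate choice, and `scipy.stats.mannwhitneyu` is the usual library routine for the unpaired case. The function name and the β values are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist, mean


def rank_sum_p_value(tumour, normal):
    """Two-sided Wilcoxon rank-sum p-value (normal approximation, tie-corrected)."""
    n1, n2 = len(tumour), len(normal)
    n = n1 + n2
    pooled = sorted(list(tumour) + list(normal))

    # Assign midranks: tied values share the average of their 1-based positions.
    rank_of = {}
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j < n and pooled[j] == pooled[i]:
            j += 1
        t = j - i
        rank_of[pooled[i]] = (i + 1 + j) / 2
        tie_term += t**3 - t
        i = j

    # U statistic for the tumour group and its null mean and variance.
    u1 = sum(rank_of[v] for v in tumour) - n1 * (n1 + 1) / 2
    mu = n1 * n2 / 2
    var = n1 * n2 / 12 * ((n + 1) - tie_term / (n * (n - 1)))
    if var == 0:  # every value identical, so there is no evidence of a difference
        return 1.0
    z = (u1 - mu) / sqrt(var)
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Made-up beta values for one probe.
tumour_betas = [0.80, 0.90, 0.85, 0.70, 0.95]
normal_betas = [0.10, 0.20, 0.15, 0.30, 0.25]

delta_beta = mean(tumour_betas) - mean(normal_betas)
p_value = rank_sum_p_value(tumour_betas, normal_betas)
print(f"delta beta = {delta_beta:.2f}, p = {p_value:.4f}")
```

In a real analysis this would be repeated for every probe, with β values collected across all tumour and all normal files.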

It is generally not a good idea to perform analyses in Excel®.
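As one alternative to Excel, the files can be read programmatically. The sketch below is in Python and assumes the level 3 files are tab-delimited with the column names listed in the question ("Composite Element REF" and "Beta_Value") and that missing values are written as "NA"; those assumptions, the function name `read_beta_values`, and the example β values are illustrative rather than taken from the answer.

```python
import csv
import io


def read_beta_values(handle):
    """Return {probe_id: beta} from an open tab-delimited level 3 file."""
    betas = {}
    for row in csv.DictReader(handle, delimiter="\t"):
        raw = row["Beta_Value"]
        if raw in ("", "NA"):  # skip probes without a usable beta value
            continue
        betas[row["Composite Element REF"]] = float(raw)
    return betas


# Small in-memory stand-in for a real file; the values are made up.
example = io.StringIO(
    "Composite Element REF\tBeta_Value\tChromosome\n"
    "cg00000029\t0.25\tchr16\n"
    "cg00000108\tNA\tchr3\n"
)
betas = read_beta_values(example)
print(betas)
```

For a file on disk, the same function can be called as `read_beta_values(open(path, newline=""))`, where `path` is the location of one downloaded file.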

