Question: NAs in TCGA methylation 450k beta data
17 months ago by
New York, USA
I have a question about TCGA methylation 450K data.

When you look at the TCGA methylation beta values,

Level 2 data has all the values, but I found many Level 3 probes have NAs (e.g., cg00000108, cg00000109, etc).

Level 2

Composite Element REF   Methylated_Intensity   Unmethylated_Intensity   Detection_P_value
cg00000029              2488.00579881129       2281.3142892634          0
cg00000108              8943.62421381116       336.745081332759         0
cg00000109              3827.0493383932        219.47270455192          0
cg00000165              263.820225926362       2355.4623873349          0
cg00000236              3733.92206994152       722.124674419151         0

Level 3

Composite Element REF   Beta_value          Gene_Symbol   Chromosome   Genomic_Coordinate
cg00000029              0.521668865344633   RBL2          16           53468112
cg00000108              NA                  C3orf35       3            37459206
cg00000109              NA                  FNDC3B        3            171916037
cg00000165              0.100722321673368                 1            91194674
cg00000236              0.837944995677383   VDAC3         8            42263294
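
As a sanity check on how the two levels relate, here is a minimal sketch that recomputes beta from the Level 2 intensities, assuming beta = M / (M + U) with no offset. That formula is an assumption on my part, not documented TCGA processing, but it does reproduce the three non-NA Level 3 values above. The names `LEVEL2` and `beta_value` are just for illustration.

```python
# Level 2 rows copied from the post: (probe, methylated, unmethylated)
LEVEL2 = [
    ("cg00000029", 2488.00579881129, 2281.3142892634),
    ("cg00000108", 8943.62421381116, 336.745081332759),
    ("cg00000109", 3827.0493383932, 219.47270455192),
    ("cg00000165", 263.820225926362, 2355.4623873349),
    ("cg00000236", 3733.92206994152, 722.124674419151),
]


def beta_value(methylated, unmethylated):
    """Assumed formula: beta = M / (M + U). Returns None if M + U is zero."""
    total = methylated + unmethylated
    if total == 0:
        return None
    return methylated / total


# Recompute a beta for every probe, including the two reported as NA in Level 3.
betas = {probe: beta_value(m, u) for probe, m, u in LEVEL2}

for probe, beta in betas.items():
    print(probe, round(beta, 6))
```

Under this assumption the two NA probes (cg00000108 and cg00000109) get ordinary betas of roughly 0.96 and 0.95, which would be consistent with the NAs being introduced by a masking step after the betas are computed, not by missing intensities.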

There are so many NAs and I wonder why.

I thought they were filtered out because of the detection p-value, but when I downloaded the IDAT files and calculated detection p-values myself, they were all below 0.01. So they were not filtered out because of the detection p-value.

Additionally, they are not on chrX/Y, they are not SNP probes, and they are not cross-reactive probes.

There are ~90k NAs per sample, almost 1/5 of the 450k probes.

Why are there so many NAs in the 450k methylation beta data?

And does anyone know how they normalized the data from raw IDAT files?

I searched hard but couldn't find an answer.

17 months ago by
Charles Warden
Duarte, CA
Illumina has a detection p-value.

I think this is useful, but some people either use a normalization method that doesn't produce NA values or run some sort of imputation.

If you average methylation among sites in a region / CpG island, this may be one way to get around the NA value (or run a test that filters NA values, with an understanding you may get more false positives among test results where the sample size was decreased due to NA values).
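
A minimal sketch of the region-averaging idea: average the non-NA betas within each region and keep track of how many probes actually contributed. The region labels and beta values below are toy values for illustration, not TCGA data, and `region_means` is a hypothetical helper name.

```python
from collections import defaultdict

# Toy input: (region label, beta as text, with "NA" for missing values).
ROWS = [
    ("regionA", "0.2"),
    ("regionA", "NA"),
    ("regionA", "0.4"),
    ("regionB", "NA"),
    ("regionB", "NA"),
]


def region_means(rows):
    """Return {region: (mean of non-NA betas or None, number of probes used)}."""
    values = defaultdict(list)
    for region, beta in rows:
        probes = values[region]  # registers the region even if every probe is NA
        if beta != "NA":
            probes.append(float(beta))
    return {
        region: (sum(probes) / len(probes) if probes else None, len(probes))
        for region, probes in values.items()
    }


means = region_means(ROWS)
for region, (mean, n_used) in means.items():
    print(region, mean, n_used)
```

Reporting the number of probes used alongside the mean makes it easier to spot regions whose average rests on only one or two surviving probes.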

There were some TCGA samples with a high percentage of missing values, and I would recommend removing those.
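
For the sample-level filter, something along these lines would do. The 0.25 cutoff is an arbitrary placeholder, not a recommended or TCGA-defined threshold, and the sample names and values are toy data.

```python
def missing_fraction(betas):
    """Fraction of probes whose beta is None (i.e. NA) in one sample."""
    if not betas:
        return 0.0
    return sum(b is None for b in betas) / len(betas)


def split_by_missingness(samples, max_missing=0.25):
    """Split samples into (kept, dropped) using a hypothetical NA-fraction cutoff."""
    kept, dropped = {}, {}
    for name, betas in samples.items():
        target = kept if missing_fraction(betas) <= max_missing else dropped
        target[name] = betas
    return kept, dropped


# Toy data: one list of betas per sample, None standing in for NA.
SAMPLES = {
    "sample_1": [0.52, None, 0.10, 0.84, 0.33],  # 1 of 5 missing
    "sample_2": [None, None, None, 0.84, 0.33],  # 3 of 5 missing
}

kept, dropped = split_by_missingness(SAMPLES)
print("kept:", sorted(kept))
print("dropped:", sorted(dropped))
```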

Otherwise, I think most should have a much lower fraction of missing values. For example, the typo will be part of a corrigendum (which I think will be out this month), but I believe I only removed one breast cancer TCGA sample due to concerns about the high frequency of missing values:

For some things, I agree that getting access to .idat files may be useful, but you will probably still want to filter out some samples with relatively high frequencies of NA values.

I already checked the detection p-values (as mentioned in the question). The NAs were not produced by detection p-value filtering, and the problem is not limited to a few samples: every sample in the TCGA COAD methylation beta values has ~90k NAs, about 1/5 of the 450k probes.

I think I've only checked the missing values for the BRCA 450k data.

It's possible that some batches had bigger problems than others, but it is hard for me to say for certain.

Are the missing probes random, or do certain probes tend to be missing more often than others?
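
One way to check this is to compute, for each probe, the fraction of samples in which it is NA. The sketch below uses toy values and a hypothetical helper name (`probe_missing_rates`). A rate of 1.0 would mean the probe is masked in every sample, which points toward a fixed probe list and away from per-sample quality problems.

```python
from collections import Counter


def probe_missing_rates(samples):
    """Return {probe: fraction of samples in which that probe is None (NA)}."""
    n_samples = len(samples)
    na_counts = Counter()
    probes = set()
    for betas in samples.values():
        for probe, beta in betas.items():
            probes.add(probe)
            if beta is None:
                na_counts[probe] += 1
    return {probe: na_counts[probe] / n_samples for probe in sorted(probes)}


# Toy data: {sample: {probe: beta or None}}; the values are illustrative only.
SAMPLES = {
    "sample_1": {"cg00000029": 0.52, "cg00000108": None, "cg00000236": 0.84},
    "sample_2": {"cg00000029": 0.50, "cg00000108": None, "cg00000236": None},
}

rates = probe_missing_rates(SAMPLES)
always_missing = [probe for probe, rate in rates.items() if rate == 1.0]
print(rates)
print("NA in every sample:", always_missing)
```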

If I look at APC in the Xena Browser for GDC TCGA Colon Cancer (COAD) (or TCGA Colon Cancer (COAD)), a very large portion of the 450k arrays are flagged as missing. I think that is actually closer to 1/3 than 1/5 of samples missing or filtered, but I don't know whether a noticeable number of samples simply didn't have 450k arrays for that cancer type (although I think that could be determined through the GDC). I don't think TCGA applies quality filtering to what they make available (so other people can make their own assessments).

So, with a quick assessment, I think that could match what you are saying about needing to remove a lot of COAD 450k arrays due to quality filters.

I also apologize that I am not answering your question about the alternative causes for the missing probes, but I think this is a good question.

I also asked GDC about this, but they didn't know the details either. I'm guessing some kind of filtering was applied, but I don't know exactly what it was. It remains a mystery.

