Question: TCGA: Discrepancy in number of mutation records based on data acquisition method?
11 months ago by
sc10 wrote:

Hi all,

Why do I get a different number of records for a gene mutation depending on which method I use to obtain the data?

For example, if I wanted to know which patients had a KRAS mutation in the LUAD dataset from TCGA:

1) I could use GDCquery_Maf, as below, which gives me 117 records.

library(TCGAbiolinks)
maf <- GDCquery_Maf("LUAD", pipelines = "muse")
maf_kras <- maf[maf$Hugo_Symbol == "KRAS", ]
nrow(maf_kras)  # number of KRAS records

[1] 117

2) Using the MAF file from the analysis done at: tells me there are 161 mutated samples.

library(maftools)
maf2 <- read.maf("./LUAD-TP.final_analysis_set.maf.txt")
gene_summary <- getGeneSummary(maf2)
gene_summary <- gene_summary[which(gene_summary$Hugo_Symbol == "KRAS"),]
gene_summary$MutatedSamples  # number of samples with a KRAS mutation

[1] 161

And all 161 of these samples have a unique patient ID if we check the first 12 characters of the barcode (the patient ID):

kras_mutant_barcodes <- genesToBarcodes(maf = maf2, genes = "KRAS", justNames = TRUE)
kras_mutant_barcodes <- substr(unique(as.character(unlist(kras_mutant_barcodes))), start = 1, stop = 12)
length(unique(kras_mutant_barcodes))  # number of distinct patient IDs

[1] 161

Additionally, is it correct to assume that if a patient does not have a KRAS mutation record then they are considered to be a non-mutant?


11 months ago by
Kevin Blighe59k wrote:

While this may surprise first-time users of these programs, it is not surprising to people like me who have already processed much of the TCGA data. TCGAbiolinks and GDAC (Broad Institute) can both be regarded as third parties with respect to housing TCGA data. Each will have pulled data from the GDC (Genomic Data Commons) at a specific time-point and processed/filtered it in its own way. Keep in mind, too, that the data at the GDC has been changing/updating over the past few years. Finding the exact reasons behind the discrepancy may therefore prove a futile exercise.
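
Even if the root cause stays unknown, it can help to see which patients each source reports. Here is a rough sketch in base R, using two small toy tables in place of the real MAFs; the barcodes are invented and patients_with_gene is a hypothetical helper, not part of TCGAbiolinks or maftools:

```r
# Toy stand-ins for the two MAF tables (invented barcodes, for illustration only).
# In practice these would be the tables obtained in the question.
maf_gdc <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "TP53", "KRAS"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0003-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)
maf_broad <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "KRAS"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0002-01A-11D-0000-08",
    "TCGA-AA-0004-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: unique 12-character patient IDs carrying a record for `gene`
patients_with_gene <- function(maf, gene) {
  barcodes <- maf$Tumor_Sample_Barcode[maf$Hugo_Symbol == gene]
  unique(substr(as.character(barcodes), 1, 12))
}

gdc_patients   <- patients_with_gene(maf_gdc, "KRAS")
broad_patients <- patients_with_gene(maf_broad, "KRAS")

# Where do the two sources agree, and where do they differ?
shared     <- intersect(gdc_patients, broad_patients)
only_gdc   <- setdiff(gdc_patients, broad_patients)
only_broad <- setdiff(broad_patients, gdc_patients)
```

Comparing at the patient level puts both sources in the same unit, since a count of MAF rows and a count of mutated samples are not directly comparable.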

Whenever I need TCGA data, I take it directly from the GDC, avoid the use of any third party, and time stamp the download. MAF files are Level 3 (open access) at the GDC, but there may be more than one for a particular cancer, reflecting the fact that the sequencing and data processing were performed at different centres. Also, the same sample may have been sequenced at 2 or more centres; keep this in mind.
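
As a minimal sketch of both habits, assuming the MAF is already loaded as a data frame: the toy table, the invented barcodes, and the file name pattern below are all placeholders for illustration, not anything prescribed by the GDC.

```r
# Toy MAF table with invented barcodes; the first patient appears under
# two different sample barcodes, as might happen with multiple centres.
maf <- data.frame(
  Hugo_Symbol = c("KRAS", "KRAS", "KRAS"),
  Tumor_Sample_Barcode = c(
    "TCGA-AA-0001-01A-11D-0000-08",
    "TCGA-AA-0001-01A-11D-0000-09",
    "TCGA-AA-0002-01A-11D-0000-08"
  ),
  stringsAsFactors = FALSE
)

# Time stamp the saved copy so the download date travels with the data
# (tempdir() is used here only to keep the example self-contained).
stamp    <- format(Sys.time(), "%Y%m%d_%H%M%S")
out_file <- file.path(tempdir(), paste0("LUAD_maf_", stamp, ".rds"))
saveRDS(maf, out_file)

# Flag patients that appear under more than one distinct sample barcode
maf$patient <- substr(maf$Tumor_Sample_Barcode, 1, 12)
samples_per_patient <- tapply(
  maf$Tumor_Sample_Barcode, maf$patient,
  function(x) length(unique(x))
)
multi_sample_patients <- names(samples_per_patient)[as.vector(samples_per_patient) > 1]
```

Patients flagged this way are worth inspecting by hand before counting, since their records could otherwise be counted once per sample instead of once per patient.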

"Additionally, is it correct to assume that if a patient does not have a KRAS mutation record then they are considered to be a non-mutant?"

Possibly, but the depth of coverage may have been low over the region in one or more samples, so nothing was called. You would have to obtain the original BAM files to get the complete picture.
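
To make that distinction concrete, here is a rough sketch that keeps "no record" separate from "wild-type". Everything in it is hypothetical: the patient IDs are invented, depth_at_locus stands in for per-patient coverage that you would have to compute from the BAM files yourself, and the threshold of 20 reads is an arbitrary placeholder rather than a recommended cut-off.

```r
# Hypothetical cohort and inputs (all values invented for illustration)
all_patients   <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003")
kras_patients  <- c("TCGA-AA-0001")   # patients with a KRAS record in the MAF
depth_at_locus <- c(120, 85, 4)       # assumed coverage over KRAS, derived from BAMs
min_depth      <- 20                  # arbitrary placeholder threshold

# Three states instead of two: a missing record only suggests wild-type
# when the region was covered well enough for a variant to be called.
status <- ifelse(
  all_patients %in% kras_patients, "mutant",
  ifelse(depth_at_locus >= min_depth, "likely_wild_type", "undetermined")
)
names(status) <- all_patients
```

Without the coverage information, the safest label for a patient lacking a record is "no call recorded" rather than non-mutant.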



Hi Kevin,

Thanks for the detailed clarification, much appreciated!

written 11 months ago by sc10