Why do I get a different number of records for a gene mutation depending on which method I use to obtain the data?
For example, if I wanted to know which patients had a KRAS mutation in the LUAD dataset from TCGA:
1) I could use GDCquery_Maf from TCGAbiolinks with the MuSE pipeline, as below, which gives me 117 records.
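    library(TCGAbiolinks)
    library(maftools)

    # Download the LUAD MAF produced by the MuSE pipeline
    maf <- GDCquery_Maf("LUAD", pipelines = "muse")

    # Keep only the KRAS records and count the rows
    maf_kras <- maf[which(maf$Hugo_Symbol == 'KRAS'),]
    length(rownames(maf_kras))
    # 117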
2) The MAF file from the analysis done at: http://gdac.broadinstitute.org/runs/analyses__2016_01_28/reports/cancer/LUAD-TP/MutSigNozzleReport2CV/nozzle.html tells me there are 161 mutated samples.
    # Read the Firehose MAF and look up the KRAS row of the gene summary
    maf2 <- read.maf("./LUAD-TP.final_analysis_set.maf.txt")
    gene_summary <- getGeneSummary(maf2)
    gene_summary <- gene_summary[which(gene_summary$Hugo_Symbol == "KRAS"),]
    gene_summary$MutatedSamples
    # 161
And all 161 of these samples belong to distinct patients, if we take the first 12 characters of each barcode as the patient ID:
    # Barcodes of KRAS-mutated samples, truncated to the 12-character patient ID
    kras_mutant_barcodes <- genesToBarcodes(maf = maf2, genes = "KRAS", justNames = TRUE)
    kras_mutant_barcodes <- substr(unique(as.character(unlist(kras_mutant_barcodes))), start = 1, stop = 12)
    length(unique(kras_mutant_barcodes))
    # 161
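One thing I noticed is that the 117 in approach 1 counts rows (records), while the 161 in approach 2 counts samples, so the two numbers are not measuring the same thing. A minimal sketch of how I could put both on the same footing and see which patients differ is below. The barcodes here are made up for illustration; with the real data, I assume the first vector would come from `maf_kras$Tumor_Sample_Barcode` and the second from `kras_mutant_barcodes` above.

```r
# Toy barcodes, made up for illustration only.
# Real data (assumption): gdc_barcodes would be maf_kras$Tumor_Sample_Barcode,
# firehose_patients would be kras_mutant_barcodes from the code above.
gdc_barcodes <- c("TCGA-AA-0001-01A-11D-0000-08",
                  "TCGA-AA-0001-01A-11D-0000-08",  # same sample, second record
                  "TCGA-AA-0002-01A-11D-0000-08")
firehose_patients <- c("TCGA-AA-0001", "TCGA-AA-0002", "TCGA-AA-0003")

# Collapse the records to unique 12-character patient IDs
gdc_patients <- unique(substr(gdc_barcodes, start = 1, stop = 12))

n_records  <- length(gdc_barcodes)   # number of rows: 3
n_patients <- length(gdc_patients)   # number of distinct patients: 2

# Patients reported by one source but not the other
only_firehose <- setdiff(firehose_patients, gdc_patients)  # "TCGA-AA-0003"
only_gdc      <- setdiff(gdc_patients, firehose_patients)  # character(0)
```

On the real data, `only_firehose` and `only_gdc` would list the patients that account for the gap between the two sources.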
Additionally, is it correct to assume that if a patient does not have a KRAS mutation record then they are considered to be a non-mutant?