Different sample size between TCGA portal and TCGAbiolinks package
6 months ago
tyasird ▴ 10

I was retrieving mutation data from the TCGA data portal and with TCGAbiolinks, and I noticed that the sample sizes are not the same.

For instance, for TCGA-OV the TCGA data portal shows 419 cases, whereas TCGAbiolinks shows 462 samples. The file count is the same in both: 482.

So why are the numbers different?

This is my query in the TCGA data portal:

cases.project.project_id in ["TCGA-OV"] and files.analysis.workflow_type in ["Aliquot Ensemble Somatic Variant Merging and Masking"] and files.data_category in ["Simple Nucleotide Variation"] and files.data_type in ["Masked Somatic Mutation"]

This is the same query in the TCGAbiolinks package:

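# Load the packages used below
library(TCGAbiolinks)
library(maftools)

# Query open-access masked somatic mutation (MAF) files for TCGA-OV
query <- GDCquery(
  project = "TCGA-OV",
  data.category = "Simple Nucleotide Variation",
  access = "open",
  data.type = "Masked Somatic Mutation",
  workflow.type = "Aliquot Ensemble Somatic Variant Merging and Masking"
)

# Download the files and read them into a single MAF table
GDCdownload(query)
maf <- GDCprepare(query)
mafr <- read.maf(maf)
mutations <- mafSummary(mafr)

# Print the number of samples reported by maftools
print(as.numeric(mafr@summary[mafr@summary$ID == "Samples"]$summary))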


mutation tcga tcgabiolinks • 872 views

You're comparing samples to cases. Can you check aliquot counts in both cases?


I thought 482 files meant 482 aliquots. Isn't that the case? Or, to ask it another way: how can I find the number of samples for a given TCGA query in the portal? This is the query link: TCGA-OV


I'm not entirely sure that num_files would equal num_aliquots. Please try and dig deeper to check if that's the case. I apologize, but I don't have the time to do a TCGA deep dive right now.

6 months ago
Zhenyu Zhang ★ 1.2k

In the GDC query, you got 419 cases and 482 files (likely 482 aliquots). In the TCGAbiolinks query, you got 462 samples. Cases, samples, and files are different units, so you are comparing apples to oranges.


When I open the 419 cases, I see that it shows 418 females. Doesn't that mean there are 418 samples? If not, how can I get the number of samples from the TCGA portal for this query (TCGA-OV)?


Most of the cases in GDC have at least one tumor sample and one normal sample, and some have additional tumor samples, such as metastatic or new primary tumors. So the case count is not the sample count.
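To make the case/sample/aliquot distinction concrete, here is a minimal sketch that counts each level from TCGA aliquot barcodes. It relies on the standard barcode layout, where the first 12 characters identify the case and the first 16 identify the sample (including the vial letter). The barcodes and the helper name `count_levels` are made up for illustration and are not real TCGA-OV data.

```r
# Hypothetical aliquot barcodes, invented for illustration only.
barcodes <- c(
  "TCGA-AA-0001-01A-01D-0001-09",  # case 0001, primary tumor (01), aliquot 1
  "TCGA-AA-0001-01A-01D-0002-09",  # same sample, a second aliquot
  "TCGA-AA-0001-02A-01D-0003-09",  # same case, recurrent tumor sample (02)
  "TCGA-AA-0002-01A-01D-0004-09"   # a second case
)

# count_levels is a hypothetical helper: it counts unique cases, samples,
# and aliquots by truncating each barcode to the relevant prefix.
count_levels <- function(barcodes) {
  c(
    cases    = length(unique(substr(barcodes, 1, 12))),
    samples  = length(unique(substr(barcodes, 1, 16))),
    aliquots = length(unique(barcodes))
  )
}

counts <- count_levels(barcodes)
print(counts)
```

If the table returned by `GDCprepare(query)` carries full aliquot barcodes in the standard MAF column `Tumor_Sample_Barcode` (an assumption worth checking on your data), the same helper could be applied as `count_levels(maf$Tumor_Sample_Barcode)` to see all three counts side by side.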

In the link you have, there are only a case tab and a file tab. There is no summary tab for samples. If you really want the sample count, you can learn the GDC API, or add all the files to the cart and download the sample sheet from the cart.
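As a rough sketch of the sample-sheet route, the snippet below counts files, cases, and samples from a small inline table. The column names (`File ID`, `Case ID`, `Sample ID`, `Sample Type`) and the comma-separated tumor/normal pairs are assumptions about what the cart export looks like for MAF files, and the rows are invented; check them against the sheet you actually download. The helper `split_ids` is hypothetical.

```r
# Hypothetical sample-sheet content; column names and layout are assumed.
sheet_text <- paste(
  "File ID\tCase ID\tSample ID\tSample Type",
  "f1\tTCGA-AA-0001, TCGA-AA-0001\tTCGA-AA-0001-01A, TCGA-AA-0001-10A\tPrimary Tumor, Blood Derived Normal",
  "f2\tTCGA-AA-0001, TCGA-AA-0001\tTCGA-AA-0001-02A, TCGA-AA-0001-10A\tRecurrent Tumor, Blood Derived Normal",
  "f3\tTCGA-AA-0002, TCGA-AA-0002\tTCGA-AA-0002-01A, TCGA-AA-0002-10A\tPrimary Tumor, Blood Derived Normal",
  sep = "\n"
)

# For a real download, replace text = sheet_text with the path to the TSV file.
sheet <- read.delim(text = sheet_text, check.names = FALSE,
                    stringsAsFactors = FALSE)

# Split comma-separated IDs, trim whitespace, and keep unique values.
split_ids <- function(x) unique(trimws(unlist(strsplit(x, ","))))

sample_ids <- split_ids(sheet[["Sample ID"]])

n_files   <- nrow(sheet)
n_cases   <- length(split_ids(sheet[["Case ID"]]))
n_samples <- length(sample_ids)

# Sample-type codes 01-09 (barcode characters 14-15) denote tumor samples.
n_tumor_samples <- sum(as.integer(substr(sample_ids, 14, 15)) < 10)

print(c(files = n_files, cases = n_cases,
        samples = n_samples, tumor_samples = n_tumor_samples))
```

Counting tumor samples separately matters here because each file in this workflow pairs a tumor aliquot with a normal one, so the total number of distinct sample IDs will generally exceed the number of tumor samples that a MAF summary reports.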
