Question: Mismatch in TCGA ovarian cancer sample numbers between recount and the GDC portal
12 months ago by
Brawni110 wrote:

Hello there!

I have been trying to understand what is going on with the TCGA ovarian RNA-seq dataset, as I get a different number of samples from the recount R package than from the GDC web portal. To be more specific, I get 430 tumour samples using recount (so no normal samples) and 379 samples from the GDC, some of which are normals, but I could not retrieve the biospecimen data for that sample set. Also, is the GDC portal only superficially user-friendly and in fact quite confusing?

Any clues?

Thanks a lot.

tcga recount gdc ovary • 407 views
modified 12 months ago • written 12 months ago by Brawni110

Actually, I find that the GDC Data Portal is 'better' than the third-party resources such as TCGAbiolinks and recount. By obtaining the data directly from the GDC, one has more flexibility in analysing the data. Mismatches between these third-party resources and the original data are always present. Can you show the commands that you used to obtain the data via recount? I looked just now and can see 380 HTSeq raw count samples on the GDC Data Portal for ovarian cancer - DIRECT LINK.

Edit: the discrepancy may arise because recount used the GDC Legacy data, where I can see 430 RNA-seq samples.
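One way to check where the two sources actually differ is to compare the sample barcodes themselves rather than only the totals. The following is a minimal Python sketch, assuming you have already exported one list of TCGA barcodes from each source. The function names and the example barcodes are hypothetical, and truncating to the first 16 characters (project, site, participant, sample type and vial) is an assumption about the level at which you want to match.

```python
def sample_key(barcode):
    """Truncate a TCGA barcode to its sample-level prefix (first 16 characters)."""
    return barcode.strip().upper()[:16]


def compare_sample_sets(source_a, source_b):
    """Return the sample keys shared by both sources and those unique to each."""
    keys_a = {sample_key(barcode) for barcode in source_a}
    keys_b = {sample_key(barcode) for barcode in source_b}
    return {
        "shared": sorted(keys_a & keys_b),
        "only_a": sorted(keys_a - keys_b),
        "only_b": sorted(keys_b - keys_a),
    }


# Hypothetical barcodes, for illustration only (not real query results).
recount_ids = ["TCGA-04-1331-01A-01R-1569-13", "TCGA-04-1332-01A-01R-1564-13"]
gdc_ids = ["TCGA-04-1331-01A-01R-1569-13"]

result = compare_sample_sets(recount_ids, gdc_ids)
print("Only in first source:", result["only_a"])
```

The "only_a" and "only_b" lists then show which samples account for a difference such as 430 vs 379.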

modified 12 months ago • written 12 months ago by Kevin Blighe53k

Sorry, the samples on the GDC should actually be 379 belonging to TCGA-OV and 1 to TCGA-SARC. Ah! That may explain things. I downloaded the full TCGA project because I could not find how to specify a single study within the project with the recount package: download_study(project = 'TCGA', type = "rse-gene", outdir = '', download = TRUE, version = 2)

written 12 months ago by Brawni110

I used recount2 because they have put together many RNA-seq studies using the same pipeline, which is good if you want to compare studies (taking account of possible batch effects). But given that I am going to compare studies within the TCGA project, I may as well get them from the GDC portal, couldn't I?

Thanks again, very helpful

written 12 months ago by Brawni110

Depending on your time constraints, it may be easier for you to continue using recount2. The mismatch in numbers between it and the GDC should not be an issue: all that you must do is quote the version of recount that you used and the sample n in the dataset that was downloaded. With hundreds of thousands of files across thousands of samples and > 1 petabyte of data, some inconsistency between the TCGA data and the third-party providers is to be expected.

Although I prefer to go to the GDC directly, learning how to obtain the data and use it can be cumbersome. For example, to download it, you need to obtain a file manifest and then download the files listed in it using the GDC Data Transfer Tool. You would then have to read each file separately into, for example, R.
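The "read each file separately" step can be sketched as follows, here in Python rather than R. It rests on several assumptions you should verify against your own download: that the manifest is tab-separated with "id" and "filename" columns, that the Data Transfer Tool saved each file under a sub-directory named after its file id, and that each count file has two tab-separated columns (gene ID, count) with summary rows prefixed by "__", as in HTSeq output. The function names are hypothetical.

```python
import csv
import gzip
from pathlib import Path


def read_manifest(manifest_path):
    """Return (file_id, filename) pairs from a tab-separated manifest."""
    with open(manifest_path, newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [(row["id"], row["filename"]) for row in reader]


def read_counts(path):
    """Read one two-column (gene, count) file, skipping '__' summary rows."""
    opener = gzip.open if str(path).endswith(".gz") else open
    counts = {}
    with opener(path, "rt") as handle:
        for line in handle:
            if not line.strip():
                continue  # ignore blank lines
            gene, value = line.rstrip("\n").split("\t")
            if not gene.startswith("__"):
                counts[gene] = int(value)
    return counts


def build_count_matrix(manifest_path, download_dir):
    """Map each gene to {file_id: count} across all files in the manifest."""
    matrix = {}
    for file_id, filename in read_manifest(manifest_path):
        # Assumed layout: <download_dir>/<file_id>/<filename>
        path = Path(download_dir) / file_id / filename
        for gene, value in read_counts(path).items():
            matrix.setdefault(gene, {})[file_id] = value
    return matrix
```

The columns here are keyed by file id, so you would still need the GDC sample sheet or metadata to map each file id back to a TCGA sample barcode.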

Another place you may look is Xena, which has ready-made expression (and other) datasets.

modified 12 months ago • written 12 months ago by Kevin Blighe53k

Yes, it's reasonable to have discrepancies. Unfortunately, I'm interested in acquiring normal samples from ovary and other TCGA studies, but will ultimately also use the tumour data, so GTEx, for example, would not be a solution. I guess I will have to go for the cumbersome option. In any case, does 'Legacy' TCGA stand for TCGA data processed by TCGA and not by the GDC? And is the plausible reason for this mismatch (379 vs 430) that samples were removed after quality control, or something else?

Thanks a million

written 12 months ago by Brawni110

My memory is coming back to me because I had relatively recently processed the TCGA-OV data. Indeed, I recall that there is either zero or a minimal number of normal RNA-seq samples, which is problematic for you of course. The processing of normal RNA-seq samples did not appear to be a priority for the TCGA project.
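To check how many normal samples a given download really contains, you can read the sample-type code embedded in each TCGA barcode: in the fourth hyphen-separated field, codes 01-09 denote tumour samples, 10-19 normal samples and 20-29 controls. A minimal Python sketch, with hypothetical function names and made-up example barcodes:

```python
from collections import Counter


def sample_category(barcode):
    """Classify a TCGA barcode as tumor, normal or control from its sample-type code."""
    fields = barcode.strip().split("-")
    if len(fields) < 4 or not fields[3][:2].isdigit():
        raise ValueError(f"no sample-type code in barcode: {barcode!r}")
    code = int(fields[3][:2])
    if 1 <= code <= 9:
        return "tumor"
    if 10 <= code <= 19:
        return "normal"
    if 20 <= code <= 29:
        return "control"
    return "unknown"


def count_categories(barcodes):
    """Count how many barcodes fall into each category."""
    return Counter(sample_category(barcode) for barcode in barcodes)


# Hypothetical barcodes, for illustration only.
barcodes = [
    "TCGA-04-1331-01A-01R-1569-13",  # 01 = primary solid tumor
    "TCGA-04-1332-02A-01R-1564-13",  # 02 = recurrent solid tumor
    "TCGA-04-1333-11A-01R-1565-13",  # 11 = solid tissue normal
]
print(count_categories(barcodes))
```

Running this over the barcodes from recount and from the GDC separately would show whether either source contains any normals at all for TCGA-OV.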

The GDC (Genomic Data Commons) is just an interface for sifting through the data that is held for the TCGA project. You would have to read through the notes to fully understand the differences between the GDC and the 'GDC Legacy'. Essentially, data in the GDC is being re-processed (or 'harmonised') using, for example, the updated reference genome. Here is what they say:

Data in the GDC Data Portal has been harmonized using GDC Bioinformatics Pipelines whereas data in the GDC Legacy Archive is an unmodified copy of data that was previously stored in CGHub and in the TCGA Data Portal hosted by the TCGA Data Coordinating Center (DCC).

Certain previously available data types and formats are not currently supported by the GDC Data Portal and are only distributed via the GDC Legacy Archive.


Many of the third-party services, like recount2, may use GDC Legacy data because these tools were designed before this major update to the TCGA data took place.

I honestly don't know the specific reasons why there are differences in numbers between the GDC Legacy and the [current] GDC, though. It could be because, in the legacy data, many samples were processed on different instruments, though there are likely other reasons too. As an example: some cancers in TCGA have expression data derived from both RNA-seq and cDNA microarray. On the DNA side, different variant callers were used at the different centers that took part in the project.

If you are keen to get your teeth into the TCGA data, then all that I can say is that a lot of patience is needed to wrap your head around all of the terms and the messy threads that exist online that describe this data. Many links are now outdated, for example, and data has been shifted around.

Where normal RNA-seq samples are available from the TCGA, I have already re-processed the data locally and performed a T-N paired differential expression analysis for the main cancers. I have the results files on my disk.

modified 12 months ago • written 12 months ago by Kevin Blighe53k

Hi Kevin, I have been hovering around these data for the last few months and yes, I understand that patience is a necessary requirement, which I don't have much of. I also looked at this web tool for investigating batch effects, developed by the MD Anderson Cancer Center. Pretty handy, although again for the ovarian cohort they don't report any normal samples, this time neither for the TCGA nor for the GDC run. Thanks for the kind offer concerning the T-N DEG results, but we will not make use of DE analysis in our project. OK, I will try to dive into it and see if I can get my hands on those few normal samples. I will also try Xena as you suggested.

Thank you so much for your support!!


written 12 months ago by Brawni110