Download TCGA and GTEX data from Xena toilHub (all genes, but for only one cancer/tissue type)
11 months ago
erica.fary • 0

Dear All,

I would like to download TCGA and GTEX gene expression data for ovarian cancer and ovary, respectively, from the Xena toilHub platform (all genes; RSEM expected counts). However, I only found this web page (link), which links to the full dataset (samples for all cancer/tissue types). When I try to process this file to select only the patients/samples of interest, my computer crashes because the file is too large.

Is there a way to download from toilHub only the expression data for the samples that correspond to "TCGA Ovarian Serous Cystadenocarcinoma" (TCGA) or "Ovary" (GTEX)?

I have also tried the R package "UCSCXenaTools", but without success. I would be interested if anyone could provide both a website download solution and an R package download solution.

I have already spent half a day on this, so I would be very grateful if anyone could help!

Side question: the dataset with all samples that one can see here holds expression values for more than 60,000 Ensembl gene identifiers. How is that possible (different transcripts for the same gene?)? What is the best way to map/aggregate the expression values to unique gene symbols (e.g. HGNC gene symbols)?

Thanks a lot

GTEX UCSCXenaTools TCGA RNA-seq Xena • 560 views
11 months ago
dsull ★ 3.2k

A few suggestions:

  1. If you're loading the entire dataset into an R dataframe, use a server with more memory
  2. If you're only using your computer, look into using shell scripts (e.g. awk) to select your samples of interest rather than using R to read in the entire dataset.
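To illustrate the second suggestion, here is a rough sketch of the same idea in Python instead of awk: stream the matrix one line at a time and keep only the columns for your samples, so the whole file never has to fit in memory. I am assuming the file is a tab-separated matrix with gene identifiers in the first column and sample IDs in the header row, and that you already have a list of the sample IDs you want (for example, taken from the sample annotation file). The function name and the file names in the comment are made up for illustration.

```python
import csv
import gzip


def subset_columns(in_path, out_path, wanted_samples):
    """Stream a tab-separated matrix and keep only the selected sample columns.

    Assumes rows are genes, columns are samples, and column 0 is the gene ID.
    Returns the number of sample columns that were kept.
    """
    wanted = set(wanted_samples)
    # Handle both gzipped and plain-text input.
    opener = gzip.open if str(in_path).endswith(".gz") else open
    with opener(in_path, "rt", newline="") as src, \
            open(out_path, "w", newline="") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        # Always keep column 0 (gene ID), plus every wanted sample column.
        keep = [0] + [i for i, name in enumerate(header)
                      if i > 0 and name in wanted]
        writer.writerow([header[i] for i in keep])
        # Only one row is held in memory at a time.
        for row in reader:
            writer.writerow([row[i] for i in keep])
    return len(keep) - 1


# Hypothetical usage (file names are placeholders):
# subset_columns("expression_matrix.tsv.gz", "ovary_subset.tsv", my_sample_ids)
```

The resulting file should be small enough to read into R or pandas directly. Sample IDs that are not present in the header are silently ignored, so it is worth comparing the returned count against the number of samples you expected.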

Also, ~60,000 gene identifiers sounds correct to me: the annotation covers non-coding genes and pseudogenes as well as protein-coding genes. Per -- there are 20,442 coding genes, 23,982 non-coding genes, and 15,228 pseudogenes (the exact number varies between different releases).

NCBI has annotations that can help you map between Ensembl, NCBI ID, and Gene Symbol:
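Once you have such a mapping table, collapsing the rows to gene symbols could look roughly like the sketch below. It assumes you have already loaded the mapping into a dict from unversioned Ensembl gene ID to symbol (the names `strip_version`, `aggregate_by_symbol`, and `id_to_symbol` are my own, not from any package), and it sums rows that share a symbol. One caveat: if the values are stored on a log2(x+1) scale, which I believe is the case for the Toil matrices (check the dataset description), the values should be back-transformed before summing, hence the `log2_plus1` flag.

```python
import math


def strip_version(ensembl_id):
    """Drop the version suffix, e.g. 'ENSG00000000003.14' -> 'ENSG00000000003'."""
    return ensembl_id.split(".", 1)[0]


def aggregate_by_symbol(rows, id_to_symbol, log2_plus1=False):
    """Sum expression rows that map to the same gene symbol.

    rows: iterable of (ensembl_gene_id, list_of_values) pairs.
    id_to_symbol: dict from unversioned Ensembl gene ID to gene symbol.
    log2_plus1: set True if values are log2(x + 1); they are then converted
        to the linear scale, summed, and converted back.
    Rows whose ID is missing from the mapping are skipped.
    """
    sums = {}
    for gene_id, values in rows:
        symbol = id_to_symbol.get(strip_version(gene_id))
        if symbol is None:
            continue
        if log2_plus1:
            linear = [2.0 ** float(v) - 1.0 for v in values]
        else:
            linear = [float(v) for v in values]
        if symbol in sums:
            sums[symbol] = [a + b for a, b in zip(sums[symbol], linear)]
        else:
            sums[symbol] = linear
    if log2_plus1:
        return {s: [math.log2(v + 1.0) for v in vals]
                for s, vals in sums.items()}
    return sums
```

Summing is only one possible choice; keeping the row with the highest mean expression per symbol is another common option, and which is appropriate depends on the downstream analysis.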


