Download TCGA and GTEx data from Xena toilHub (all genes, but for a single cancer/tissue type)
2.6 years ago
erica.fary ▴ 20

Dear All,

I would like to download TCGA and GTEx gene expression data for ovarian cancer and ovary, respectively, from the Xena toilHub platform (all genes; RSEM expected counts). However, I only found this web page (link), which offers the full dataset (samples for all cancer/tissue types). When I try to process this file to select only the patients/samples of interest, my computer crashes (the file is too large).

Is there a way to download from toilHub only the expression data for the samples that correspond to "TCGA Ovarian Serous Cystadenocarcinoma" (TCGA) or "Ovary" (GTEx)?

I have also tried the R package "UCSCXenaTools", but without success. So I would be interested in both a website download solution and an R package solution, if anyone can provide them.

I have already spent half a day on this, so I would be very grateful for any help!

Side question: the dataset with all samples that one can see here holds expression values for > 60,000 Ensembl gene identifiers. How is that possible (different transcripts for the same gene?)? What is the best way to map/aggregate the expression to unique gene symbols (e.g. HGNC gene symbols)?

Thanks a lot

GTEX UCSCXenaTools TCGA RNA-seq Xena • 1.2k views
2.6 years ago
dsull ★ 5.8k

A few suggestions:

  1. If you're loading the entire dataset into an R data frame, use a server with more memory.
  2. If you're limited to your own computer, look into using shell scripts (e.g. awk) to select your samples of interest rather than reading the entire dataset into R.
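
The second suggestion can also be sketched in Python instead of awk. This is a minimal sketch, not a tested recipe for the toilHub files: it assumes the expression matrix is a tab-separated file with sample IDs in the header row and gene identifiers in the first column, and that a phenotype table maps each sample ID to a category. The function names and the column names passed in are hypothetical, so check them against the headers of the files you actually downloaded. The matrix is streamed one row at a time, so memory use stays small.

```python
import csv
import gzip


def open_text(path, mode="rt"):
    """Open plain or gzip-compressed text depending on the file extension."""
    if str(path).endswith(".gz"):
        return gzip.open(path, mode, newline="")
    return open(path, mode, newline="")


def samples_matching(phenotype_path, column, values, id_column="sample"):
    """Return the sample IDs whose `column` entry is one of `values`.

    The column names are assumptions; pass whatever the phenotype
    table actually uses.
    """
    values = set(values)
    with open_text(phenotype_path) as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return {row[id_column] for row in reader if row[column] in values}


def subset_columns(matrix_path, wanted_samples, out_path):
    """Stream a genes-by-samples TSV and keep only the wanted sample columns.

    Assumes the first row is a header of sample IDs and the first column
    holds the gene identifier. Returns the number of sample columns kept.
    """
    wanted = set(wanted_samples)
    with open_text(matrix_path) as src, open_text(out_path, "wt") as dst:
        reader = csv.reader(src, delimiter="\t")
        writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
        header = next(reader)
        # Always keep column 0 (gene ID), then every wanted sample column.
        keep = [0] + [i for i, name in enumerate(header) if i > 0 and name in wanted]
        writer.writerow([header[i] for i in keep])
        for row in reader:
            writer.writerow([row[i] for i in keep])
    return len(keep) - 1
```

A call would look like `subset_columns("matrix.tsv.gz", samples_matching("phenotype.tsv", "category", {"Ovary"}), "ovary.tsv")`, where the file names and the `category` column are placeholders. The resulting file should be small enough to read into R or pandas on a laptop.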

Also, ~60,000 gene identifiers sounds correct to me. Per http://uswest.ensembl.org/Homo_sapiens/Info/Annotation, there are 20,442 coding genes, 23,982 non-coding genes, and 15,228 pseudogenes (the exact numbers vary between releases). So the count reflects non-coding genes and pseudogenes in the annotation, not multiple transcripts per gene.

NCBI has annotations that can help you map between Ensembl ID, NCBI ID, and gene symbol: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2ensembl.gz and https://ftp.ncbi.nlm.nih.gov/gene/DATA/GENE_INFO/Mammalia/Homo_sapiens.gene_info.gz
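
As a rough illustration of how those two files could be combined, here is a minimal Python sketch. It assumes both files are tab-separated with a header line starting with `#tax_id`, and that they contain columns named `GeneID`, `Ensembl_gene_identifier`, and `Symbol`; verify this against the files before relying on it. The function names are hypothetical. Ensembl IDs in the expression matrix are assumed to carry a version suffix (e.g. `.2`), which is stripped before lookup. Summing identifiers that share a symbol is one possible aggregation choice, not the only one. If the expression values are stored as log2(x + 1), which I believe is the case for the toilHub matrices but you should confirm, they need to be back-transformed before summing; the `log2p1` flag covers that case.

```python
import gzip
import math


def read_ncbi_table(path):
    """Yield rows of an NCBI gene table as dicts keyed by header name."""
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        # The header line starts with '#', e.g. '#tax_id<TAB>GeneID<TAB>...'
        header = handle.readline().lstrip("#").rstrip("\n").split("\t")
        for line in handle:
            yield dict(zip(header, line.rstrip("\n").split("\t")))


def ensembl_to_symbol(gene2ensembl_path, gene_info_path, tax_id="9606"):
    """Build {Ensembl gene ID (no version): gene symbol} for one taxon."""
    symbols = {row["GeneID"]: row["Symbol"]
               for row in read_ncbi_table(gene_info_path)
               if row["tax_id"] == tax_id}
    mapping = {}
    for row in read_ncbi_table(gene2ensembl_path):
        if row["tax_id"] == tax_id and row["GeneID"] in symbols:
            mapping[row["Ensembl_gene_identifier"]] = symbols[row["GeneID"]]
    return mapping


def sum_by_symbol(rows, mapping, log2p1=False):
    """Sum expression rows that map to the same gene symbol.

    `rows` is an iterable of (ensembl_id, list_of_values). Identifiers
    without a mapping are skipped. With log2p1=True the values are
    treated as log2(x + 1): back-transformed, summed, then re-transformed.
    """
    totals = {}
    for ensembl_id, values in rows:
        symbol = mapping.get(ensembl_id.split(".")[0])  # drop version suffix
        if symbol is None:
            continue  # no NCBI mapping for this identifier
        linear = [(2.0 ** v - 1.0) if log2p1 else float(v) for v in values]
        if symbol in totals:
            totals[symbol] = [a + b for a, b in zip(totals[symbol], linear)]
        else:
            totals[symbol] = linear
    if log2p1:
        return {s: [math.log2(x + 1.0) for x in vals] for s, vals in totals.items()}
    return totals
```

Identifiers with no entry in the NCBI tables are dropped here; depending on the analysis you may prefer to keep them under their Ensembl ID instead.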
