3.9 years ago
David_emir ▴ 390

Hi All,

I am planning to conduct a differential gene expression analysis on TCGA-COAD/READ samples. The data have now been moved to the GDC portal, and I find it difficult to handle.

My question: the GDC portal shows ~600 samples for colon under data.category = "Transcriptome Profiling", data.type = "Gene expression quantification", workflow.type = "HTSeq - FPKM-UQ". How can I download these samples as a single matrix file so that I can run a Normal vs. Tumor comparison?

Secondly, how do I download the clinical data for the samples? Every time, a .json file gets downloaded and I really don't know how to handle it. Please keep in mind that I am a biologist and not very good at programming/IT. I would appreciate it if you could share a protocol for this. Thanks a lot for your support!

Regards, Dav

GDC gene expression RNA-Seq • 3.6k views
3.9 years ago
svlachavas ▴ 750

Dear David,

First, keep in mind that it is not appropriate to run a downstream DE analysis with edgeR, DESeq2, etc. on "normalized" values such as FPKM-UQ. These tools model raw counts, so you need to download the raw counts. Second, concerning your question about downloading:

library(TCGAbiolinks)

query.exp.hg38 <- GDCquery(project = "TCGA-STAD", # example project; use "TCGA-COAD" for colon
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts") # raw counts

GDCdownload(query.exp.hg38) # download the files selected by the query
exp.hg38 <- GDCprepare(query = query.exp.hg38) # build a RangedSummarizedExperiment


## and for the clinical data:

SummarizedExperiment::colData(exp.hg38) ## contains all the clinical, phenotype and subtype-related information
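
For the Normal vs. Tumor comparison itself, the sample type can also be read off the TCGA barcode: characters 14-15 hold the sample-type code, where 01-09 are tumor types and 10-19 are normal types (for example 01 = primary solid tumor, 11 = solid tissue normal). Below is a minimal sketch in base R. The helper name `classify_tcga_sample` and the barcodes are made up for illustration; with the real object the barcodes would presumably come from `colnames(exp.hg38)`.

```r
# Hypothetical helper (not part of TCGAbiolinks): classify TCGA barcodes
# by the sample-type code stored in characters 14-15 of the barcode.
# Codes 01-09 are tumor types, codes 10-19 are normal types.
classify_tcga_sample <- function(barcodes) {
  code <- as.integer(substr(barcodes, 14, 15))
  ifelse(is.na(code), NA_character_,
         ifelse(code < 10, "Tumor",
                ifelse(code < 20, "Normal", "Other")))
}

# Made-up barcodes, for illustration only
barcodes <- c("TCGA-AA-0000-01A-01R-0000-07",
              "TCGA-AA-0000-11A-01R-0000-07")

# Factor with Normal as the reference level for a later DE design
sample_group <- factor(classify_tcga_sample(barcodes),
                       levels = c("Normal", "Tumor"))
table(sample_group)
```

Anything that is neither tumor nor normal (for example control samples) ends up as NA in the factor, so those samples would need to be dropped before building a design matrix.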


Hope that helps,

Efstathios


Thanks a lot, Efstathios, I will keep that in mind. Going further, I have successfully downloaded the files. I now have a folder "gdc_download_20171109_051908" which contains around 645 sub-files, each of them a .zip. When I extract one, I get something like:

ENSG00000242268.2 0.0
ENSG00000270112.3 0.0
ENSG00000167578.15 90864.4084112

Now I have one more problem: if I need all 645 files as a single matrix, how can I go about this? Should I manually copy and paste each file? Please help. Regards, Dav


Dear David,

Have you followed the above commands exactly?

If so, you don't have to do anything with the zip files and related stuff: you will have your RangedSummarizedExperiment ready, with the raw counts and the phenotype data.
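
That said, if you ever do end up with one two-column file per sample from a manual portal download, there is no need to copy and paste anything. The following is only a rough sketch in base R: the helper name `merge_count_files` and the toy files are hypothetical, and it assumes tab-separated files without a header (gene ID, value). Real .zip archives would have to be extracted first, for example with `unzip()`.

```r
# Hypothetical helper: combine per-sample files (gene ID, value) into one matrix.
# Assumes tab-separated files without a header, one file per sample.
# read.delim() reads plain-text and gzip-compressed (.gz) files alike.
merge_count_files <- function(files, sample_names = basename(files)) {
  tables <- lapply(files, read.delim, header = FALSE,
                   col.names = c("gene_id", "value"),
                   stringsAsFactors = FALSE)
  genes <- tables[[1]]$gene_id
  # Align every sample to the gene order of the first file
  columns <- lapply(tables, function(tab) tab$value[match(genes, tab$gene_id)])
  mat <- do.call(cbind, columns)
  dimnames(mat) <- list(genes, sample_names)
  mat
}

# Toy demonstration with two small files written to a temporary folder
demo_dir <- tempfile("gdc_demo_")
dir.create(demo_dir)
writeLines(c("ENSG00000242268.2\t0", "ENSG00000167578.15\t12"),
           file.path(demo_dir, "sampleA.txt"))
writeLines(c("ENSG00000167578.15\t7", "ENSG00000242268.2\t3"),
           file.path(demo_dir, "sampleB.txt"))

files <- list.files(demo_dir, full.names = TRUE)
counts <- merge_count_files(files, sample_names = c("sampleA", "sampleB"))
print(counts)
```

Matching on the gene ID rather than relying on row order keeps the matrix correct even if the files list the genes in a different order.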


Dear Efstathios, I have done exactly the same, except that I used TCGA-COAD, and I am using CentOS 7 as my OS.


Error in checkProjectInput(project) :
  Please set a valid project argument from the column id above. Project TCGA-COAD was not found.
In addition: Warning messages:
1: Unnamed col_types should have the same length as col_names. Using smaller of the two.
2: In rbind(names(probs), probs_f) :
  number of columns of result is not a multiple of vector length (arg 1)
3: Unknown or uninitialised column: 'project_id'.
4: Unknown or uninitialised column: 'project_id'


Which version of the R package TCGAbiolinks do you have? You probably need to install the GitHub version, after first removing any previously installed TCGAbiolinks library:

devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")


also check:

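# biocLite() has since been retired; current Bioconductor releases install via BiocManager
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GenomeInfoDbData")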


Hi Efstathios, I uninstalled and reinstalled TCGAbiolinks, and this time it actually worked. Thanks a lot. However, after executing the commands I cannot find any matrix file in the folder; it downloaded 521 files, each of which has a .zip file. The code is as follows:

library(TCGAbiolinks)

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts") # example query

exp.hg38 <- GDCprepare(query = query.exp.hg38)

SummarizedExperiment::colData(exp.hg38) # clinical data

Can I save the matrix file?

library(SummarizedExperiment) # provides assay()

exp.hg38 <- GDCprepare(query = query.exp.hg38, save = TRUE, save.filename = "exp.rda") # save the object

head(assay(exp.hg38), 3) # first rows of the count matrix



Thanks, and I am sorry if I am being annoying.

Can this matrix file be saved in .csv format?

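library(SummarizedExperiment) # provides assay() and rowData()
exp.hg38.values <- assay(exp.hg38)

rownames(exp.hg38.values) <- rowData(exp.hg38)$external_gene_name # gene symbols
write.csv(exp.hg38.values, file = "exp_hg38_counts.csv") # file name is only an example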


But I do not understand why someone would want to inspect a CSV with more than 600 columns and nearly 60,000 rows. Do you have a specific purpose for this? Why not work with this object directly in R?


I tried my best but couldn't save the matrix file and the clinical data file :(
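
One possible way to write both tables to disk is sketched below, in base R. The helper name `save_as_csv` and the output file names are hypothetical, and the toy matrix and data frame only stand in for the real data. A likely stumbling block is that the clinical table may contain list-type columns, which `write.csv()` cannot write, so the sketch simply drops them.

```r
# Hypothetical helper: write a count matrix and a clinical table to CSV files.
# List-type columns are dropped from the clinical table because write.csv()
# cannot write them.
save_as_csv <- function(counts, clinical, out_dir = ".") {
  clinical <- as.data.frame(clinical)
  is_list_col <- vapply(clinical, is.list, logical(1))
  counts_file <- file.path(out_dir, "counts_matrix.csv")
  clinical_file <- file.path(out_dir, "clinical_data.csv")
  write.csv(counts, counts_file)
  write.csv(clinical[, !is_list_col, drop = FALSE], clinical_file)
  invisible(c(counts = counts_file, clinical = clinical_file))
}

# Toy stand-ins; with the real object these would presumably be
#   library(SummarizedExperiment)
#   counts   <- assay(exp.hg38)
#   clinical <- colData(exp.hg38)
counts <- matrix(c(0, 12, 3, 7), nrow = 2,
                 dimnames = list(c("geneA", "geneB"), c("sample1", "sample2")))
clinical <- data.frame(gender = c("male", "female"),
                       row.names = c("sample1", "sample2"))

out_files <- save_as_csv(counts, clinical, out_dir = tempdir())
print(out_files)
```

The resulting files open in any spreadsheet program, with genes as rows and samples as columns in the counts file, and one row per sample in the clinical file.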