Question

TCGA - COAD read data gene expression analysis

6.4 years ago

David_emir ▴ 490

Hi All,

I am planning to conduct differential gene expression analysis on TCGA-COAD/READ samples. The data have been moved to the GDC portal, and it's difficult for me to handle.

My question: the GDC portal shows ~600 samples for colon under data.category = "Transcriptome Profiling", data.type = "Gene expression quantification", workflow.type = "HTSeq - FPKM-UQ". How can I download these samples as a matrix file so that I can run a Normal vs. Tumor comparison?

Secondly, how do I download clinical data for the samples? Every time, a .json file gets downloaded and I really don't know how to handle it. Please keep in mind that I am a biologist and not so good at programming/IT. I would appreciate it if you could share a protocol for this. Thanks a lot for your support!

Regards, Dav

GDC gene expression RNA-Seq • 4.9k views

ADD COMMENT • link updated 6.4 years ago by svlachavas ▴ 790 • written 6.4 years ago by David_emir ▴ 490

score 8 · Answer 1 · 2017-11-09

6.4 years ago

svlachavas ▴ 790

Dear David,

First, keep in mind that it is not appropriate to perform downstream DE analysis with edgeR, DESeq, etc. on "normalized" values such as FPKM-UQ or similar metrics. These tools model raw counts, so you need the raw counts. Second, concerning your question about downloading:

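library(TCGAbiolinks)

# Example query (STAD here; substitute your own project, e.g. "TCGA-COAD")
query.exp.hg38 <- GDCquery(project = "TCGA-STAD",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts")

# Download the matching files in chunks of 50
GDCdownload(query.exp.hg38, files.per.chunk = 50)

# Assemble the downloaded files into a single RangedSummarizedExperiment
exp.hg38 <- GDCprepare(query = query.exp.hg38)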

## and for the clinical data:

SummarizedExperiment::colData(exp.hg38) ## contains all the clinical, phenotype and subtype related information
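
For the Normal vs. Tumor split asked about in the question, one option is to read the sample type straight from the TCGA barcode: characters 14-15 hold a two-digit code, where 01-09 are tumor types and 10-19 are normal types. The sketch below uses only base R. The helper `sample_group` is hypothetical (it is not a TCGAbiolinks function), and the barcodes are made up for illustration.

```r
# Hypothetical helper (not part of TCGAbiolinks): classify TCGA barcodes
# by the two-digit sample-type code at characters 14-15.
sample_group <- function(barcodes) {
  code <- as.integer(substr(barcodes, 14, 15))
  ifelse(code <= 9, "Tumor", ifelse(code <= 19, "Normal", "Control"))
}

# Illustrative barcodes (made up for this example)
barcodes <- c("TCGA-AA-0001-01A-01R-0001-07",
              "TCGA-AA-0001-11A-01R-0001-07")
groups <- sample_group(barcodes)
table(groups)
```

With the real object, and assuming its column names are the sample barcodes, this would be called as `sample_group(colnames(exp.hg38))`. The colData table usually carries a sample-type column as well, but its exact name depends on the TCGAbiolinks version.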

Hope that helps,

Efstathios

ADD COMMENT • link 6.4 years ago by svlachavas ▴ 790

Thanks a lot Efstathios, I will keep that in mind. Going further, I have successfully downloaded the files. I now have a file "gdc_download_20171109_051908" containing around 645 sub-files, each of which is a .zip. When I extract one, I get something like:

ENSG00000242268.2 0.0
ENSG00000270112.3 0.0
ENSG00000167578.15 90864.4084112

Now I have one more problem: if I need all 645 files as a single matrix, how can I go about this? Should I manually copy and paste each file? Please help.

Regards, Dav

ADD REPLY • link 6.4 years ago by David_emir ▴ 490

Dear David,

Have you followed the above commands exactly?

If so, you don't have to do anything with the zip files and related stuff: you will have your RangedSummarizedExperiment ready with the raw counts and the phenotype data.
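
For intuition only: merging per-sample files by hand amounts to reading each two-column file (gene ID, value) and binding the values column-wise on a shared gene order. The base R sketch below does this on two toy files written to a temporary folder. The file names, the `.counts` extension, and the helper `read_counts` are assumptions made for illustration and do not reflect the actual GDC download layout.

```r
# Toy stand-ins for two per-sample files (gene ID <tab> count).
toy_dir <- tempfile("counts_")
dir.create(toy_dir)
writeLines(c("ENSG00000242268.2\t0", "ENSG00000167578.15\t120"),
           file.path(toy_dir, "sampleA.counts"))
writeLines(c("ENSG00000242268.2\t3", "ENSG00000167578.15\t98"),
           file.path(toy_dir, "sampleB.counts"))

files <- list.files(toy_dir, pattern = "\\.counts$", full.names = TRUE)

# Read one file into a named vector: names are gene IDs, values are counts.
read_counts <- function(path) {
  tab <- read.delim(path, header = FALSE, col.names = c("gene", "count"),
                    stringsAsFactors = FALSE)
  setNames(tab$count, tab$gene)
}
per_sample <- lapply(files, read_counts)

# Align every sample on the gene IDs of the first file, then bind columns.
genes <- names(per_sample[[1]])
count_matrix <- sapply(per_sample, function(v) v[genes])
rownames(count_matrix) <- genes
colnames(count_matrix) <- sub("\\.counts$", "", basename(files))
count_matrix
```

Unlike the prepared RangedSummarizedExperiment, a matrix built this way carries no phenotype data.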

ADD REPLY • link 6.4 years ago by svlachavas ▴ 790

Dear Efstathios, I have done exactly the same, except that I used TCGA-COAD. I am using CentOS 7 as my OS.

ADD REPLY • link 6.4 years ago by David_emir ▴ 490

I am also getting the following error. Please help, I am stuck here :(

Error in checkProjectInput(project) :
  Please set a valid project argument from the column id above. Project TCGA-COAD was not found.
In addition: Warning messages:
1: Unnamed col_types should have the same length as col_names. Using smaller of the two.
2: In rbind(names(probs), probs_f) : number of columns of result is not a multiple of vector length (arg 1)
3: Unknown or uninitialised column: 'project_id'.
4: Unknown or uninitialised column: 'project_id'.

ADD REPLY • link 6.4 years ago by David_emir ▴ 490

What version of the R package TCGAbiolinks do you have? You probably need to install the GitHub version, after first removing any previously installed TCGAbiolinks library:

devtools::install_github(repo = "BioinformaticsFMRP/TCGAbiolinks")

also check:

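# biocLite() was the Bioconductor installer when this was written;
# on current R versions BiocManager replaces it.
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager")
BiocManager::install("GenomeInfoDbData")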
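
To answer the version question, a small base R check such as the sketch below can report what is installed. The helper `installed_version` is hypothetical and defined here only for illustration. It returns NA when the package is missing.

```r
# Hypothetical helper: return the installed version of a package as a string,
# or NA if the package is not installed.
installed_version <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    as.character(utils::packageVersion(pkg))
  } else {
    NA_character_
  }
}

installed_version("TCGAbiolinks")  # package version, or NA if absent
R.version.string                   # R version, useful when reporting errors
```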

ADD REPLY • link 6.4 years ago by svlachavas ▴ 790

Hi Efstathios, I have uninstalled and reinstalled TCGAbiolinks, and this time it actually worked. Thanks a lot. But after executing the commands I am not able to find any matrix file in the folder; it downloaded 521 files, and each file has a .zip file. The code is as follows:

library(TCGAbiolinks)

query.exp.hg38 <- GDCquery(project = "TCGA-COAD",
                           data.category = "Transcriptome Profiling",
                           data.type = "Gene Expression Quantification",
                           workflow.type = "HTSeq - Counts") # example query

GDCdownload(query.exp.hg38,files.per.chunk = 50)

exp.hg38 <- GDCprepare(query = query.exp.hg38)

SummarizedExperiment::colData(exp.hg38) ###clinical Data

Can I save the matrix file?

ADD REPLY • link 6.4 years ago by David_emir ▴ 490

library(SummarizedExperiment) # provides assay() and colData()

exp.hg38 <- GDCprepare(query = query.exp.hg38, save = TRUE, save.filename = "exp.rda") # save the object

head(assay(exp.hg38), 3) # example of matrix counts information

head(colData(exp.hg38)) # phenotype information

ADD REPLY • link 6.4 years ago by svlachavas ▴ 790

Thanks, and I am sorry if I am being annoying.

Can this matrix be saved in .csv format?

ADD REPLY • link 6.4 years ago by David_emir ▴ 490

exp.hg38.values <- assay(exp.hg38)

rownames(exp.hg38.values) <- values(exp.hg38)$external_gene_name #gene symbols
write.csv(exp.hg38.values,file = "stad_exp_hg38_htseq_counts.csv")

But I do not understand why someone would want to inspect a CSV with more than 600 columns and nearly 60,000 rows. Do you have some purpose for this? Why not proceed directly in R with the manipulation of this object?
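
If a CSV is needed anyway, for example to pass the counts to another tool, one detail worth knowing is that `read.csv` rewrites column names containing hyphens unless `check.names = FALSE` is set, and TCGA barcodes are full of hyphens. The base R sketch below round-trips a toy matrix standing in for `assay(exp.hg38)`. The barcodes and counts are made up.

```r
# Toy count matrix standing in for assay(exp.hg38); barcodes are made up.
counts <- matrix(c(0L, 120L, 3L, 98L), nrow = 2,
                 dimnames = list(c("ENSG00000242268.2", "ENSG00000167578.15"),
                                 c("TCGA-AA-0001-01A", "TCGA-AA-0001-11A")))

csv_path <- tempfile(fileext = ".csv")
write.csv(counts, file = csv_path)

# check.names = FALSE keeps the hyphens in the column names;
# row.names = 1 restores the gene IDs as row names.
restored <- as.matrix(read.csv(csv_path, row.names = 1, check.names = FALSE))
identical(colnames(restored), colnames(counts))
```

Ensembl IDs are unique, so they are safe as row names here. Gene symbols can repeat, and `read.csv(..., row.names = 1)` stops with an error on duplicated row names, so keeping the Ensembl IDs and storing symbols in a separate column may be the more robust choice.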