TCGA gene expression quantitation batch information
8 months ago
wrab425 ▴ 50

How can one establish the batch information for a TCGA gene expression quantitation TSV file?

8 months ago
350769816 ▴ 10

I added the samples to the cart on the GDC Data Portal and downloaded them, then merged the files with the R code below. I hope this helps.

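library(data.table)

## The sample sheet (gdc_sample_sheet.tsv) can be generated at the GDC Data Portal
index <- read.table("gdc_sample_sheet.tsv", sep = "\t", header = TRUE, stringsAsFactors = FALSE)
index <- index[order(index$Sample.ID), ]

## Read the files (assumes every quantification TSV sits directly in this directory)
setwd("where_you_download_your_data")
expr_file <- index$File.Name
## Take column 4 of each file (the unstranded count, assuming the GDC STAR-Counts layout)
mat <- do.call(cbind, lapply(expr_file, function(x) fread(x, header = TRUE, sep = "\t")[, 4]))
## Gene annotation columns are taken from the first file
exp_mat <- read.table(expr_file[1], sep = "\t", header = TRUE, stringsAsFactors = FALSE)
mat <- data.frame(exp_mat$gene_id, exp_mat$gene_name, exp_mat$gene_type, mat)
## Drop the first four rows, which hold summary counts (N_unmapped etc.) rather than genes
mat <- mat[5:nrow(mat), ]
## Label each count column by sample; Sample.ID is an assumption, substitute your own ID column
colnames(mat) <- c("ensembl_gene_id", "hgnc_symbol", "gene_biotype", index$Sample.ID)
## Strip the version suffix from the Ensembl gene IDs (everything from the first "." onward)
mat$ensembl_gene_id <- sub("[.].*$", "", mat$ensembl_gene_id)
write.table(mat, "TCGA.tsv", row.names = FALSE, col.names = TRUE, sep = "\t", quote = FALSE)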

One coding suggestion:

Using do.call(cbind, ...) may cause performance problems with a large amount of data, because the combined object is built by copying the pieces that were read in.

Predicting the size of the final data frame/matrix and creating that object in advance is likely to be much faster for large amounts of data.
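
A minimal sketch of the preallocation idea, in base R. The file names, sample labels, and toy counts below are hypothetical stand-ins rather than GDC data; with real files, the number of rows would come from the first file and the number of columns from the sample sheet.

```r
# Hypothetical stand-in data: three small TSV files, each with a count column.
tmp_dir <- tempfile("expr_")
dir.create(tmp_dir)
sample_ids <- c("S1", "S2", "S3")                  # hypothetical sample labels
genes <- c("geneA", "geneB", "geneC", "geneD")     # hypothetical gene IDs
files <- file.path(tmp_dir, paste0(sample_ids, ".tsv"))
for (i in seq_along(files)) {
  write.table(data.frame(gene_id = genes, unstranded = seq_along(genes) * i),
              files[i], sep = "\t", row.names = FALSE, quote = FALSE)
}

# Allocate the full genes-by-samples matrix once, then fill one column per file.
counts <- matrix(NA_integer_, nrow = length(genes), ncol = length(files),
                 dimnames = list(genes, sample_ids))
for (j in seq_along(files)) {
  one <- read.table(files[j], sep = "\t", header = TRUE, stringsAsFactors = FALSE)
  stopifnot(identical(one$gene_id, genes))         # rows must be in the same order
  counts[, j] <- one$unstranded
}
```

The check on gene_id guards against files whose rows are ordered differently, which a plain cbind would silently misalign. Whether this is actually faster than cbind for a given number of files is worth timing on your own data.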
