HT-Seq count data coding gene
3.6 years ago
Rob ▴ 170

Hi friends, I am using R code to download HT-Seq count data. Does anybody know what line I should add to the code to download only coding genes? At the moment the output contains all genes (coding and non-coding).

My code:

library(TCGAbiolinks)

CancerProject <- "TCGA-KIRC"
query <- GDCquery(project = CancerProject,
                  data.category = "Transcriptome Profiling",
                  data.type = "Gene Expression Quantification",
                  sample.type = c("Primary Tumor"),
                  workflow.type = "HTSeq - Counts")
# download raw counts for DESeq2
GDCdownload(query)
data <- GDCprepare(query, save = TRUE, save.filename = "exp.rda")
rna <- as.data.frame(SummarizedExperiment::assay(data)) # expression matrix
write.csv(rna, "rna.csv")
clinical <- as.data.frame(SummarizedExperiment::colData(data)) # associated clinical data
write.csv(clinical, "clinical.csv")
rna-seq gene • 662 views
3.6 years ago

As far as I know, there is no way to do this at the data download step, since the TCGA data carries no annotation of which genes are protein coding. You can retrieve the gene type from BioMart via the biomaRt package as a data.frame and then use it to filter your matrix.

This function should retrieve the needed data from BioMart:

library(biomaRt)

# Look up the gene symbol and gene type (biotype) for a vector of Ensembl gene IDs.
gen.type <- function(ids) {
  mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
  anno <- getBM(filters = "ensembl_gene_id", # if your expression matrix uses gene symbols, use "hgnc_symbol"
                attributes = c("ensembl_gene_id", "hgnc_symbol", "gene_biotype"),
                values = ids, mart = mart)
  return(anno)
}

df <- gen.type(row.names(rna)) # returns a data frame with Ensembl gene ID, gene symbol and gene type
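
The filtering step itself could look roughly like the sketch below. The helper name `keep_protein_coding` and the toy data are made up for illustration (they are not part of biomaRt or TCGAbiolinks), and the sketch assumes the row names of the matrix are Ensembl gene IDs, possibly with a version suffix such as ".20".

```r
# Hypothetical helper (not part of any package): keep only the rows whose
# Ensembl gene ID is annotated as "protein_coding".
# `counts`: data frame or matrix with Ensembl gene IDs as row names.
# `anno`:   data frame shaped like the output of gen.type().
keep_protein_coding <- function(counts, anno) {
  # Strip a version suffix such as ".20" in case the row names carry one.
  ids <- sub("\\.[0-9]+$", "", rownames(counts))
  coding_ids <- unique(anno$ensembl_gene_id[anno$gene_biotype == "protein_coding"])
  counts[ids %in% coding_ids, , drop = FALSE]
}

# Toy data standing in for `rna` and `df`; the counts are invented.
toy_rna <- data.frame(
  sample_A = c(10, 0, 25),
  sample_B = c(12, 3, 30),
  row.names = c("ENSG00000141510", "ENSG00000251562", "ENSG00000012048.20")
)
toy_anno <- data.frame(
  ensembl_gene_id = c("ENSG00000141510", "ENSG00000251562", "ENSG00000012048"),
  hgnc_symbol = c("TP53", "MALAT1", "BRCA1"),
  gene_biotype = c("protein_coding", "lncRNA", "protein_coding"),
  stringsAsFactors = FALSE
)

toy_coding <- keep_protein_coding(toy_rna, toy_anno)
print(rownames(toy_coding))
```

With the real objects this would be something like rna_coding <- keep_protein_coding(rna, df), followed by write.csv(rna_coding, "rna_coding.csv"). Genes that BioMart does not return at all (for example retired IDs) are dropped as well, so it is worth comparing nrow(rna) and nrow(df) first.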