How to get RPKM from count matrix
5 months ago
Chris ▴ 260

Hi Biostars,

I have a count matrix with mouse gene names as row names and need to compute RPKM. I know it is not a good metric, but the biologists are used to it.

library(rtracklayer)  # readGFF()
library(edgeR)        # DGEList(), calcNormFactors(), rpkm()

gtf <- readGFF("/reference_genome/mm39.ncbiRefSeq.gtf")
gtf_exon <- gtf[gtf$type == "exon", ]
width <- gtf_exon$end - gtf_exon$start + 1
# Sum exon widths per gene (exons shared between transcripts are counted more than once)
gene_length <- aggregate(width, list(gtf_exon$gene_name), FUN = sum)
colnames(gene_length) <- c("gene_name", "gene_length")
# Align lengths to the rows of the count matrix; NA where a gene is missing from the GTF
len <- gene_length$gene_length[match(rownames(counts_mouse), gene_length$gene_name)]
keep <- !is.na(len)  # genes without a length cannot get an RPKM value
y <- DGEList(counts = counts_mouse[keep, ], genes = data.frame(Length = len[keep]))
y <- calcNormFactors(y)
RPKM <- rpkm(y)  # uses the "Length" column in y$genes
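
One caveat with summing exon widths per gene as above: exons shared between transcripts of the same gene get counted several times, so the length can come out too large. Below is a minimal base-R sketch of the alternative, taking the union of exon intervals per gene and then computing RPKM by hand. The exon table, the counts, and the helper `union_length()` are hypothetical toy values and names, not from my data. The sketch assumes all exons of a gene sit on one chromosome, and it uses raw library sizes, so it will not exactly match `rpkm(y)` after `calcNormFactors()`.

```r
# Toy exon table (hypothetical): GeneA has two overlapping exons
# coming from different transcripts, GeneB has a single exon.
exons <- data.frame(
  gene_name = c("GeneA", "GeneA", "GeneA", "GeneB"),
  start     = c(1, 51, 201, 1),
  end       = c(100, 150, 300, 500)
)

# Total length of the union of closed intervals [start, end]:
# sort by start, merge overlapping intervals, add up merged widths.
union_length <- function(start, end) {
  ord <- order(start)
  start <- start[ord]
  end <- end[ord]
  total <- 0
  cur_start <- start[1]
  cur_end <- end[1]
  for (i in seq_along(start)[-1]) {
    if (start[i] <= cur_end) {
      cur_end <- max(cur_end, end[i])            # overlap: extend current block
    } else {
      total <- total + (cur_end - cur_start + 1) # gap: close current block
      cur_start <- start[i]
      cur_end <- end[i]
    }
  }
  total + (cur_end - cur_start + 1)
}

# Named vector of non-redundant exonic length per gene
gene_length <- sapply(split(exons, exons$gene_name),
                      function(d) union_length(d$start, d$end))

# Toy count matrix (hypothetical), genes in rows and samples in columns
counts <- matrix(c(10, 20, 30, 40), nrow = 2,
                 dimnames = list(c("GeneA", "GeneB"), c("s1", "s2")))

# RPKM = count / (gene length in kb) / (library size in millions)
lib_size <- colSums(counts)
rpkm_manual <- t(t(counts) / lib_size) /
  (gene_length[rownames(counts)] / 1e3) * 1e6
```

On real data the same union could probably be done more efficiently with interval tools such as `GenomicRanges::reduce()`, but I have not checked that here.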

I looked for a GTF file to get the gene lengths, but none of the GTF files I found use gene names as identifiers. Do you have a suggestion? Thank you so much!

Update: there are many genes like 1700012P22Rik at the beginning of the matrix, which makes me think the row names are not in gene symbol format.
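
For what it is worth, names ending in "Rik" are RIKEN cDNA clone names that MGI uses as official gene symbols, so they do not by themselves mean the matrix is in another format. A quick way to check is to measure how many row names are found among the GTF gene names. Below is a minimal sketch with hypothetical toy vectors; with real data `matrix_ids` would be `rownames(counts_mouse)` and `gtf_names` would be `unique(gtf$gene_name)`.

```r
# Hypothetical identifiers standing in for the count matrix row names
matrix_ids <- c("1700012P22Rik", "Xkr4", "Gm1992", "ENSMUSG00000051951")
# Hypothetical gene names standing in for the GTF gene_name attribute
gtf_names <- c("Xkr4", "1700012P22Rik", "Rp1", "Gm1992")

# Which matrix identifiers are present in the GTF?
is_symbol <- matrix_ids %in% gtf_names
frac_matched <- mean(is_symbol)      # fraction of rows that have a match
unmatched <- matrix_ids[!is_symbol]  # identifiers to inspect by hand

# Flag identifiers that look like Ensembl mouse gene IDs instead of symbols
looks_ensembl <- grepl("^ENSMUSG[0-9]+", matrix_ids)
```

If most row names match, the matrix is already in gene symbol format and only the unmatched identifiers need attention.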

