Calculate RPKM
0
1
Entering edit mode
11 days ago
Chris ▴ 70

Hi bioinformaticians, would anyone calculate RPKM from the count matrix with edgeR or DESeq2?

I found some resources guide on this but there are one or two steps I don't know such as this:

y <- DGEList(counts=counts,genes=data.frame(Length=GeneLength))
y <- calcNormFactors(y)
RPKM <- rpkm(y)


How to get GeneLength?

https://github.com/mgonzalezporta/TeachingMaterial/blob/master/doc/25.normalising.md

How to get annot object?

RPKM • 738 views
1
Entering edit mode

What organism would you like to get the gene length for? What is the source for the transcriptome you are using? Different organisms have different resources available for calculating gene length.

0
Entering edit mode

Thank you for your reply. It is gene length for human. I am sorry for not being sure about the second question, in-house transcriptome.

1
Entering edit mode

Is the in-house transcriptome in GTF format? Do you know if you have multiple transcripts per gene? If your counts table reflects counts on genes, one way to get gene length, depending on how you've quantified reads across your transcriptome, would be to read the GTF file into R and create a TxDb object (using makeTxDbFromGFF), then you can extract out all exons by gene into a GRangesList, reduce them, and then sum their widths by gene. There are commands from the GRanges and GenomicFeatures libraries for each of these steps. This would give you a width representing the shadow of all exons for the gene on the genome (essentially all transcripts reduced). It's a bit of a fudge, as for a given locus with multiple transcripts it doesn't actually reflect any one transcript - it sort of represents a general bucket for the locus. In many cases we don't quantify individual isoforms of a locus, we quantify reads mapping to the locus overall. Just some caveats to keep in mind. Maybe others could suggest another approach.

0
Entering edit mode

gene_length <- aggregate(width, list(gtf_exon$gene_name), FUN = sum) y <- DGEList(counts=counts,genes=data.frame(Length=gene_length))  Error in DGEList(counts = counts, genes = data.frame(Length = gene_length)) : Counts and genes have different numbers of rows Gene_length has 40192 observations with gene name meanwhile counts > 0 has 33863 observations with ensemple ID. Would you suggest a solution to this error? ADD REPLY 1 Entering edit mode If you're sure about your gene lengths, then simply make sure the counts matrix and gene_length matrix match each other. Assuming counts and gene_length have gene names as rownames: # get the gene_lengths that match the counts matrix gene_length <- gene_length[match(rownames(counts), rownames(gene_length)),] # check the results all(rownames(counts) == rownames(gene_length))  (modify the code as required, I hope the idea is clear) ADD REPLY 0 Entering edit mode Thank you for your suggestion! Before using your code, I convert ensembl ID in counts matrix to gene name in gene length table: raw_counts$symbol <- mapIds(org.Hs.eg.db, keys = rownames(raw_counts), keytype = "ENSEMBL", column = "SYMBOL")


'select()' returned 1:many mapping between keys and columns

Would you tell me how to fix the error above? I found this error here:

How to deal with 'select()' returned 1:many mapping between keys and columns warning when retrieving Entrez IDs using Gene Symbols in R?

However this post is not clear to me to figure out how. I appreciate your help.

Update: I found a solution:

if (!require("BiocManager", quietly = TRUE))
install.packages("BiocManager")

BiocManager::install("biomaRt")
library(biomaRt)
your_data_frame <- as.data.frame(counts)
mart <- useDataset("hsapiens_gene_ensembl", useMart("ensembl"))
gene_IDs <- getBM(filters= "ensembl_gene_id", attributes= c("ensembl_gene_id","hgnc_symbol"), values = rownames(counts), mart= mart)

your_data_frame\$GeneName <- gene_IDs[match(rownames(your_data_frame),gene_IDs[,1]),2]


The code is longer but it doesn't have error.

I try to replace the first column with Ensembl ID with the gene symbol. However, I can't get the index of the first column. The first indexing is the count of the first replicate. Would you suggest how to do so? Thank you so much!

Update: I figured it out.

0
Entering edit mode

I got gene_length and counts matrix with the same number of row with your suggestion.

y <-DGEList(counts=your_data_frame,genes=data.frame(Length=gene_length))


Error in DGEList(counts = your_data_frame, genes = data.frame(Length = gene_length)) :The count matrix is a data.frame instead of a matrix and the first column is of class characterinstead of being numeric. Was the first column intended to contain geneids?

I converted row names to column. How can I convert back to overcome the error back? Like this: https://www.geeksforgeeks.org/convert-values-in-column-into-row-names-of-dataframe-in-r/

I tried their solution but it didn't work.

1
Entering edit mode

the easiest way is to probably take the numeric columns of your data frame and create a matrix. For instance, if column 1 is gene names, and columns 2:10 are your count values, you could do:

foo <- as.matrix(your_data_frame[,2:10])
rownames(foo) <- your_data_frame[,1]


alternatively, if your_data_frame has gene ids as rownames:

foo <- as.matrix(your_data_frame[,2:10])
rownames(foo) <- rownames(your_data_frame)


Lastly, you can also omit columns with negative subscripts, so if column 1 is gene names:

foo <- as.matrix(your_data_frame[,-1])
rownames(foo) <- rownames(your_data_frame)


Then use foo in as your count matrix.

0
Entering edit mode

Thank you so much! I figured it out a few hours ago. Could I ask you questions about ATAC-seq?

1
Entering edit mode

You're welcome to contact me, but you should just ask the forum overall with a new question. You'll likely get much better answers.

0
Entering edit mode

Thank you! How can I contact you?