Entering edit mode
11 days ago
Chris ▴ 70
Hi bioinformaticians, would anyone calculate RPKM from the count matrix with edgeR or DESeq2?
I found some resources guide on this but there are one or two steps I don't know such as this:
y <- DGEList(counts=counts,genes=data.frame(Length=GeneLength)) y <- calcNormFactors(y) RPKM <- rpkm(y)
How to get GeneLength?
How to get annot object?
I appreciate your help!
What organism would you like to get the gene length for? What is the source for the transcriptome you are using? Different organisms have different resources available for calculating gene length.
Thank you for your reply. It is gene length for human. I am sorry for not being sure about the second question, in-house transcriptome.
Is the in-house transcriptome in GTF format? Do you know if you have multiple transcripts per gene? If your counts table reflects counts on genes, one way to get gene length, depending on how you've quantified reads across your transcriptome, would be to read the GTF file into R and create a TxDb object (using makeTxDbFromGFF), then you can extract out all exons by gene into a GRangesList, reduce them, and then sum their widths by gene. There are commands from the GRanges and GenomicFeatures libraries for each of these steps. This would give you a width representing the shadow of all exons for the gene on the genome (essentially all transcripts reduced). It's a bit of a fudge, as for a given locus with multiple transcripts it doesn't actually reflect any one transcript - it sort of represents a general bucket for the locus. In many cases we don't quantify individual isoforms of a locus, we quantify reads mapping to the locus overall. Just some caveats to keep in mind. Maybe others could suggest another approach.
Thank you for your reply!
Error in DGEList(counts = counts, genes = data.frame(Length = gene_length)) : Counts and genes have different numbers of rows
Gene_length has 40192 observations with gene name meanwhile counts > 0 has 33863 observations with ensemple ID. Would you suggest a solution to this error?
If you're sure about your gene lengths, then simply make sure the counts matrix and gene_length matrix match each other. Assuming counts and gene_length have gene names as rownames:
(modify the code as required, I hope the idea is clear)
Thank you for your suggestion! Before using your code, I convert ensembl ID in counts matrix to gene name in gene length table:
'select()' returned 1:many mapping between keys and columns
Would you tell me how to fix the error above? I found this error here:
How to deal with `'select()' returned 1:many mapping between keys and columns` warning when retrieving Entrez IDs using Gene Symbols in R?
However this post is not clear to me to figure out how. I appreciate your help.
Update: I found a solution:
The code is longer but it doesn't have error.
I try to replace the first column with Ensembl ID with the gene symbol. However, I can't get the index of the first column. The first indexing is the count of the first replicate. Would you suggest how to do so? Thank you so much!
Update: I figured it out.
I got gene_length and counts matrix with the same number of row with your suggestion.
Error in DGEList(counts = your_data_frame, genes = data.frame(Length = gene_length)) :The count matrix is a data.frame instead of a matrix and the first column is of class characterinstead of being numeric. Was the first column intended to contain geneids?
I converted row names to column. How can I convert back to overcome the error back? Like this: https://www.geeksforgeeks.org/convert-values-in-column-into-row-names-of-dataframe-in-r/
I tried their solution but it didn't work.
the easiest way is to probably take the numeric columns of your data frame and create a matrix. For instance, if column 1 is gene names, and columns 2:10 are your count values, you could do:
alternatively, if your_data_frame has gene ids as rownames:
Lastly, you can also omit columns with negative subscripts, so if column 1 is gene names:
Then use foo in as your count matrix.
Thank you so much! I figured it out a few hours ago. Could I ask you questions about ATAC-seq?
You're welcome to contact me, but you should just ask the forum overall with a new question. You'll likely get much better answers.
Thank you! How can I contact you?