How to obtain RPKM/FPKM from HTSeq-Count data?
6.6 years ago
alcs417 ▴ 100

Hi there,

I am currently working on a pan-cancer analysis project. I have done differential expression analysis for miRNAs and genes with edgeR. Because edgeR takes only raw counts as input, I downloaded the HTSeq-Count data from the GDC data portal. After the differential expression analysis, I would like to obtain normalized expression values (RPKM/FPKM) for the miRNAs and genes for downstream analysis, such as using Pearson correlation between miRNAs and mRNAs to construct a miRNA-mRNA regulatory network. However, I have been stuck for days on how to get normalized expression values for both miRNAs and genes. Here are my questions:

  1. There is a function called "cpm" (counts per million) in edgeR, but it does not take gene length into account. edgeR also provides another form of normalized counts, "pseudo.counts", but I have been told these are quite difficult to interpret. So I am wondering whether I could use "logCPM" as the normalized expression values for the downstream analysis?

  2. If not, I realized that there is also a function called "rpkm" in edgeR that can calculate normalized expression values. However, it needs gene/miRNA length information to work. I do not know where to find the lengths of genes and miRNAs, since the HTSeq-Count files contain no such information. Could anyone please tell me how to do it? Should I download the gene information from Ensembl and the miRNA information from miRBase, and calculate the lengths myself? Is there an R package that could do this instead?

  3. Could I just download the RPKM files of miRNAs and genes from the GDC data portal to construct the miRNA-mRNA regulatory network? Would that be correct? It seems to be the easiest option for me.
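
For reference, the quantity asked about in question 1 can be sketched in a few lines of Python. This is only an illustration of the definition (count divided by library size, times one million), not edgeR's implementation: the function names, the `prior_count` default, and the counts are made up for the example, and edgeR's own log-CPM handles the prior count and library sizes differently.

```python
import math

def cpm(counts):
    """Counts per million for one sample: count * 1e6 / library size."""
    lib_size = sum(counts.values())
    return {gene: c * 1e6 / lib_size for gene, c in counts.items()}

def log_cpm(counts, prior_count=1.0):
    """log2(CPM + prior_count). A simplification, NOT edgeR's exact formula."""
    return {gene: math.log2(v + prior_count) for gene, v in cpm(counts).items()}

# Hypothetical raw counts for a single sample (made-up numbers).
sample = {"geneA": 500, "geneB": 1500, "geneC": 8000}

print(cpm(sample))
print(log_cpm(sample))
```

CPM corrects only for sequencing depth, so it is comparable for the same gene across samples, but not between genes of different length within a sample.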

Any help would be really appreciated.

TCGA; RNA-Seq; edgeR; expression analysis
6.6 years ago

There is not much use in calculating RPKM/FPKM for miRNAs: mature miRNAs are all of similar length (about 22 nt), so normalising by length changes little.

To answer your question, you can use featureCounts to quantify your genes using a GTF file. This outputs a matrix of counts that also includes a column of gene lengths. That column can be passed to the edgeR rpkm() function as its gene.length argument.
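
To make the role of the length column concrete, here is a minimal Python sketch of the RPKM definition (reads per kilobase of gene per million mapped reads). It is an illustration only: the function name, counts, and lengths are made up, and it uses the raw library size, whereas edgeR's rpkm() can use normalised library sizes.

```python
def rpkm(counts, lengths_bp):
    """RPKM for one sample: count * 1e9 / (library size * gene length in bp)."""
    lib_size = sum(counts.values())
    return {
        gene: c * 1e9 / (lib_size * lengths_bp[gene])
        for gene, c in counts.items()
    }

# Hypothetical raw counts and gene lengths in base pairs (made-up numbers).
counts = {"geneA": 500, "geneB": 1500, "geneC": 8000}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

values = rpkm(counts, lengths_bp)
print(values)
```

In this toy data geneB has three times the counts of geneA but is also three times as long, so both end up with the same RPKM. That length effect is what CPM leaves uncorrected.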

Option 3 would make sense. If you already have normalised data available, you can use it to calculate the correlations.
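
As a rough sketch of that correlation step, the Pearson coefficient between one miRNA and one mRNA across matched samples could be computed as below. The expression values are invented for illustration, and the sketch assumes both vectors are in the same sample order and already normalised (for example log-scale values).

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical normalised expression across five matched samples.
mirna = [8.1, 7.4, 6.9, 5.2, 4.8]
mrna = [3.0, 3.6, 4.1, 5.5, 6.0]

r = pearson(mirna, mrna)
print(r)
```

A negative coefficient is the pattern one would look for when a miRNA represses its target, although correlation alone does not establish regulation.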

Could you please tell me where to download the GTF file? It seems that TCGA does not have this file for genes? Also, how can I obtain normalized miRNA expression data from HTSeq-Count data then? Thanks.

I assumed that you already had the RPKM values, based on your question "Could I just download the RPKM files of miRNAs and genes from GDC data portal".

I would suggest speaking to someone in your workplace who does bioinformatics. If you are just starting with RNA-Seq analysis and want to work with TCGA data, it takes a lot of work and guidance.

I eventually found the GTF file on the Ensembl website. Anyway, thank you very much for your help. I am working on it.
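
For anyone following the same route, one possible way to turn a GTF file into per-gene lengths is sketched below in Python. It takes the length of a gene to be the total number of bases covered by the union of its exons, which is one common convention (other definitions exist). The function name and the example records are hypothetical, and the sketch assumes standard tab-separated GTF lines with a `gene_id "..."` attribute.

```python
import re
from collections import defaultdict

GENE_ID = re.compile(r'gene_id "([^"]+)"')

def gene_lengths_from_gtf(lines):
    """Return {gene_id: bases covered by the union of that gene's exons}."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header comments
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue  # only exon records contribute to the length
        match = GENE_ID.search(fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:      # overlapping or adjacent: merge
                cur_end = max(cur_end, end)
            else:                         # gap: close the current block
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths

# Made-up GTF records: G1 has two overlapping exons plus a separate one.
example_gtf = [
    '1\thavana\texon\t100\t200\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t150\t300\t.\t+\t.\tgene_id "G1"; transcript_id "T2";',
    '1\thavana\texon\t500\t600\t.\t+\t.\tgene_id "G1"; transcript_id "T1";',
    '1\thavana\texon\t1000\t1099\t.\t-\t.\tgene_id "G2"; transcript_id "T3";',
]

print(gene_lengths_from_gtf(example_gtf))
```

With a real annotation, the same function could be given an open file handle instead of the list. The gene IDs in the GTF must match those in the count files, so the annotation release matters.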
