Question: How to obtain RPKM/FPKM from HTSeq-Count data?
alcs417 wrote, 24 months ago:

Hi there,

I am currently working on a project related to pan-cancer analysis. I have done differential expression analysis for miRNAs and genes with edgeR. I know that edgeR only takes raw counts as input, so I downloaded the HTSeq-Count data from the GDC Data Portal. After the differential expression analysis, I'd like to obtain normalized expression values (RPKM/FPKM) for the miRNAs and genes for downstream analysis, such as using Pearson correlation between miRNAs and mRNAs to construct a miRNA-mRNA regulatory network. However, I have been stuck for days on how to get normalized expression values for both miRNAs and genes. Here are my questions:

  1. There is a function called "cpm" (counts per million) in edgeR, but it does not take gene length into account. edgeR also provides another version of normalized counts, "pseudo.counts", but I have read that these are quite difficult to interpret. So I am wondering whether I could use "logCPM" as the normalized expression values for the downstream analysis?

  2. If not, I realized that there is also a function called "rpkm" in edgeR which can calculate normalized expression values. However, it needs gene/miRNA length information to work. I do not know where to find the lengths of genes and miRNAs, since the HTSeq-Count files contain no such information. Could anyone please tell me how to do it? Should I download the gene annotation from Ensembl and the miRNA annotation from miRBase, and calculate the lengths myself? Is there an R package that could do this instead?

  3. Could I just download the RPKM files of miRNAs and genes from the GDC Data Portal to construct the miRNA-mRNA regulatory network? Is that right? It seems to be the easiest way for me, though.

Any help would be really appreciated.

geek_y (Barcelona) wrote, 24 months ago:

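There is not much use in calculating RPKM/FPKM for miRNAs: mature miRNAs are all roughly the same length (about 22 nt), so length normalization changes very little.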

To answer your question, you can use featureCounts to quantify your genes using a GTF file. This outputs a matrix of counts, which also includes a column of gene lengths. This column can be passed to the edgeR rpkm() function.
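For reference, the arithmetic behind RPKM is simple enough to sketch by hand. The snippet below is only an illustration in plain Python, not edgeR's implementation: it uses the raw library size and ignores the TMM normalization factors that edgeR can fold into the effective library size. The function names and the toy counts and lengths are made up for the example. The one detail taken from real data is that htseq-count output ends with summary rows whose names start with "__" (for example __no_feature), which should be dropped before summing.

```python
def drop_htseq_summary_rows(counts):
    """Remove htseq-count summary rows such as __no_feature and __ambiguous."""
    return {gene: c for gene, c in counts.items() if not gene.startswith("__")}


def cpm(counts):
    """Counts per million for one sample: count / library size * 1e6."""
    library_size = sum(counts.values())
    return {gene: c * 1e6 / library_size for gene, c in counts.items()}


def rpkm(counts, lengths_bp):
    """Reads per kilobase per million: CPM divided by gene length in kb."""
    per_million = cpm(counts)
    return {gene: per_million[gene] / (lengths_bp[gene] / 1e3) for gene in counts}


# Hypothetical toy data: raw counts for one sample and gene lengths in bp.
raw = {"geneA": 500, "geneB": 1500, "geneC": 8000, "__no_feature": 1234}
lengths_bp = {"geneA": 1000, "geneB": 3000, "geneC": 2000}

counts = drop_htseq_summary_rows(raw)
for gene, value in rpkm(counts, lengths_bp).items():
    print(gene, round(value, 1))
```

Note that geneA and geneB end up with the same RPKM even though geneB has three times the counts, because geneB is also three times as long.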

Option 3 would make sense. If you already have normalised data available, you can use it to calculate the correlations.
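As a rough sketch of that correlation step, assuming you have normalised values for one miRNA and one mRNA across the same samples, in the same sample order: the log2(x + 1) transform used here is a common choice rather than something prescribed in this thread, and the numbers are invented for illustration.

```python
import math


def log2p1(values):
    """log2(x + 1) transform to tame the skew of expression values."""
    return [math.log2(v + 1.0) for v in values]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


# Hypothetical normalised expression of one miRNA and one candidate target
# mRNA across five samples (same sample order in both lists).
mirna = [120.0, 300.0, 80.0, 500.0, 220.0]
mrna = [900.0, 400.0, 1100.0, 150.0, 600.0]

r = pearson(log2p1(mirna), log2p1(mrna))
print("Pearson r =", round(r, 3))
```

A strongly negative r across many samples is the kind of signal usually looked for in candidate miRNA-target pairs, but with real data you would also want a p-value and multiple-testing correction over all pairs tested.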


Could you please tell me where to download the GTF file? It seems that TCGA does not have this file for genes? Also, how can I obtain normalized miRNA expression data from HTSeq-Count data then? Thanks.

Reply written 24 months ago by alcs417

I assumed that you already had the RPKM values, based on your question "Could I just download the RPKM files of miRNAs and genes from GDC data portal".

I would suggest speaking to someone in your workplace who does bioinformatics. If you are just starting with RNA-Seq analysis and want to work with TCGA data, it takes a lot of work and guidance.

Reply written 24 months ago by geek_y

I eventually found the GTF file on the Ensembl website. Anyway, thank you very much for your help. I am working on it.

Reply written 24 months ago by alcs417