I am working on a pan-cancer analysis project. I have done differential expression analysis for miRNAs and genes with edgeR. Since edgeR takes raw counts as input, I downloaded the HTSeq-Count data from the GDC Data Portal. After the differential expression analysis, I would like to obtain normalized expression values (RPKM/FPKM) for the miRNAs and genes for downstream analysis, such as using Pearson correlation between miRNAs and mRNAs to construct a miRNA-mRNA regulatory network. However, I have been stuck for days on how to get the normalized expression values for both miRNAs and genes. Here are my questions:
There is a function called "cpm" (counts per million) in edgeR, but it does not take gene length into account. edgeR also provides another kind of normalized counts, "pseudo.counts"; however, some people say these are quite difficult to interpret. So I am wondering whether I could use "logCPM" as the normalized expression values for the downstream analysis?
If not, I noticed that edgeR also has a function called "rpkm" that can calculate normalized expression values. However, it needs the length of each gene/miRNA to work. I do not know where to find the length information for genes and miRNAs, since the HTSeq-Count files do not contain it. Could anyone please tell me how to get it? Should I download the gene annotation from Ensembl and the miRNA annotation from miRBase, and calculate the lengths myself? Is there an R package that could do this instead?
Alternatively, could I just download the RPKM files for miRNAs and genes from the GDC Data Portal and use them to construct the miRNA-mRNA regulatory network? Would that be valid? It seems to be the easiest way for me.
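To make the distinction between the two measures concrete, here is a minimal sketch in plain Python of how I understand CPM and RPKM. It is not edgeR's implementation (edgeR's "cpm" and "rpkm" can also apply normalization factors and handle the prior count differently), and the gene names, counts, and lengths are made-up toy values used only for illustration.

```python
import math

# Hypothetical toy data for a single sample (not real GDC values).
toy_counts = {"geneA": 500, "geneB": 1500}       # raw read counts
toy_lengths = {"geneA": 1000, "geneB": 3000}     # feature lengths in bases


def cpm(counts):
    """Counts per million: count / library size * 1e6 (ignores length)."""
    lib_size = sum(counts.values())
    return {g: c / lib_size * 1e6 for g, c in counts.items()}


def log_cpm(counts, prior_count=2.0):
    """Simplified log2 CPM: a constant pseudo-count avoids log2(0).

    Assumption: this is a plain formula, not edgeR's exact prior handling.
    """
    lib_size = sum(counts.values())
    return {g: math.log2((c + prior_count) / lib_size * 1e6)
            for g, c in counts.items()}


def rpkm(counts, lengths):
    """Reads per kilobase per million: count * 1e9 / (length * library size)."""
    lib_size = sum(counts.values())
    return {g: c * 1e9 / (lengths[g] * lib_size) for g, c in counts.items()}


toy_cpm = cpm(toy_counts)
toy_log_cpm = log_cpm(toy_counts)
toy_rpkm = rpkm(toy_counts, toy_lengths)
```

In this toy case geneB has three times the CPM of geneA, but because it is also three times longer, the two features end up with the same RPKM. That length dependence is why I am unsure which measure is appropriate for my correlation analysis.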
Any help would be really appreciated.