I have some RNA-seq samples that I want to normalize and then output as RPKM expression values. I plan to use the following edgeR commands:
library(edgeR)
expr <- DGEList(counts = data, group = conditions)
expr <- calcNormFactors(expr)  # TMM is the default method
expr_norm <- rpkm(expr, gene.length = vector, log = FALSE)
I'd be very grateful if you could answer these questions.
Q1. When creating expr <- DGEList(counts=data, group=conditions), what effect does specifying groups have on the TMM normalisation? How does TMM use this information, and how would the results differ if you specified groups versus not?
Q2. The expression data I am using was obtained by mapping reads onto de novo contigs assembled with Trinity. For each cluster, I chose the most highly expressed contig as the "best isoform", and then summed expression across all the contigs in the cluster to get the expression value for that cluster. I therefore do not have one obvious gene length to use. Should I use the length of the longest contig from the cluster?
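To make the Q2 setup concrete, here is a minimal sketch in base R of how the cluster-level values are built. The `contigs` table, its column names, and all the numbers are made up for illustration and are not my real data; it only shows the two candidate lengths (best isoform versus longest contig) side by side.

```r
# Toy contig-level table. Every name and number here is hypothetical.
contigs <- data.frame(
  cluster = c("c1", "c1", "c1", "c2", "c2"),
  contig  = c("c1_seq1", "c1_seq2", "c1_seq3", "c2_seq1", "c2_seq2"),
  length  = c(1200, 800, 2000, 500, 1500),
  count   = c(300, 50, 100, 20, 400),
  stringsAsFactors = FALSE
)

# Cluster-level expression: sum the counts over all contigs in the cluster.
cluster_counts <- tapply(contigs$count, contigs$cluster, sum)

# "Best isoform": the most highly expressed contig in each cluster.
best <- do.call(rbind, lapply(split(contigs, contigs$cluster),
                              function(d) d[which.max(d$count), ]))

# Alternative length: the longest contig in each cluster.
longest_len <- tapply(contigs$length, contigs$cluster, max)

# One row per cluster, with both candidate lengths for comparison.
candidates <- data.frame(
  cluster             = names(cluster_counts),
  summed_count        = as.vector(cluster_counts),
  best_isoform        = best$contig,
  best_isoform_length = best$length,
  longest_length      = as.vector(longest_len),
  stringsAsFactors    = FALSE
)
print(candidates)
```

In this toy example the two choices differ for cluster c1 (the best isoform is 1200 bp, the longest contig is 2000 bp), which is exactly the situation I am unsure about when passing gene.length to rpkm().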