Hi,
I am analyzing RNA-seq data produced with 3' mRNA-Seq.
I calculated gene counts using htseq-count and need to normalize them before performing a Kaplan-Meier analysis. I would like to use CPM normalization, because I cannot also normalize the data by gene length.
I have a question about the CPM normalization method. The formula is CPM = (counts on the feature / library size) × 1,000,000, so it normalizes the counts by the library size.
I was wondering if the library size should be:
- the number of raw sequence reads produced by the sequencing; or
- the number of reads uniquely mapped to the features, i.e. the sum of the counts of all features given by htseq-count.
Thank you!
Concetta
Be aware, though, that edgeR's cpm() uses the normalisation factors from calcNormFactors to correct for library composition, and is therefore more sophisticated than correcting only for total read count differences.
Thank you! I have another question. When normalizing counts, do you suggest also including reads that were not assigned to any feature, such as unmapped and/or multimapping reads?
No. Exclude them.
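To make that concrete, here is a minimal Python sketch of plain CPM in which the library size is the sum of the counts assigned to features, and the htseq-count summary rows (the ones starting with "__", such as __no_feature or __not_aligned) are dropped first. The function names and the toy counts are made up for illustration, and the sketch does not apply any calcNormFactors-style scaling, so treat it as an illustration of the formula rather than a replacement for edgeR.

```python
# Plain CPM from htseq-count style output (illustrative sketch, no TMM scaling).

SPECIAL_PREFIX = "__"  # htseq-count summary rows (e.g. __no_feature) start with "__"


def read_htseq_counts(lines):
    """Parse 'feature<TAB>count' lines into a dict, skipping the summary rows."""
    counts = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        feature, value = line.split("\t")
        if feature.startswith(SPECIAL_PREFIX):
            continue  # unassigned, unmapped and multimapping reads are excluded
        counts[feature] = int(value)
    return counts


def cpm(counts):
    """Counts per million, with library size = sum of counts over all features."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("library size is zero; cannot compute CPM")
    return {gene: c / library_size * 1_000_000 for gene, c in counts.items()}


# Toy input: the gene names and numbers are invented for this example.
example = [
    "geneA\t500",
    "geneB\t1500",
    "geneC\t0",
    "__no_feature\t300",
    "__ambiguous\t20",
    "__too_low_aQual\t0",
    "__not_aligned\t100",
    "__alignment_not_unique\t80",
]

gene_counts = read_htseq_counts(example)
cpm_values = cpm(gene_counts)
print(cpm_values)
```

With these toy numbers the library size is 2000 (500 + 1500 + 0), not 2500, because the 500 reads in the "__" rows are left out of the denominator.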