I have a simple question, but I just want to double-check that I'm not doing something stupid.
I have paired-end RNA-seq data, for which I have used featureCounts to get raw counts. I now want to normalize these using the TPM formula. I read a blog post which simply says: divide the read counts by the gene length in kilobases to give reads per kilobase (RPK); sum all the RPK values and divide by a million to get a per-million scaling factor; then divide each RPK value by this scaling factor.
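The three steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not a reference implementation: the function name `counts_to_tpm` and the toy numbers are made up for the example, and it assumes gene lengths are supplied in base pairs and converted to kilobases inside the function.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in base pairs."""
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")
    # Step 1: reads per kilobase (RPK) = count / (gene length in kb).
    rpk = [c / (l / 1000.0) for c, l in zip(counts, lengths_bp)]
    # Step 2: per-million scaling factor = sum of all RPK values / 1e6.
    scaling = sum(rpk) / 1e6
    if scaling == 0:
        # No reads at all in this sample: avoid dividing by zero.
        return [0.0 for _ in rpk]
    # Step 3: divide every RPK value by the scaling factor.
    return [r / scaling for r in rpk]


# Toy values (not real data): three genes with made-up counts and lengths.
toy_counts = [10, 20, 30]
toy_lengths_bp = [1000, 2000, 3000]
toy_tpm = counts_to_tpm(toy_counts, toy_lengths_bp)
```

A useful sanity check is that the TPM values within one sample always sum to one million, whatever the input counts are.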
My output from featureCounts looks something like this:
Geneid     Chr   Start         End           Strand     Length  sample.bam
NM_032291  chr1  66999639 etc  67000051 etc  + etc etc  10934   25
Do I use the values in the "Length" column directly as the gene length, or do I have to convert them to kilobases first? The Subread manual doesn't say much about this length value. And is the value in my "sample.bam" column definitely the read count I need?
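As a rough sketch of how the table could be read and the "Length" column turned into kilobases, the snippet below parses a featureCounts-style tab-separated table and computes RPK per gene. It rests on assumptions that should be checked against your own Subread version: that "Length" is given in base pairs, that the file starts with a `#` comment line followed by a header row, and that the sample column holds the raw count. The helper name `read_featurecounts` is hypothetical, and the "etc" fields from the example row are kept as placeholders.

```python
import csv
import io


def read_featurecounts(handle, sample_column):
    """Yield (gene_id, length_bp, count) tuples from a featureCounts table."""
    # Assumption: lines starting with '#' are comments and can be skipped.
    rows = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(rows, delimiter="\t")
    for row in reader:
        # float() also copes with fractional counts, if those were enabled.
        yield row["Geneid"], int(row["Length"]), float(row[sample_column])


# Placeholder table mirroring the example row; "etc" marks elided values.
example_table = (
    "# placeholder for the comment line written by featureCounts\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
    "NM_032291\tchr1\t66999639 etc\t67000051 etc\t+ etc etc\t10934\t25\n"
)

genes = list(read_featurecounts(io.StringIO(example_table), "sample.bam"))

# Assumption: Length is in base pairs, so divide by 1000 to get kilobases.
rpk_by_gene = {gene: count / (length_bp / 1000.0)
               for gene, length_bp, count in genes}
```

With a real file, `io.StringIO(example_table)` would be replaced by an opened file handle, and the RPK values for all genes would then be summed to get the per-million scaling factor.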
The paper to cite seems to be Li et al. 2010: https://academic.oup.com/bioinformatics/article/26/4/493/243395 (see page 494, eq. 2), although it in turn refers to an earlier paper, and the RNA-seq review by Conesa et al. 2016 cites Pachter 2011: https://arxiv.org/pdf/1104.3889.pdf.