Hi all,
I have a simple question, but I just want to double check I'm not doing something stupid.
I have paired-end RNA-seq data, which I have quantified as raw counts with featureCounts. I now want to normalize the counts using the TPM formula. I read this blog post:
http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/
which simply says: divide the read counts by the gene length in kilobases to give reads per kilobase (RPK), sum all the RPK values and divide the sum by a million to get a per-million scaling factor, and then divide each RPK value by this scaling factor.
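To make sure I have the three steps straight, here is a minimal Python sketch of the recipe as the blog describes it. The function name `counts_to_tpm` and the toy counts and lengths are my own made-up illustration (only the first gene's numbers come from my table), and it assumes gene lengths are given in bases.

```python
def counts_to_tpm(counts, lengths_bp):
    """Convert raw read counts to TPM, given gene lengths in bases.

    Hypothetical helper for illustration; not part of any library.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")

    # Step 1: reads per kilobase (RPK) = count / (gene length in kilobases).
    rpk = [c / (length / 1000.0) for c, length in zip(counts, lengths_bp)]

    # Step 2: per-million scaling factor = sum of all RPK values / 1e6.
    scaling = sum(rpk) / 1e6
    if scaling == 0:
        # No reads at all: return zeros instead of dividing by zero.
        return [0.0 for _ in rpk]

    # Step 3: divide every RPK value by the scaling factor.
    return [r / scaling for r in rpk]


# Toy data: the first gene uses the numbers from my table,
# the other two are invented for the example.
counts = [25, 100, 0]
lengths_bp = [10934, 2000, 1500]

tpm = counts_to_tpm(counts, lengths_bp)
print(tpm)
```

By construction the TPM values of one sample add up to one million, which is a handy sanity check on any implementation.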
So taking my output from featureCounts, which looks something like this:
Geneid      Chr    Start          End            Strand      Length   sample.bam
NM_032291   chr1   66999639 etc   67000051 etc   + etc etc   10934    25
Do I use the values in the "Length" column directly as the gene length when computing reads per kilobase, or do I have to convert them to kilobases first? The Subread manual doesn't say much about this length value. And is the value in my "sample.bam" column definitely the read count I need?
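For concreteness, this is roughly what I would do with the table, as a sketch only. It assumes the "Length" column is in bases (as far as I understand, it is the total number of non-overlapping exonic bases of the gene), so it is divided by 1000 before computing RPK. The helper name `read_rpk`, the second gene (`GENE_B`) and the simplified coordinates are hypothetical; the real file has more values per row than shown here.

```python
import csv
import io

# Hypothetical featureCounts-style table (tab-delimited). The first line
# stands in for the comment line that featureCounts writes at the top.
FEATURECOUNTS_TEXT = (
    "# (comment line written by featureCounts; contents elided)\n"
    "Geneid\tChr\tStart\tEnd\tStrand\tLength\tsample.bam\n"
    "NM_032291\tchr1\t66999639\t67000051\t+\t10934\t25\n"
    "GENE_B\tchr2\t1000\t3000\t-\t2000\t100\n"
)


def read_rpk(handle, sample_column):
    """Return {gene id: RPK} for one sample column of the table."""
    # Skip comment lines so the header row is the first line parsed.
    data_lines = (line for line in handle if not line.startswith("#"))
    reader = csv.DictReader(data_lines, delimiter="\t")

    rpk = {}
    for row in reader:
        # Assumption: Length is in bases, so convert to kilobases first.
        length_kb = int(row["Length"]) / 1000.0
        rpk[row["Geneid"]] = float(row[sample_column]) / length_kb
    return rpk


rpk = read_rpk(io.StringIO(FEATURECOUNTS_TEXT), "sample.bam")
print(rpk)
```

With the numbers from my table this gives 25 / 10.934, so about 2.29 reads per kilobase for NM_032291, rather than 25 / 10934 if the length were used unconverted.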
The paper to cite seems to be Bo Li et al. 2010: https://academic.oup.com/bioinformatics/article/26/4/493/243395 - see page 494 eq. 2, though it references a paper within, and the RNAseq review paper Conesa et al. 2016 references Pachter 2011: https://arxiv.org/pdf/1104.3889.pdf.
Is there any way to calculate TPM from the count data of a publicly available dataset if I don't have the meanFragmentLength? I can't find it anywhere.