Question: Calculating TPM from featureCounts output
4
gravatar for nash.claire
2.0 years ago by
nash.claire250
Canada
nash.claire250 wrote:

Hi all,

Have a simple question but just want to double check I'm not doing something stupid. 

I have paired-end RNA-seq data for which I have used featureCounts to quantify raw counts. I now want to normalize using the TPM formula. I read this blog :- 

http://www.rna-seqblog.com/rpkm-fpkm-and-tpm-clearly-explained/

which simply says, divide read counts by gene length in kilobases to give reads per kilobase (RPK), sum all the RPK values and divide by a million for a per million scaling factor and then divide all RPK values by this scaling factor. 

So taking my output from featureCounts which looks something like this : -

Geneid Chr Start End Strand Length sample.bam
NM_032291 chr1 66999639 etc 67000051 etc + etc etc 10934 25

I use the values in the "Length" column as reads per kilobase or do I have to convert this to per kilobase first? It didn't say in the Subread manual much about this length value. And the value in my "sample.bam" column is definitely the read count value I need?? 

rna-seq • 10k views
ADD COMMENTlink modified 2.0 years ago by Kamil1.7k • written 2.0 years ago by nash.claire250
8
gravatar for Kamil
2.0 years ago by
Kamil1.7k
Boston
Kamil1.7k wrote:

Consider using the tximport R package:

https://bioconductor.org/packages/devel/bioc/vignettes/tximport/inst/doc/tximport.html


Below, I'm sharing a function I have used to convert counts to TPM.

See here for a script that gets the number of coding bases per gene.

See here for a script that gets the mean fragment length for each library.

ADD COMMENTlink modified 3 months ago • written 2.0 years ago by Kamil1.7k

Simple and clean, thanks for the function.

As a suggestion, shouldn't there be a parameter to keep feature names? As in:

counts_to_tpm <- function(counts, featureLength, meanFragmentLength, featureName)
# Ensure valid arguments.
stopifnot(length(featureLength) == nrow(counts))
stopifnot(length(meanFragmentLength) == ncol(counts))

# Compute effective lengths of features in each library.
 effLen <- do.call(cbind, lapply(1:ncol(counts), function(i) {
 featureLength - meanFragmentLength[i] + 1
}))

# Exclude genes with length less than the mean fragment length.
idx <- apply(effLen, 1, function(x) min(x) > 1)
counts <- counts[idx,]
effLen <- effLen[idx,]
featureLength <- featureLength[idx]
###  featureNameAdjusted <- featureName[idx]

# Process one column at a time.
tpm <- do.call(cbind, lapply(1:ncol(counts), function(i) {
rate = log(counts[,i]) - log(effLen[,i])
denom = log(sum(exp(rate)))
exp(rate - denom + log(1e6))
}))

# Copy the column names from the original matrix add feature names to rownames (or extra column?)
colnames(tpm) <- colnames(counts)
###rownames(tpm) <- featureNameAdjusted
return(tpm)
}
ADD REPLYlink modified 14 months ago • written 14 months ago by Fernando Rossello0

Thanks for the suggestion! I modified the function for you.

ADD REPLYlink written 14 months ago by Kamil1.7k

Great Kamil! Thanks.

ADD REPLYlink written 14 months ago by Fernando Rossello0
5
gravatar for Rob
2.0 years ago by
Rob2.2k
United States
Rob2.2k wrote:

First, let me suggest that you'd probably be better off using a tool for explicitly estimating relative abundance than processing the output of a tool like featureCounts (see e.g. this recent manuscript by Soneson et al.). To answer your question, that's a very round-about way of computing TPM, which seems to introduce some arbitrary scaling factors for no real reason (e.g. why compute RPK rather than just reads per base?).  So, anyway, you can compute TPM in the following manner.  Let N_i be the number of reads coming from gene i and L_i be the length of gene i. Then TPM_i is given by the following formula:

TPM_i = 10^6 * (N_i / L_i) / [Σ_j (N_j / L_j)]

That is, you first compute the rate of fragments per base for each gene (given by N_i / L_i for gene i), then you divide by the sum of this quantity for all genes --- this normalizes the measurements so they sum to 1.  Now, you have a measurement with a non-negative value for each gene that sums to 1 (is a partition of unity); this is called the "transcript fraction" (or, in your case, "gene fraction"). You simply multiply this by 1 million to convert it to "transcripts per million" (TPM).

ADD COMMENTlink modified 2.0 years ago • written 2.0 years ago by Rob2.2k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1273 users visited in the last hour