How to calculate RPKM from featureCounts output
3
0
Entering edit mode
7.2 years ago
gundalav ▴ 380

Lets say the featureCounts output is this (for convenience only contain 2 genes:

Geneid  Chr     Start   End     Strand  Length  myfile.bam
Xkr4    chr1;chr1;chr1  3214482;3421702;3670552 3216968;3421901;3671498 -;-;-   3634    100
Npbwr1  chr1    5913707 5917398 -       3692   30

As far as I know the formula of RPKM is

RPKM = count_assigned_to_gene / (gene_length/1000 * totalReads /1e6)

My question is where can I get totalReads? Is it directly from the featureCounts output (100 + 30 ==130)?

Xkr4 = 100 / (3634/1000 * 130 / 1e6)

Or should I get it from BAM files e.g. bamtools stats.

RNA-Seq • 10k views
ADD COMMENT
1
Entering edit mode
7.2 years ago
Benn 8.3k

Why don't you use edgeR for RPKM calculation?

You'll have to import the counts file first, and then you can use the values from the Length column also for the calculation.

ADD COMMENT
0
Entering edit mode

@b.nota in other words the totalReads in the calculation is not the total assigned reads to the genes? So I have to directly count from the BAM file? Would prefer to recode cause later need to calculate TPM as well.

ADD REPLY
1
Entering edit mode

From edgeR help page:

lib.size: library size, defaults to colSums(x)

So to answer your question, your totalReads should be the sum of the whole sample column indeed. No need to use your BAM file.

ADD REPLY
0
Entering edit mode

@b.nota i.e. 100 + 30 = 130 (sum of 7th column of featureCounts output)??

ADD REPLY
1
Entering edit mode

Yes, those are your read counts. If you have more samples you can use featureCounts for all BAM files in one run.

ADD REPLY
0
Entering edit mode
ADD COMMENT
0
Entering edit mode

I'm interested in processing from featureCounts not htseq.

ADD REPLY
0
Entering edit mode

You can use results from any tool (matrix of your count sample wise or single sample along with length column). It is just general script for calculating RPKM.

Example input: (start with' #' to retain headers in your output)

#Gene Sample1 Sample2 Sample3 Length_of_the_gene

Gene1 10 20 40 1200

Gene2 23 45 60 800

Example run:

perl rpkm_script_beta.pl sample_count_test.count 2:4 5 > sample_count_test.rpkm

2 to 4 th column contains read counts and fifth column contains gene length.

sample output:

Gene Sample1 Sample2 Sample3 Length_of_the_gene

Gene1 rpkm_val rpkm_val rpkm_val 1200

Gene2 rpkm_val rpkm_val rpkm_val 800

ADD REPLY
0
Entering edit mode

HEllo,

As you mentioned above my data input is given as following:

#Geneid STB32   STB34   STB36   STB40   STB65   STB79    Length
ENSG00000000003 1247    1938    1121    1307    379 1920  11231
ENSG00000000005 0   15  34  7   1   3   15678
ENSG00000000419 1420    1874    1684    915 2431    1603   18768
ENSG00000000457 2211    1302    887 1902    1226    1990    33987
ENSG00000000460 1765    1039    934 678 618 483   44567
ENSG00000000938 315 548 879 475 562 449   35467

This is a csv file first. Then I chnaged the format into .count and gave the command with rpkm_script_beta.pl and the output has all zeros.

perl rpkm_script_beta.pl sample_count_test.count 2:7 8 > sample_count_test.rpkm

Could you please give me the reply.

ADD REPLY
0
Entering edit mode

Hi, the file looks absolutely fine (Just recheck once if everything is tab-delimited). Also could you please report few lines from your output ?

ADD REPLY
0
Entering edit mode

I save the file now into tab-delimited "sample_count_test.count". I got the following error after giving the command. Illegal division by zero at rpkm_script_beta.pl line 44, <> line 66.

ADD REPLY
0
Entering edit mode

Form this error 'Illegal division by zero at rpkm_script_beta.pl line 44, <> line 66' it looks like your file is not formatted properly or it is having non-numeric entries in the counts (only zeros can be handled by this script not non-numeric entries).

ADD REPLY
0
Entering edit mode

Ok. I saved the file now in tab-delimited and looks like it worked. I didn't get any error. But I don't understand why the Sample names are missing.

Geneid  0   0   0   0   0   0   Length
ENSG00000000003 2.245105716 3.553743948 1.681000503 2.734256037 0.640389439 2.830847732   11231
ENSG00000000005 0   0.020645275 0.038268215 0.010991544 0.001268242 0.003319966  15678
ENSG00000000419 1.221842092 1.642322123 1.206870618 0.914831441 1.963117553 1.129549323   18768
ENSG00000000457 1.009621537 0.605540041 0.337353618 1.009191598 0.525405897 0.744162668   33987
ENSG00000000460 0.187392801 0.112353281 0.082593707 0.083643296 0.061578808 0.041995259   44567
ENSG00000000938 0.276588688 0.490079389 0.642842565 0.484630475 0.463121332 0.322860668   35467
ADD REPLY
0
Entering edit mode

R u sure you used '#' in the beginning of the header ?

ADD REPLY
0
Entering edit mode

Hello,

Yes I used "#" for the Geneid as you showed in the example. But still I didn't get the sample names in my output. But based on input data can I replace "0" with sample names?

ADD REPLY
1
Entering edit mode

Yes the results will be in same order as the original input file.

ADD REPLY
0
Entering edit mode
7.2 years ago
James Ashmore ★ 3.4k

It's fine to use the column sums as the total read count for each of the samples.

ADD COMMENT

Login before adding your answer.

Traffic: 2670 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6