Question: How to calculate RPKM from featureCounts output
0
gravatar for gundalav
2.6 years ago by
gundalav290
La La Land
gundalav290 wrote:

Lets say the featureCounts output is this (for convenience only contain 2 genes:

Geneid  Chr     Start   End     Strand  Length  myfile.bam
Xkr4    chr1;chr1;chr1  3214482;3421702;3670552 3216968;3421901;3671498 -;-;-   3634    100
Npbwr1  chr1    5913707 5917398 -       3692   30

As far as I know the formula of RPKM is

RPKM = count_assigned_to_gene / (gene_length/1000 * totalReads /1e6)

My question is where can I get totalReads? Is it directly from the featureCounts output (100 + 30 ==130)?

Xkr4 = 100 / (3634/1000 * 130 / 1e6)

Or should I get it from BAM files e.g. bamtools stats.

rna-seq • 4.1k views
ADD COMMENTlink modified 2.6 years ago by James Ashmore2.7k • written 2.6 years ago by gundalav290
1
gravatar for Benn
2.6 years ago by
Benn7.8k
Netherlands
Benn7.8k wrote:

Why don't you use edgeR for RPKM calculation?

You'll have to import the counts file first, and then you can use the values from the Length column also for the calculation.

ADD COMMENTlink written 2.6 years ago by Benn7.8k

@b.nota in other words the totalReads in the calculation is not the total assigned reads to the genes? So I have to directly count from the BAM file? Would prefer to recode cause later need to calculate TPM as well.

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by gundalav290
1

From edgeR help page:

lib.size: library size, defaults to colSums(x)

So to answer your question, your totalReads should be the sum of the whole sample column indeed. No need to use your BAM file.

ADD REPLYlink written 2.6 years ago by Benn7.8k

@b.nota i.e. 100 + 30 = 130 (sum of 7th column of featureCounts output)??

ADD REPLYlink written 2.6 years ago by gundalav290
1

Yes, those are your read counts. If you have more samples you can use featureCounts for all BAM files in one run.

ADD REPLYlink written 2.6 years ago by Benn7.8k
0
gravatar for EagleEye
2.6 years ago by
EagleEye6.5k
Sweden
EagleEye6.5k wrote:

Perl solution:

https://github.com/santhilalsubhash/rpkm_rnaseq_count

R solution:

A: How to normalise read count per gene

ADD COMMENTlink modified 2.6 years ago • written 2.6 years ago by EagleEye6.5k

I'm interested in processing from featureCounts not htseq.

ADD REPLYlink written 2.6 years ago by gundalav290

You can use results from any tool (matrix of your count sample wise or single sample along with length column). It is just general script for calculating RPKM.

Example input: (start with' #' to retain headers in your output)

#Gene Sample1 Sample2 Sample3 Length_of_the_gene

Gene1 10 20 40 1200

Gene2 23 45 60 800

Example run:

perl rpkm_script_beta.pl sample_count_test.count 2:4 5 > sample_count_test.rpkm

2 to 4 th column contains read counts and fifth column contains gene length.

sample output:

Gene Sample1 Sample2 Sample3 Length_of_the_gene

Gene1 rpkm_val rpkm_val rpkm_val 1200

Gene2 rpkm_val rpkm_val rpkm_val 800

ADD REPLYlink modified 2.6 years ago • written 2.6 years ago by EagleEye6.5k

HEllo,

As you mentioned above my data input is given as following:

#Geneid STB32   STB34   STB36   STB40   STB65   STB79    Length
ENSG00000000003 1247    1938    1121    1307    379 1920  11231
ENSG00000000005 0   15  34  7   1   3   15678
ENSG00000000419 1420    1874    1684    915 2431    1603   18768
ENSG00000000457 2211    1302    887 1902    1226    1990    33987
ENSG00000000460 1765    1039    934 678 618 483   44567
ENSG00000000938 315 548 879 475 562 449   35467

This is a csv file first. Then I chnaged the format into .count and gave the command with rpkm_script_beta.pl and the output has all zeros.

perl rpkm_script_beta.pl sample_count_test.count 2:7 8 > sample_count_test.rpkm

Could you please give me the reply.

ADD REPLYlink written 2.2 years ago by Vasu410

Hi, the file looks absolutely fine (Just recheck once if everything is tab-delimited). Also could you please report few lines from your output ?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by EagleEye6.5k

I save the file now into tab-delimited "sample_count_test.count". I got the following error after giving the command. Illegal division by zero at rpkm_script_beta.pl line 44, <> line 66.

ADD REPLYlink written 2.1 years ago by Vasu410

Form this error 'Illegal division by zero at rpkm_script_beta.pl line 44, <> line 66' it looks like your file is not formatted properly or it is having non-numeric entries in the counts (only zeros can be handled by this script not non-numeric entries).

ADD REPLYlink written 2.1 years ago by EagleEye6.5k

Ok. I saved the file now in tab-delimited and looks like it worked. I didn't get any error. But I don't understand why the Sample names are missing.

Geneid  0   0   0   0   0   0   Length
ENSG00000000003 2.245105716 3.553743948 1.681000503 2.734256037 0.640389439 2.830847732   11231
ENSG00000000005 0   0.020645275 0.038268215 0.010991544 0.001268242 0.003319966  15678
ENSG00000000419 1.221842092 1.642322123 1.206870618 0.914831441 1.963117553 1.129549323   18768
ENSG00000000457 1.009621537 0.605540041 0.337353618 1.009191598 0.525405897 0.744162668   33987
ENSG00000000460 0.187392801 0.112353281 0.082593707 0.083643296 0.061578808 0.041995259   44567
ENSG00000000938 0.276588688 0.490079389 0.642842565 0.484630475 0.463121332 0.322860668   35467
ADD REPLYlink written 2.1 years ago by Vasu410

R u sure you used '#' in the beginning of the header ?

ADD REPLYlink modified 2.1 years ago • written 2.1 years ago by EagleEye6.5k

Hello,

Yes I used "#" for the Geneid as you showed in the example. But still I didn't get the sample names in my output. But based on input data can I replace "0" with sample names?

ADD REPLYlink written 2.1 years ago by Vasu410
1

Yes the results will be in same order as the original input file.

ADD REPLYlink written 2.1 years ago by EagleEye6.5k
0
gravatar for James Ashmore
2.6 years ago by
James Ashmore2.7k
UK/Edinburgh/MRC Centre for Regenerative Medicine
James Ashmore2.7k wrote:

It's fine to use the column sums as the total read count for each of the samples.

ADD COMMENTlink written 2.6 years ago by James Ashmore2.7k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1835 users visited in the last hour