Question: Calculating Rpkm For Rna-Seq Data Including Several Samples Each Condition
2
7.8 years ago by
narges180
Finland
narges180 wrote:

Hi,

I previously asked how to calculate RPKM for RNA-seq data and got an answer. Now I want to calculate RPKM for an experiment with 8 samples per condition (two groups of 8 samples each). Do you think it is a good idea to calculate the RPKM for each sample of a condition separately and then take the mean for each condition? I would be grateful if someone could refer me to a paper or other source that has done this. Thank you in advance.

rpkm rna-seq • 4.9k views
modified 7.8 years ago by swbarnes29.1k • written 7.8 years ago by narges180

Are your 8 samples replicates? If so, what kind? What is your end goal? e.g., are you trying to identify significantly differentially expressed genes between the two groups? What software/package(s) are you using to process your RNA-seq data?

Yes, they are 8 biological replicates for each group. The goal is to compare gene expression levels, and of course the final aim is to find DE genes, but not now.

Hi, how did you calculate RPKM values? Do you have single-end data? I also want to know how to calculate RPKM values. Please reply, it is urgent. Thank you in advance.

5
7.8 years ago by
Berlin, Germany
johannes.helmuth110 wrote:

Hello narges,

I assume the 8 samples were sequenced with the same technique (possibly on the same flow cell).

There are different possibilities for what you want to do. Below I explain the two most common ones:

1. Use raw mapping counts (e.g. counted with bedtools) for calling differentially expressed genes and plug them into DESeq. The idea here is to keep the expression estimation as simple as possible because you compare the same entities: is gene x differentially expressed in cond1 vs. cond2? You can specify the biological replicates so that the expression variance within each condition is estimated. This within-condition variance is then used to judge the expression differences between conditions.
2. Use cuffdiff for calling differentially expressed transcripts. This program uses an optimization procedure to estimate the expression of different transcripts (exon skipping, ...) and can therefore call differential splicing events.

Depending on the granularity you need (differential gene or transcript calling), you can choose one of the two approaches.
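As a rough illustration of the first option, the sketch below assembles per-sample raw counts into a single gene-by-sample table, which is the general shape of input that count-based tools expect. The sample names, gene names, and counts are made up, and `build_count_matrix` and `to_tsv` are hypothetical helpers, not part of bedtools or DESeq.

```python
import csv
import io

# Hypothetical raw read counts per sample (gene -> count); all values are invented.
counts_by_sample = {
    "cond1_rep1": {"geneA": 120, "geneB": 0, "geneC": 43},
    "cond1_rep2": {"geneA": 98, "geneB": 2, "geneC": 51},
    "cond2_rep1": {"geneA": 240, "geneB": 1, "geneC": 40},
    "cond2_rep2": {"geneA": 210, "geneB": 0, "geneC": 47},
}


def build_count_matrix(counts):
    """Return (header, rows) with one row per gene and one column per sample."""
    samples = sorted(counts)
    genes = sorted({gene for per_sample in counts.values() for gene in per_sample})
    # Genes missing from a sample are treated as having zero reads.
    rows = [[gene] + [counts[s].get(gene, 0) for s in samples] for gene in genes]
    return ["gene"] + samples, rows


def to_tsv(header, rows):
    """Serialize the matrix as tab-separated text."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


header, rows = build_count_matrix(counts_by_sample)
count_table = to_tsv(header, rows)
print(count_table)
```

Each replicate keeps its own column, so the within-condition variance remains available to whatever tool reads the table.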

3
7.8 years ago by
swbarnes29.1k
United States
swbarnes29.1k wrote:

First, you need to ask the people who submitted the samples if they are true biological replicates, or technical replicates.

Technical replicates are good for knowing how much variability your library prep and instrument add. A technical replicate would be like taking the liver from one mouse, cutting it into 4 pieces, and treating them like 4 different samples. Any variation between the samples should be an artifact of the library prep and sequencing procedure. So if the exact same sample prepped multiple ways leads to big swings in RPKM, then you know that your prep is lousing things up, and you are going to have very little precision in your estimate of what the "real" expression was in the one sample.

In general, Illumina instruments do a good job with technical replicates. Your samples should be very, very similar to each other, and if that's the case, I think combining them might be okay.

Biological replicates are when you, say, expose 4 organisms to the same condition. Differences between the biological replicates are likely not artifacts but are due to real variation between organisms. You hope that your condition is powerful enough that the differences between replicates within a condition will be quite a bit smaller than the differences between two organisms exposed to different conditions.

For instance, let's say you check one control animal and one condition animal. For one gene, the control has an RPKM of 3 and the condition animal has an RPKM of 8. A big difference, right?

Well, now you check a bunch of different control animals, and you see that the range of RPKM among control animals is 2-10, and the range among your condition animals is 3-11. Now it looks like the condition doesn't actually change expression of that gene; it's naturally pretty variable, and the two sets of animals look pretty much the same.

So for biological replicates, you need to keep them separate, because you need to know the average and the variance for each gene. If there is a lot of organism-to-organism variability, you need to know how significant it is, and combining all the biological replicates together will lose that.
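To make the point concrete, here is a minimal sketch that keeps the replicates separate and summarizes one gene per group. The RPKM values are invented to span the 2-10 and 3-11 ranges from the example above, and `summarize` is a hypothetical helper.

```python
from statistics import mean, variance

# Invented RPKM values for a single gene, one value per biological replicate.
control = [2.0, 4.0, 6.0, 8.0, 10.0]
treated = [3.0, 5.0, 7.0, 9.0, 11.0]


def summarize(values):
    """Mean, sample variance (n - 1 denominator), and range of replicate values."""
    return {
        "mean": mean(values),
        "variance": variance(values),
        "min": min(values),
        "max": max(values),
    }


control_stats = summarize(control)
treated_stats = summarize(treated)
print("control:", control_stats)
print("treated:", treated_stats)
```

With these numbers the group means differ by 1 while the variance within each group is 10, so the spread between animals dwarfs the difference between conditions.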

So biological replicates are required for good RNA-seq experiments. But people sometimes cut corners, so don't assume that that's what you have.

It's not the most sophisticated analysis around, but working out the averages and variances for each gene in each group would be a useful, simple place to start. Then, do unpaired t-tests, which look at the averages of each group and their variances, and give you a p-value telling you how likely a difference that large would be if your condition were not really changing the expression of the gene.
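A minimal sketch of such an unpaired test for one gene, assuming the Welch variant (which does not require equal variances in the two groups). The values are invented and `welch_t` is a hypothetical helper. It returns only the t statistic and the degrees of freedom; turning these into a p-value needs the t distribution, for example via `scipy.stats.ttest_ind(a, b, equal_var=False)` if SciPy is available.

```python
from math import sqrt
from statistics import mean, variance

# Invented RPKM values for one gene, one value per biological replicate.
control = [2.0, 4.0, 6.0, 8.0, 10.0]
treated = [3.0, 5.0, 7.0, 9.0, 11.0]


def welch_t(a, b):
    """Welch's unpaired t statistic and Welch-Satterthwaite degrees of freedom."""
    # Squared standard error of each group mean.
    se2_a = variance(a) / len(a)
    se2_b = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(se2_a + se2_b)
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(a) - 1) + se2_b ** 2 / (len(b) - 1)
    )
    return t, df


t_stat, dof = welch_t(control, treated)
print(f"t = {t_stat:.3f}, df = {dof:.1f}")
```

A t statistic this close to zero reflects what the ranges already suggested: the between-group difference is small relative to the replicate-to-replicate variability. In a real analysis this would be repeated for every gene, which also raises the question of multiple-testing correction.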

Thank you so much for the complete explanation. The replicates are 8 biological replicates for each condition and before finding DE genes between these two conditions, I am supposed to find the gene expression level for each condition within their 8 biological replicates.

I am supposed to find the gene expression level for each condition within their 8 biological replicates.

You can retrieve transcript FPKMs (analogous to RPKM) by running cufflinks on the aligned reads for each condition. The aligned reads are normally saved as a .bam file after the mapping procedure. Based on this, you can easily call differentially expressed genes or transcripts.
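For reference, RPKM for a gene is the read count divided by the gene length in kilobases and by the total mapped reads in millions. The sketch below applies that standard definition per replicate and then averages within a condition, as asked in the question. The counts, gene length, and library sizes are invented, and `rpkm` is a hypothetical helper rather than what cufflinks does internally.

```python
from statistics import mean


def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads."""
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)


# Invented data for one 2,000 bp gene: (reads on gene, total mapped reads) per replicate.
gene_length_bp = 2000
condition_replicates = {
    "rep1": (500, 10_000_000),
    "rep2": (300, 5_000_000),
}

# Compute RPKM separately for every replicate, then summarize the condition.
per_sample_rpkm = {
    name: rpkm(count, gene_length_bp, total)
    for name, (count, total) in condition_replicates.items()
}
condition_mean = mean(per_sample_rpkm.values())
print(per_sample_rpkm, condition_mean)
```

Keeping `per_sample_rpkm` around, rather than only the mean, preserves the replicate-level values needed later for variance estimates.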