Hi qicaibiology,
Authors have managed to publish many bad RNA-seq experiments in the past. In the very early days this was at least in part due to a lack of knowledge in the field, but even once things were understood, papers are often reviewed by people who do not have this specialist understanding.
The need to do things more than once (replication) is one of the very first things we learn about science in school. It is no different because we are using big fancy technology. Knowing that I got 367 reads for a gene when I took a sample from condition A and 472 when I took a sample from condition B is of little use if I don't know how much samples from the same condition vary from sample to sample. Now it is true that if I sequenced the same library made from the same sample many times, I'd find that the variance of a gene with an average of 367 was 367. But we are interested in how much samples vary from each other, not how repeated measurements of the same sample vary.
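To put rough numbers on that, here is a minimal sketch in Python. It assumes technical replicates are Poisson (variance equal to the mean) and that biological replicates follow a negative binomial with variance `mean + dispersion * mean**2`. The dispersion of 0.1 is purely an illustrative assumption, not an estimate from any real data set; with a single replicate you have no way to estimate it.

```python
import math

def poisson_sd(mean):
    """Standard deviation of repeated sequencing of the same library (Poisson)."""
    return math.sqrt(mean)

def negbin_sd(mean, dispersion):
    """Standard deviation between biological replicates (negative binomial)."""
    return math.sqrt(mean + dispersion * mean ** 2)

count_a = 367       # reads for the gene in condition A (from the post above)
count_b = 472       # reads for the gene in condition B (from the post above)
dispersion = 0.1    # ASSUMED for illustration; unknown without replicates

tech_sd = poisson_sd(count_a)
bio_sd = negbin_sd(count_a, dispersion)
diff = count_b - count_a

print(f"difference between conditions: {diff}")
print(f"technical SD (Poisson):        {tech_sd:.1f}")
print(f"biological SD (assumed disp.): {bio_sd:.1f}")
print(f"difference in technical SDs:   {diff / tech_sd:.2f}")
print(f"difference in biological SDs:  {diff / bio_sd:.2f}")
```

Under these assumptions the difference of 105 reads looks large against technical noise alone, but is less than one standard deviation of the between-sample variation. Which of those two pictures is right depends entirely on the dispersion, and that is exactly what replicates are for.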
The defects with RPKM are more subtle and arise from the fact that
- The sum of RPKM across all genes in a sample is not constant from sample to sample
- Even if it were, that would make RPKM a compositional measure - that is, the RPKM of one gene is affected by the RPKM of other genes
- We only approximately know the distribution of RPKM. In the early days, many argued that log RPKM was approximately normally distributed, but for a long time people argued about how approximate that approximation was. We now know what the precise distribution of counts is, so counts are a much better option for differential expression (although not necessarily for other analyses).
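The first two points can be seen with a toy calculation. This is a minimal sketch in Python using the usual definition of RPKM (reads per kilobase of transcript per million mapped reads); the gene names, lengths and counts are made up for illustration.

```python
import math

def rpkm(counts, lengths_bp):
    """RPKM = count * 1e9 / (gene length in bp * total mapped reads)."""
    total_reads = sum(counts.values())
    return {
        gene: count * 1e9 / (lengths_bp[gene] * total_reads)
        for gene, count in counts.items()
    }

# Hypothetical gene lengths and read counts, for illustration only.
lengths_bp = {"geneA": 1000, "geneB": 2000, "geneC": 500}
sample1 = {"geneA": 100, "geneB": 200, "geneC": 700}
sample2 = {"geneA": 100, "geneB": 200, "geneC": 1700}  # only geneC changes

rpkm1 = rpkm(sample1, lengths_bp)
rpkm2 = rpkm(sample2, lengths_bp)

for gene in lengths_bp:
    print(f"{gene}: sample1 RPKM = {rpkm1[gene]:.0f}, sample2 RPKM = {rpkm2[gene]:.0f}")

print(f"sum of RPKM, sample1: {sum(rpkm1.values()):.0f}")
print(f"sum of RPKM, sample2: {sum(rpkm2.values()):.0f}")
```

geneA and geneB have identical counts in both samples, yet their RPKM halves in the second sample because geneC took up more of the library. The RPKM totals of the two samples also differ, so RPKM values are not directly comparable between them.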
edgeR will agree to process single-replicate data, even though the results will not mean much (and recent versions will warn you of this), but it will only do so from count data, not from RPKM. So either you are misunderstanding the methods in the paper you are referencing, or (more likely) the authors have not recorded what they did properly. This is one reason why many people are pressing for the release of analysis code instead of a written description of what was done.
There may be ways to get some meaning from RPKM data, but it would not be using standard software, or at least not in the way it was intended, and the results would still not be fully robust. I know of no software designed to analyse RPKM data for differential expression.
All this is a long-winded way of saying that a lot of publicly available RNA-seq data is not good data and you shouldn't waste your time on it. Not all science that got published should have been, and not all science that was published some time ago would get published now. Accept that not all published data is going to be useful.
I have so much to comment but so little time. RPKM is bad practice, one replicate is bad design. I don't have access to f1000 so I can't see the paper.
Thanks for your reply. The title of the f1000 paper is: RNA-seq analysis is easy as 1-2-3 with limma, Glimma and edgeR [version 3; peer review: 3 approved]
I know it is a bad design to have only 1 replicate. My point is to get some clues about the function of the genes I am studying.
Thanks again,
Cai
Can it be accessed outside of f1000? Can you try and share another link?
I fixed the link. Biostars has the habit of including the ) after URLs in parentheses as part of the URL.
OK. Appreciate a lot
This is about analyzing the dex I have tried. It needs 2 replicates.
I can send it to your email.
What I am asking for is an approach for RPKM/FPKM measurement and the subsequent quantification.
The software that generates read counts usually provides FPKM as well; RSEM, for instance, returns it. Again, FPKM shouldn't usually be used, especially for differential expression.
Thank you very much for your advice! I will figure out a method to dig out the data from published databases.
Yes, but, do not forget 2 key things here:
So, the data is 'not good'
OK. I managed to get 2 datasets from 2 different papers, each of which has 1 replicate. Now I can practice my analysis with CPM.
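For anyone practising along, plain CPM (counts per million) is simple to compute by hand. This is a minimal sketch in Python with made-up gene names and counts. It deliberately ignores normalization factors, which edgeR's own `cpm` function can additionally account for, and it says nothing about variability between samples.

```python
import math

def cpm(counts):
    """Counts per million: each count scaled by the library size."""
    library_size = sum(counts.values())
    return {gene: count / library_size * 1e6 for gene, count in counts.items()}

# Hypothetical counts for one sample, for illustration only.
sample_counts = {"geneA": 367, "geneB": 1200, "geneC": 98433}

cpm_values = cpm(sample_counts)
for gene, value in cpm_values.items():
    print(f"{gene}: {value:.1f} CPM")
```

CPM only corrects for sequencing depth. With one replicate per condition it still cannot tell you whether a difference between two samples is larger than the normal sample-to-sample variation.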