Authors have managed to publish many bad RNA-seq experiments in the past. In the very early days this was at least partly due to a lack of knowledge in the field, but even once the issues were understood, papers were often reviewed by people without this specialist understanding.
The need to do things more than once (replication) is one of the very first things we learn about science in school. It is no different because we are using big fancy technology. Knowing that I got 367 reads for a gene when I took a sample from condition A and 472 when I took a sample from condition B is of little use if I don't know how much samples from the same condition vary from sample to sample. Now, it is true that if I sequenced the same library made from the same sample many times, I'd find that the variance of a gene with an average of 367 reads was about 367 (that is, Poisson counting noise). But we are interested in how much samples vary from each other, not in how repeated measurements of the same sample vary.
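To make the difference between the two kinds of variation concrete, here is a small simulation sketch. It is only an illustration under stated assumptions: re-sequencing the same library is modelled as Poisson sampling, and sample-to-sample (biological) variation is modelled by letting the true expression level follow a gamma distribution, which gives negative binomial counts. The dispersion value of 0.1 and all function names are made up for the example and do not come from any real dataset or package.

```python
import random
import statistics


def poisson(lam, rng):
    """Draw one Poisson(lam) count by counting rate-1 exponential
    arrivals that occur before time lam."""
    t, k = 0.0, 0
    while True:
        t += rng.expovariate(1.0)
        if t > lam:
            return k
        k += 1


def technical_replicates(mean, n, rng):
    """Same library sequenced n times: pure counting noise."""
    return [poisson(mean, rng) for _ in range(n)]


def biological_replicates(mean, dispersion, n, rng):
    """n different samples from one condition (ASSUMED model): the true
    level varies between samples (gamma), then counting noise is added."""
    shape = 1.0 / dispersion
    scale = mean * dispersion  # gamma mean = shape * scale = mean
    return [poisson(rng.gammavariate(shape, scale), rng) for _ in range(n)]


def variance_to_mean(counts):
    """Ratio of sample variance to sample mean; about 1 for Poisson data."""
    return statistics.variance(counts) / statistics.mean(counts)


rng = random.Random(1)
tech = technical_replicates(367, 1000, rng)
bio = biological_replicates(367, 0.1, 1000, rng)  # dispersion is illustrative
print("technical  variance/mean:", round(variance_to_mean(tech), 2))
print("biological variance/mean:", round(variance_to_mean(bio), 2))
```

Under these assumptions the technical replicates have a variance close to their mean, while the biological replicates are far more variable. A single sample per condition tells you nothing about that second, larger source of variation, so a difference like 367 versus 472 cannot be judged.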
The defects with RPKM are more subtle and arise from the fact that
- The sum of RPKM across all genes in a sample is not constant from sample to sample
- Even if it were, that would make RPKM a compositional measure: that is, the RPKM of one gene is affected by the RPKM of other genes
- We only approximately know the distribution of RPKM. In the early days, many argued that log RPKM was approximately normally distributed, but for a long time people argued about how approximate that approximation was. We now have a much better understanding of how counts are distributed (they are well modelled by a negative binomial), so working from counts is a much better option for differential expression (although not necessarily for other analyses).
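The first two points can be illustrated with a toy calculation. This is a sketch with invented numbers, not real data: three hypothetical genes, expected read counts taken as proportional to abundance times transcript length at a fixed sequencing depth, and the usual definition RPKM = reads × 10^9 / (length in bp × total mapped reads). The helper names `expected_reads` and `rpkm` are made up for the example.

```python
def expected_reads(abundance, lengths_bp, depth):
    """Expected reads per gene at a fixed depth, ASSUMING reads are
    proportional to (number of transcripts) x (transcript length)."""
    weights = {g: abundance[g] * lengths_bp[g] for g in abundance}
    total = sum(weights.values())
    return {g: depth * w / total for g, w in weights.items()}


def rpkm(reads, lengths_bp):
    """Reads per kilobase of transcript per million mapped reads."""
    total = sum(reads.values())
    return {g: reads[g] * 1e9 / (lengths_bp[g] * total) for g in reads}


# Invented example: only gene C changes between the two conditions.
lengths_bp = {"A": 1000, "B": 2000, "C": 4000}
abundance_1 = {"A": 10, "B": 10, "C": 10}
abundance_2 = {"A": 10, "B": 10, "C": 40}
depth = 1_000_000  # same number of reads sequenced in both samples

rpkm_1 = rpkm(expected_reads(abundance_1, lengths_bp, depth), lengths_bp)
rpkm_2 = rpkm(expected_reads(abundance_2, lengths_bp, depth), lengths_bp)

for gene in lengths_bp:
    print(gene, round(rpkm_1[gene]), round(rpkm_2[gene]))
print("sum", round(sum(rpkm_1.values())), round(sum(rpkm_2.values())))
```

In this toy setup the RPKM of genes A and B falls in the second condition even though their abundance has not changed, simply because gene C takes a larger share of the reads, and the RPKM totals of the two samples differ.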
edgeR will agree to process single-replicate data, even though the results will not mean much (and recent versions will warn you about this), but it will only do so from count data, not from RPKM. So either you are misunderstanding the methods in the paper you are referencing, or (more likely) the authors have not recorded what they did properly. This is one reason why many people are pressing for the release of analysis code instead of a written description of what was done.
There may be ways to get some meaning from RPKM data, but it would not be using standard software, or at least not in the way it was intended, and the results would still not be fully robust. I know of no software designed to analyse RPKM data for differential expression.
All this is a long-winded way of saying that a lot of publicly available RNA-seq data is not good data and you shouldn't waste your time on it. Not all science that got published should have been, and not all science that was published some time ago would get published now. Accept that not all published data is going to be useful.