How to filter for samples that show good concordance across replicates in DESEQ2 count data from RNA-SEQ?
2
3
Entering edit mode
4.7 years ago
nancy ▴ 90

I have some RNAseq data - n=2, treated vs control. I have carried out alignment & quantification via Hisat2 > Stringtie > PrepDE.py > DESeq2

I would like to filter out D.E. genes that show high variance between the two replicates. But I don't seem to have an objective way of doing this.

In the screenshot as shown, I'd like to retain data that looks like the top panel, while filtering out data that looks like the bottom panel. As you can see, genes with very poor concordance between reps also show up with good p-vals and adjusted p-vals, and I was wondering why that is so.

Thanks

RNA-Seq DESeq2 replicates prepDE.py R • 2.7k views
1
Entering edit mode
1
Entering edit mode

n=2, treated vs control

With only two samples any comparison you make is entirely invalid and all time you spend is wasted. You have no means to figure out if the variance you observe is biologically relevant variance, biological noise or technical issues.

The absolute minimal number of replicates per group would be 3, but ideally, if budget permits, the quality of your data would be much better with 5 or more replicates per group.

0
Entering edit mode

Hi WouterDeCoster,

Thanks for showing me how to add figures to the main post. Very useful!

I quite disagree with as extreme a view as "any comparison with n=2 entirely invalid and all time you spend is wasted". Often, we do an exploratory n=2 to figure out things like most appropriate timepoint, dosage, etc. There are a couple 100s of genes that show excellent concordance with n=2 and significant D.E., (and indeed are biologically meaningful) and to me, that is a valid and useful takeaway from the n=2 experiment. Of course there is no doubt n>>3 will improve all measurements.

Yes, one cannot infer anything from the data points showing extreme variance when n=2, but my question was more about why, such data points are not being flagged in some way with a high p-value, but in fact, are assigned p-values as low as well replicated data points.

6
Entering edit mode
4.7 years ago

For why these genes get called differential you have to understand DESeq2's approach to estimating per-gene dispersion.

The raison d'etre of DESeq2 and edgeR is to try to find accurate estimates for the each gene's dispersion with only a small number of replicates. They do this by sharing information between genes on the understanding that genes of a similar expression level are, on average, going to have a similar dispersion. Thus they estimate average dispersion for all the genes and fit a smoothed function to the mean-expression vs dispersion estimate. This represents their expected dispersion. Then for each gene they "shrink" that genes dispersion estimate towards the expected dispersion using a baysian proceedure. This means that genes with a small dispersion will have their dispersion increased and are therefore less likely to be significant, and for genes with a large dispersion, like these ones here, will have their dispersion decreased, making it more likely they will be significant.

In their paper they Love et al state:

If, on the other hand, an individual gene’s dispersion is far above the distribution of the gene-wise dispersion estimates of other genes, then the shrinkage would lead to a greatly reduced final estimate of dispersion. We reasoned that in many cases, the reason for extraordinarily high dispersion of a gene is that it does not obey our modeling assumptions; some genes may show much higher variability than others for biological or technical reasons, even though they have the same average expression levels. In these cases, inference based on the shrunken dispersion estimates could lead to undesirable false positive calls. DESeq2 handles these cases by using the gene-wise estimate instead of the shrunken estimate when the former is more than 2 residual standard deviations above the curve.

My guess is that this safety-value is "failing" in your case. This is probably because there is a large standard deviation in the distances of the gene-wise dispersion estimates from the expected dispersion (due to only having two reps), thus the large dispersions seen here would be less than 2sd from the expected curve and the shruken value for dispersion is used.

If you want to avoid this, you could use DESeq (rather DESeq2) and set the sharingMode="maximum". Here the maximum of the trend and the per-gene dispersion estimate is used, which is a more conservative approach (DESeq was way more conservative than other methods of its time).

0
Entering edit mode

Thanks for your detailed and well-explained response, I actually understood it and it makes sense to me now. And thanks for cleaning up my original post too. I learned something there too!

3
Entering edit mode
4.7 years ago

Check out the DESeq2 manual for Cooks Distance here. Cooks Distance is not applied to datasets with less than three replicates, but you can look at Mike Love's code sample here for an idea of how to apply it to your dataset.

maxCooks <- apply(assays(dds)[["cooks"]], 1, max)


From the Manual:

The results function automatically flags genes which contain a Cook’s distance above a cutoff for samples which have 3 or more replicates. The p values and adjusted p values for these genes are set to NA. At least 3 replicates are required for flagging, as it is difficult to judge which sample might be an outlier with only 2 replicates. This filtering can be turned off with results(dds, cooksCutoff=FALSE)

0
Entering edit mode

Thanks Andrew, I will check this function. However, it still strikes me odd that a p-value as low as p<10-5 is ascribed to a D.E. event where the two replicates show such little concordance. Since a low p-value indicates the event did not occur by chance, it is a bit counter-intuitive in these cases.