I have some cDNA data comparing several mutants to a control sample (each with 3+ replicates), obtained using a modified version of ONT's direct cDNA sequencing kit (SQK-DCS109).
Due to the modified protocol, we've had rather low yields of reads usable in downstream analyses. For example, the control replicates range from only 80k to 290k total reads, and of the 20k genes represented in my dataset (at least one read identified per gene), ~14k have fewer than 20 reads in every sample.
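For context, here's a minimal sketch of how I'm tallying those numbers; `counts` is just a placeholder name for my gene-by-sample matrix of raw counts, not the actual object from my pipeline:

```r
# Placeholder: `counts` is a gene-by-sample matrix of raw read counts.
lib_sizes <- colSums(counts)          # total reads per sample
summary(lib_sizes)                    # control replicates fall around 80k-290k

# Genes with fewer than 20 reads in every sample (~14k in my data)
low_in_all <- rowSums(counts >= 20) == 0
sum(low_in_all)
```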
We would like to do a DE analysis between the control and mutant samples; however, I'm not sure what the best practices are for datasets with such low total counts. In the past, with Illumina data, I used edgeR. For this ONT dataset, I thought that lowering min.count in edgeR::filterByExpr to 1 would be appropriate, but I'm not entirely sure.
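For reference, this is roughly the filtering step I have in mind; `counts` and `group` are placeholder objects, and min.count = 1 is the relaxed threshold I'm asking about, not something I'm confident is correct:

```r
library(edgeR)

# Placeholders: `counts` is a gene-by-sample matrix of raw counts,
# `group` a factor giving the condition of each sample (control vs. mutant).
y <- DGEList(counts = counts, group = group)

keep_default <- filterByExpr(y, group = group)                 # min.count = 10 (default)
keep_relaxed <- filterByExpr(y, group = group, min.count = 1)  # the relaxed threshold
table(default = keep_default, relaxed = keep_relaxed)          # how many genes each keeps

y <- y[keep_relaxed, , keep.lib.sizes = FALSE]
y <- calcNormFactors(y)   # TMM normalization before the usual edgeR DE steps
```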
I'm currently trying ONT's epi2me-labs/wf-transcriptomes pipeline, but I'm not sure whether it's the best option for low-yield data. Does anyone have experience with, or references for, DE analysis of low-yield ONT data?
Thank you!
I don't have a suggestion for low yield DE analyses, but I am curious about the samples. Did you conduct any ribo-depletion steps prior to sequencing? If not, what proportion of your reads are ribosomal?
Yes, I performed ribodepletion using an RNase H-based method.
That's good. But I still suspect this data might be really difficult to analyse. Here are a few checks I would do to assess whether the data are usable.
Look at the variation in known housekeeping genes to see whether you can move forward. You might find that some housekeeping genes aren't represented in your dataset at all, even ones that are usually highly expressed.
Do a PCA based on transcript expression: do replicates cluster together? (A quick sketch of what I mean is below.)
With such low read counts, and considerable variation around those values, I would expect transcripts to drop out in some samples.
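For the PCA, something along these lines is what I mean (a rough sketch only; it assumes a DGEList `y` built from your raw counts with a `group` factor, and those are placeholder names):

```r
library(edgeR)

# Placeholder: `y` is a DGEList of raw counts with y$samples$group set.
y <- calcNormFactors(y)                        # TMM normalization
logcpm <- cpm(y, log = TRUE, prior.count = 2)  # moderated log2-CPM values

pca <- prcomp(t(logcpm))
round(100 * summary(pca)$importance["Proportion of Variance", 1:2], 1)  # % variance on PC1/PC2
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(y$samples$group), pch = 19,
     xlab = "PC1", ylab = "PC2")

# A quick alternative is the MDS plot from limma/edgeR:
plotMDS(y, col = as.integer(y$samples$group))
```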
Thank you for the suggestions!
I checked some housekeeping genes and they are represented in all samples (albeit some at low counts; none exceed 300 raw counts in any replicate).
I did PCA plots before and after my attempt at normalization. After normalization there is some clustering, but it's not great.
Do you think this data is usable for a DE analysis?
I would instead say that if a housekeeping gene has no detectable expression in a sample, then it isn't represented in that sample. From your answer, it's unclear how many samples lack detectable levels.
How much variation do the PCA axes represent? And if this is global expression, I would generally expect most samples to cluster somewhat together, with replicates closer to each other, so to me some of your samples look okay in that regard.
It's not for people on here to say whether your data are usable; you need to make that call. I think you need to get a better idea of the number of dropouts you have. If you can get a rough estimate of how many genes that should be there aren't (i.e., housekeeping genes, expected highly expressed genes, etc.), you can make that call. That said, I'm skeptical that 80k reads is enough for a transcriptome-wide DE analysis.
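As a rough way of counting dropouts, something like this would do (sketch only; `counts` is your raw gene-by-sample matrix and `expected_genes` is a hypothetical vector of gene IDs you expect to be expressed, e.g. housekeeping genes):

```r
# Placeholders: `counts` = raw gene-by-sample count matrix,
# `expected_genes` = character vector of gene IDs expected in every sample.
present <- intersect(expected_genes, rownames(counts))

# Per sample, how many of those expected genes have zero reads?
dropouts_per_sample <- colSums(counts[present, , drop = FALSE] == 0)
dropouts_per_sample

# For the bigger picture: total genes undetected in each sample.
colSums(counts == 0)
```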
Thank you, I really appreciate your insight! I will look into the dropouts I have.