Best practices for differential expression analysis with low-yield Nanopore/ONT direct cDNA data?
11 months ago
tw_140 • 0

I have some cDNA data comparing various mutants to a control sample (each with 3+ replicates) that were obtained using a modified version of the direct cDNA sequencing kit (SQK-DCS109) from ONT.

Because of the modified protocol, the yield of usable reads for downstream analysis is rather low. For example, the control replicates range from only 80k to 290k total reads. Of the 20k genes represented in my dataset (those with at least one read), ~14k have fewer than 20 reads in each sample.

We would like to do a DE analysis between the control and mutant samples; however, I’m not sure what the best practices are for datasets with such low total counts. In the past with Illumina data, I used edgeR. For this ONT dataset, I thought that changing the min.count parameter in the edgeR::filterByExpr function to min.count = 1 would be appropriate, but I’m not entirely sure.
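For context, here is a rough sketch of what changing min.count does. As I understand the edgeR documentation, filterByExpr converts min.count into a CPM cutoff at the median library size and also requires a minimum total count per gene (min.total.count, default 15 as far as I know). The Python below is a simplified re-implementation of that rule for a single gene, not edgeR's code. The library sizes, gene counts, and the min_samples value are made-up numbers in the range described above.

```python
from statistics import median


def cpm_cutoff(min_count, lib_sizes):
    """CPM threshold equivalent to `min_count` reads at the median library size."""
    return min_count / median(lib_sizes) * 1e6


def keep_gene(counts, lib_sizes, min_count=10, min_total_count=15, min_samples=3):
    """Simplified filterByExpr-style rule for one gene (an approximation).

    The gene is kept if its CPM reaches the cutoff in at least `min_samples`
    samples AND its summed raw count is at least `min_total_count`.
    """
    cutoff = cpm_cutoff(min_count, lib_sizes)
    cpms = [c / n * 1e6 for c, n in zip(counts, lib_sizes)]
    n_pass = sum(cpm >= cutoff for cpm in cpms)
    return n_pass >= min_samples and sum(counts) >= min_total_count


# Hypothetical library sizes spanning the 80k-290k range (median 135k).
lib_sizes = [80_000, 150_000, 290_000, 120_000, 200_000, 100_000]

# Hypothetical genes: one with modest counts, one with a single read per sample.
modest_gene = [5, 9, 20, 8, 12, 6]
single_read_gene = [1, 1, 1, 1, 1, 1]

for mc in (10, 1):
    print(f"min_count={mc}: cutoff = {cpm_cutoff(mc, lib_sizes):.1f} CPM, "
          f"modest kept = {keep_gene(modest_gene, lib_sizes, min_count=mc)}, "
          f"single-read kept = {keep_gene(single_read_gene, lib_sizes, min_count=mc)}")
```

With these hypothetical library sizes, min.count = 10 corresponds to about 74 CPM and min.count = 1 to about 7.4 CPM. In this sketch, a gene with a single read per sample still fails the total-count condition, so lowering min.count alone may not admit the lowest-count genes.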

I’m currently trying ONT’s epi2me-labs/wf-transcriptomes pipeline, but I’m not sure it is the best option for low-yield data. Does anyone have experience with, or references on, the best choices for DE analysis of low-yield ONT data?

Thank you!

differential-expression RNA-Seq ONT Nanopore • 852 views

I don't have a suggestion for low-yield DE analyses, but I am curious about the samples. Did you perform any ribo-depletion before sequencing? If not, what proportion of your reads are ribosomal?


Yes, I performed ribosomal RNA depletion using an RNase H-based method.


That's good, but I still suspect this data will be really difficult to analyse. Here are a couple of checks I would run to assess whether the data is usable.

  1. Look at the variation in known housekeeping genes to see whether you can move forward. You might find that some housekeeping genes, even ones that are usually highly expressed, aren't represented in your dataset at all.

  2. Do a PCA based on the transcript expression. Do the replicates cluster together?

With such low read counts and significant variation around those values, I would expect dropout of transcripts among samples.
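A minimal sketch of the first check, assuming the raw counts are available as a gene-to-counts mapping. The gene names, counts, and library sizes are placeholders rather than values from this dataset, and housekeeping_report is a hypothetical helper, not part of any package.

```python
from statistics import mean, pstdev


def housekeeping_report(counts, lib_sizes, genes):
    """For each gene, list the samples with zero reads and the CV of its CPM.

    `counts` maps gene name -> raw counts per sample (same order as `lib_sizes`).
    A gene absent from `counts` is reported as missing in every sample.
    """
    report = {}
    for gene in genes:
        row = counts.get(gene)
        if row is None:
            report[gene] = {"missing_in": list(range(len(lib_sizes))), "cv": None}
            continue
        cpms = [c / n * 1e6 for c, n in zip(row, lib_sizes)]
        missing = [i for i, c in enumerate(row) if c == 0]
        avg = mean(cpms)
        # Coefficient of variation across samples; undefined if the gene is all zeros.
        cv = pstdev(cpms) / avg if avg > 0 else None
        report[gene] = {"missing_in": missing, "cv": cv}
    return report


# Placeholder data: three samples and three candidate housekeeping genes.
lib_sizes = [80_000, 160_000, 240_000]
counts = {"ACTB": [80, 160, 240], "GAPDH": [0, 30, 45]}
report = housekeeping_report(counts, lib_sizes, ["ACTB", "GAPDH", "TBP"])

for gene, info in report.items():
    print(gene, info)
```

A gene with a non-empty "missing_in" list is a dropout candidate, and a high CV among genes expected to be stable suggests the counts are too noisy to normalize reliably.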


Thank you for the suggestions!

I checked some housekeeping genes, and they are represented in the dataset for all samples, albeit some at low counts (none exceeds 300 raw counts in any replicate).

I did PCA plots before and after my attempt at normalization. After the normalization, there is some clustering, but not the best:

[Image: PCA plots before and after normalization]

Do you think this data is usable for a DE analysis?


I would instead say that if a housekeeping gene has no detectable expression in a sample, then it is not represented. It's unclear from your answer how many samples lack detectable levels.

How much of the variance do the PCA axes explain? And if this is global expression, I would generally expect most samples to cluster loosely together, with replicates closer to each other, so to me some of your samples look okay in this regard.
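On the variance question: in R, the proportion per axis can be read from summary() of a prcomp fit. As a rough, dependency-free illustration of the same quantity, the sketch below estimates the fraction of variance on the leading components by power iteration on the sample-by-sample Gram matrix. It assumes log-scale expression values arranged as one list per sample; the function name and the toy matrix are hypothetical, and power iteration can converge slowly when two components have nearly equal variance.

```python
import math


def explained_variance(log_expr, n_components=2, n_iter=500):
    """Fraction of total variance carried by the leading principal components.

    `log_expr` is a list of samples, each a list of per-gene log-expression values.
    """
    n = len(log_expr)
    n_genes = len(log_expr[0])
    # Center each gene (column) across samples.
    col_means = [sum(row[j] for row in log_expr) / n for j in range(n_genes)]
    centered = [[row[j] - col_means[j] for j in range(n_genes)] for row in log_expr]
    # Sample-by-sample Gram matrix; its eigenvalues are proportional to PC variances.
    gram = [[sum(a * b for a, b in zip(r1, r2)) for r2 in centered] for r1 in centered]
    total = sum(gram[i][i] for i in range(n))
    if total == 0:
        return [0.0] * n_components

    fractions = []
    for _component in range(n_components):
        # Non-constant start vector (a constant vector lies in the null space
        # of a centered Gram matrix).
        vec = [float(i + 1) for i in range(n)]
        eigval = 0.0
        for _step in range(n_iter):
            nxt = [sum(g * v for g, v in zip(row, vec)) for row in gram]
            norm = math.sqrt(sum(x * x for x in nxt))
            if norm <= 1e-12 * total:
                eigval = 0.0  # nothing left to explain
                break
            vec = [x / norm for x in nxt]
            # Rayleigh quotient of the unit vector estimates the eigenvalue.
            eigval = sum(v * sum(g * w for g, w in zip(row, vec))
                         for v, row in zip(vec, gram))
        fractions.append(eigval / total)
        # Deflate: remove the component just found before looking for the next.
        gram = [[gram[i][j] - eigval * vec[i] * vec[j] for j in range(n)]
                for i in range(n)]
    return fractions


# Toy matrix: three samples, two genes, all variation along a single direction.
toy = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0]]
fractions = explained_variance(toy)
print([round(f, 3) for f in fractions])
```

If the first two axes together explain only a small share of the variance, apparent clustering (or the lack of it) on the plot should be read with caution.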

It's not for people on here to say whether your data is usable; you need to make that call. I think you need a better idea of how many dropouts you have. If you can get a rough count of the genes that should be there but aren't (e.g., housekeeping genes, other genes expected to be highly expressed), you can make that call. That said, I am skeptical that 80k reads is enough for a transcriptome-wide DE analysis.


Thank you, I really appreciate your insight! I will look into the dropouts I have.
