Bulk RNA-seq time course data with unknown sample sizes?
6 weeks ago
Dunois ★ 2.0k

Hi all,

I have an RNA-seq time course data set from a eukaryotic plankton species. For each time point, a single "replicate" was sequenced, which consisted of pooled RNA from about 20-40 individuals sacrificed together.

All I'm really looking to do at this juncture is see whether any of the genes show a circadian expression pattern over the sampling period. I'll probably be using MetaCycle to perform the analysis.

But now I'm wondering if the data needs to be normalized to account for the fact that the number of individual organisms that contributed RNA at each time point is different (and also unknown). Most publications I've looked at simply talk about biological replicates, and advise using "sufficient" replicates (which isn't really helpful here).

I would assume some kind of normalization/correction would be necessary here to minimize the false positive/false negative rate.

For instance, take the hypothetical scenario below for a toy transcript:

                             time_1         time_2
num_individuals                  10             20
expression_per_individual       500            400
total_number_of_transcripts    5000           8000


time_2 will incorrectly be called over-expressed relative to time_1 if the data isn't corrected for the difference in the number of individuals at the two time points.

So my questions are:

• Does this need to be accounted for at all?
• Would this basically translate to variation in sequencing depth?
• Is this addressable by the standard normalization techniques like TMM from edgeR?
• If not, how (if at all possible) can I account for this?
• Are there any other pitfalls and/or caveats I should keep in mind for time course RNA-seq analysis?
• Are there any important publications I should take a look at?

I apologize if this is a trivial question, but I would rather gather some expert advice (statistics is not my forte) than go ahead blind and end up doing bad science.

Your input is much appreciated.


You have one pool of RNA per time point, which makes it n=1. The readout is the average over the sacrificed population. I see no credible way of deconvoluting your pools. Standard normalization, e.g. with edgeR, will do here. How many time points (TPs) do you have? With n=1 per TP you need quite a few TPs to have the power to call oscillation.
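To illustrate why pool size behaves like sequencing depth, here is a minimal Python sketch. It uses plain counts-per-million (CPM) scaling as a simplified stand-in for TMM, so it is not edgeR's implementation. The gene names and per-individual numbers are made up for illustration. Like TMM, the reasoning assumes that most genes do not change between time points.

```python
def cpm(counts):
    """Scale raw counts to counts per million of the library total."""
    total = sum(counts.values())
    return {gene: 1e6 * n / total for gene, n in counts.items()}


# Hypothetical transcripts per individual (not taken from real data).
per_individual = {"geneA": 500, "geneB": 1500, "geneC": 3000}

# Two pools with identical per-individual expression but different pool sizes.
pool_10 = {g: 10 * x for g, x in per_individual.items()}
pool_20 = {g: 20 * x for g, x in per_individual.items()}

# Same 20-animal pool, but geneA really drops from 500 to 400 per individual,
# as in the toy table in the question.
pool_20_changed = dict(pool_20, geneA=20 * 400)

cpm_10 = cpm(pool_10)
cpm_20 = cpm(pool_20)
cpm_20_changed = cpm(pool_20_changed)

for gene in per_individual:
    print(gene, round(cpm_10[gene]), round(cpm_20[gene]), round(cpm_20_changed[gene]))
```

After scaling, the 10- and 20-animal pools give the same values when per-individual expression is unchanged, and geneA shows up as lower (not higher) in the pool where it truly dropped. Only relative abundance survives the scaling, which is all that bulk RNA-seq measures in any case.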


Hi ATpoint. Thanks for the feedback. I have 13 time points, sampled every 4 hours over a 14-hour time period. I have some additional data adding up to about 28 time points in total, but these were not sampled at regular intervals (they were sampled in between the 4-hour time points, and were sequenced to generate longer reads).


I personally use the normalized counts from edgeR for circadian analysis. Alternatively, if you fit your cosinor model with something like limma-trend, use the log counts (log-CPM) from edgeR.
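For anyone unfamiliar with the cosinor model, a minimal sketch of a single-gene fit in plain Python follows. It fits y = mesor + a*cos(wt) + b*sin(wt) by ordinary least squares with a fixed 24-hour period. This is only an illustration of the model, not what limma or MetaCycle do internally. The function names, the sampling times, and the synthetic values are all assumptions made for the example.

```python
import math


def solve_linear(a, b):
    """Solve a small linear system by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_cosinor(times, values, period=24.0):
    """Least-squares fit of y = mesor + a*cos(w*t) + b*sin(w*t)."""
    w = 2 * math.pi / period
    rows = [(1.0, math.cos(w * t), math.sin(w * t)) for t in times]
    # Normal equations (X^T X) beta = X^T y for the three coefficients.
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * y for r, y in zip(rows, values)) for i in range(3)]
    mesor, a, b = solve_linear(xtx, xty)
    amplitude = math.hypot(a, b)
    peak_time = (math.atan2(b, a) / w) % period  # time of peak within one cycle
    return mesor, amplitude, peak_time


# Hypothetical design: 13 samples, one every 4 hours.
times = [4 * i for i in range(13)]
# Synthetic, noise-free log expression peaking 8 hours into the cycle.
values = [5 + 2 * math.cos(2 * math.pi * (t - 8) / 24) for t in times]

mesor, amplitude, peak_time = fit_cosinor(times, values)
print(mesor, amplitude, peak_time)
```

With real data you would run this per gene on the log-CPM values and then test whether the amplitude differs from zero, which is the part that dedicated tools handle properly, including multiple-testing correction.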


I guess I'll just go with TMM from edgeR then. Thanks a lot!!