Hi all!
Here is my question; any input is much appreciated. The experiment has 3 groups, and each group has only one sample. The three groups are control, treated 1, and treated 2. Treated 1 and treated 2 have library sizes of 18 and 20 million reads, respectively. The control has a library size of 70 million reads. Library sizes are raw read counts. In addition, the control sample is from a different run. Sequencing is Illumina and the organism is human. I have the following questions:
a) Can I use a control from a different run?
b) Since the control has more than 3 times the reads of the experimental samples, how can I normalize the reads?
c) Does TMM/rlog take care of a huge discrepancy in read numbers?
d) Do I have to include batch in my model matrix and design (in edgeR)?
Any help is appreciated, thanks.
By "run", do you mean a different flow cell? From what I've seen, flow cell effects are typically minor. Did the control sample undergo the same library prep as the other samples?
Any of the established procedures will do. I suggest you quantify reads against a reference transcriptome with salmon, then aggregate the transcript-level abundance estimates it produces to the gene level with tximport. Normalization can be done e.g. with the RLE method from DESeq2 or TMM from edgeR. rlog itself is not used in DEG analysis. Check the manuals of the respective tools for details.
I think yes, as long as you sequenced deep enough to capture most of the genes that are relevant. If you do shallow sequencing, you will probably have quite a few dropouts, i.e. many genes missing with counts of zero. That might violate the assumptions of the normalization procedure. 18 million raw reads (given the library is OK) should probably be fine.
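To make the normalization idea concrete, here is a minimal Python sketch of two common scalings: counts per million and median-of-ratios size factors in the spirit of the RLE method. It is not the DESeq2 or edgeR implementation; the function names and the toy count matrix are hypothetical, and a real analysis should use the packages themselves.

```python
import numpy as np

def cpm(counts):
    """Counts per million: divide each sample (column) by its library size."""
    counts = np.asarray(counts, dtype=float)
    lib_sizes = counts.sum(axis=0)
    return counts / lib_sizes * 1e6

def rle_size_factors(counts):
    """Median-of-ratios size factors, in the spirit of the RLE method.

    Only genes with non-zero counts in every sample are used, because the
    geometric mean is computed on the log scale.
    """
    counts = np.asarray(counts, dtype=float)
    expressed = (counts > 0).all(axis=1)
    log_counts = np.log(counts[expressed])
    log_geo_mean = log_counts.mean(axis=1, keepdims=True)
    return np.exp(np.median(log_counts - log_geo_mean, axis=0))

# Hypothetical toy matrix: rows are genes, columns are
# control, treated 1, treated 2 (control sequenced about 3.5x deeper).
toy_counts = np.array([
    [700, 180, 200],
    [350,  90, 100],
    [ 70,  18,  20],
    [  0,   5,   3],   # dropout in one sample: ignored by rle_size_factors
    [1400, 360, 400],
])

size_factors = rle_size_factors(toy_counts)
normalized = toy_counts / size_factors
per_million = cpm(toy_counts)
```

In this toy case the depth difference is absorbed entirely by the size factors, which is why a large difference in library size is not a problem in itself. Note how genes with a zero in any sample drop out of the size factor calculation, which is where very shallow sequencing can start to hurt.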
You do not have replicates, right? Which comparisons do you want to make?
a) From a different run (experiment), not from a different lane. It is also a different experiment conducted on a different date (i.e. the control was sequenced on day 1 and the samples on day 62). The libraries were made on different days, not on the same day.
b) Since there are no replicates per group, I would use edgeR with an assumed BCV value for eukaryotes.
c) Thanks for clarifying the point.
d) No replicates in any group. The client wants expt 1 vs control, expt 2 vs control, and expt 1 vs expt 2 (with a single sample in each group).
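As a rough illustration of what this design implies for question d), the sketch below builds a hypothetical design matrix for the three samples and checks its rank with NumPy. The column layout is an assumption made for illustration, not edgeR output.

```python
import numpy as np

# Hypothetical design for the three samples: control, treated 1, treated 2.
# Columns: intercept, treated1, treated2, batch (1 = the earlier control run).
design = np.array([
    [1, 0, 0, 1],  # control, sequenced in the first run
    [1, 1, 0, 0],  # treated 1, second run
    [1, 0, 1, 0],  # treated 2, second run
])

rank = np.linalg.matrix_rank(design)
n_coefficients = design.shape[1]

# The batch column is an exact linear combination of the other columns,
# so a batch coefficient cannot be estimated separately from the groups.
batch_is_redundant = np.array_equal(
    design[:, 3], design[:, 0] - design[:, 1] - design[:, 2]
)
batch_is_estimable = rank == n_coefficients

# With one sample per group there are no residual degrees of freedom left
# to estimate dispersion from the data.
residual_df = design.shape[0] - rank
```

Because the control is the only sample in its run, batch and the control group are perfectly confounded, so adding a batch term to the model does not separate the two effects.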
Well, the analysis is going to be exploratory at best. The BCV is highly sample-specific, so choosing any value without replicates is arbitrary. Your customer designed a poor experiment. But if you are going to use a BCV, simply follow the edgeR manual. Batch effects can have a strong impact, but again, without replicates there is nothing you can do about them. Tell your customer to 1) not overly trust the results, 2) design better experiments in the future to get meaningful results, and 3) most importantly, validate the genes they will use downstream with an independent method such as qPCR. That is at least what I would do.
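To show why the choice of BCV matters so much, here is a small Python sketch of the negative binomial variance relationship, where the dispersion is the squared BCV. The helper name and the BCV values are illustrative assumptions; none of them can be verified without replicates.

```python
import math

def nb_sd(mean_count, bcv):
    """Standard deviation of a negative binomial count with dispersion bcv**2."""
    dispersion = bcv ** 2
    return math.sqrt(mean_count + dispersion * mean_count ** 2)

# Illustrative BCV values only; the data cannot tell you which one is right.
for bcv in (0.1, 0.2, 0.4):
    sd = nb_sd(1000, bcv)
    print(f"BCV={bcv}: sd of a gene with mean count 1000 is about {sd:.0f}")
```

The assumed spread of a gene's counts grows roughly in proportion to the BCV, so the list of "significant" genes depends directly on a number that was picked rather than estimated.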