I have an RNAseq dataset where some of the samples were run twice, and my counts for those samples are the result of combining a first and second run's output. Since not all of these samples had a second run, I want to set up the second run's lane as a variable in my design. The way I have done this is to create a variable categorical variable "secondrun" that has "1" or "NA" and a conditional variable 'lanesecondrun' that nests within secondrun. As I understand it this should mean lanesecondrun is only considered when secondrun exists (like the commonly given example of a model where "age of spouse" is only considered when the variable "spouse" is turned on). Unfortunately this is giving me an error in the 'estimateDisp() step of the edgerR pipeline.
design <- model.matrix(~0 + Disease + Batch + secondrun + (secondrun:lanesecondrun) + RIN + Sex + PMI + Smoking + Ethanol, bigdemos) dge.estDisp.out <- estimateDisp(dge.tmm.normfact.out, design) Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05, : nrow(design) disagrees with ncol(y)
Looking at my design matrix, I see that 'secondrun' has been assigned a 1 for all 300 rows, despite the fact that some of the 300 subjects had "NA" assigned to that column of the demos file. I also only see two of the three levels from 'lanesecondrun' represented in the design matrix (the first non NA value in the column from the demos file, F6, has no column in the design matrix.) This is my first time setting up a nested design.
I feel like I'm missing something obvious here. Can anyone tell me what I have done wrong?