I'm attempting to perform differential expression analysis using DESeq2 on a dataset of ~90 samples and ~20000 transcripts. I have a number of variables I'd like to test against, some of which are qualitative (factors) and some of which are quantitative (numeric). A sample of my data is below:
> head(data.phenotype.test)
# A tibble: 6 x 7
  Novogene_ID BORIS_ID     treatment queenness ovaries   elo sampletime
  <fct>       <fct>        <fct>         <dbl>   <dbl> <dbl> <fct>
1 W03BLGR     003W-blu-grn Qr12        0.0202    124.  1000  12
2 W03GR       003W-grn     Qr12        0.00210    66.8  923. 12
3 W03SL       003W-sil     Qr12        0.969     316.  1077. 12
4 W04BLOR     004W-blu-ora Qr12        0.00168    66.4  876. 12
5 W04GRSL     004W-grn-sil Qr12        0.175     210.   931. 12
6 W04YLWH     004W-yel-whi Qr12        0.984     302.  1192. 12
Now, let's say I want to test which transcripts are differentially expressed together with a numeric variable such as 'ovaries':
dds.transcript.ovaries = DESeqDataSetFromMatrix(
  countData = data.transcript.count.ovaries,
  colData = as.matrix(data.phenotype.ovaries),
  design = ~ ovaries)
This should work, but DESeq2 does something strange: it converts 'ovaries' to a factor, with the message "some variables in design formula are characters, converting to factors". But there are no character values in the supplied data frame, and certainly not in the column 'ovaries'! This conversion is a big problem for me because, having converted a numeric column to a factor, DESeq2 unsurprisingly reports that differential expression analysis cannot be performed because there are as many factor levels as samples.
I have found a temporary workaround: if I reduce the data frame to just the 'ovaries' column, DESeq2 no longer converts the numeric data to factor levels and I'm able to perform differential expression analysis as normal. The problem is that I'd eventually like to perform multivariate analyses, e.g. with a design like:
~ ovaries + elo + treatment
However, doing so isn't going to be possible if DESeq2 keeps converting the numeric columns to factors, which it persistently does unless the numeric columns are supplied alone (which of course they can't be for a multivariate design).
What's going on here? Why does DESeq2 convert the numeric columns to factors, seemingly having misinterpreted those columns as characters?
NB: I have replicated this issue with different data on a different machine and the result is the same, so I don't believe this is an issue with my R setup.
Edit: Solved! Over at stack exchange: https://bioinformatics.stackexchange.com/questions/8808/why-does-deseq2-convert-numeric-columns-to-factor-during-differential-expression
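For anyone landing here, the likely cause is the `as.matrix()` call wrapped around `colData` in the code above, not DESeq2 itself. A matrix can hold only one type, so base R's `as.matrix()` turns a data frame that mixes factor and numeric columns into a character matrix. That would explain both the "characters" message and why the single-column workaround behaves differently. Here is a minimal base-R sketch of that behaviour; the `pheno` data frame and its values are a made-up stand-in for my real phenotype table:

```r
# Hypothetical toy stand-in for the phenotype table (values are illustrative).
pheno <- data.frame(
  treatment = factor(c("Qr12", "Qr12", "Ql12")),
  ovaries   = c(124.0, 66.8, 316.0),
  elo       = c(1000, 923, 1077)
)

# Mixed factor/numeric columns: as.matrix() coerces everything to character.
m <- as.matrix(pheno)
print(is.character(m))                # TRUE
print(class(m[, "ovaries"]))          # "character"

# Numeric column on its own: the matrix stays numeric, which matches the
# single-column workaround described above.
m_num <- as.matrix(pheno[, "ovaries", drop = FALSE])
print(is.numeric(m_num))              # TRUE

# Leaving the table as a data frame keeps each column's original type.
col_classes <- sapply(pheno, class)
print(col_classes)
```

If that is indeed what is happening, dropping `as.matrix()` and passing the data frame itself as `colData` (for a tibble, possibly via `as.data.frame()`) should keep 'ovaries' and 'elo' numeric in a multivariate design. I have only sketched this on toy data, so treat it as an assumption to check against your own table.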