Question: DESeq2 converting numeric to factor for unknown reason - SOLVED
17 months ago, benjamin.aaron.taylor wrote:

I'm attempting to perform differential expression analysis using DESeq2 on a dataset of ~90 samples and ~20000 transcripts. I have a number of variables I'd like to test against, some of which are qualitative (factors) and some of which are quantitative (numeric). A sample of my data is below:

> head(data.phenotype.test)
# A tibble: 6 x 7
  Novogene_ID BORIS_ID     treatment queenness ovaries   elo sampletime
  <fct>       <fct>        <fct>         <dbl>   <dbl> <dbl> <fct>     
1 W03BLGR     003W-blu-grn Qr12        0.0202    124.  1000  12        
2 W03GR       003W-grn     Qr12        0.00210    66.8  923. 12        
3 W03SL       003W-sil     Qr12        0.969     316.  1077. 12        
4 W04BLOR     004W-blu-ora Qr12        0.00168    66.4  876. 12        
5 W04GRSL     004W-grn-sil Qr12        0.175     210.   931. 12        
6 W04YLWH     004W-yel-whi Qr12        0.984     302.  1192. 12

Now, let's say I want to test which transcripts change in expression with a numeric variable such as 'ovaries':

dds.transcript.ovaries = DESeqDataSetFromMatrix(countData = data.transcript.count.ovaries,
                                        colData = as.matrix(data.phenotype.ovaries),
                                        design = ~ ovaries)

This should work, but DESeq2 does something strange: it converts 'ovaries' to a factor, with the message "some variables in design formula are characters, converting to factors". But there are no character values in the supplied data frame, and certainly not in the column 'ovaries'! This conversion is a big problem for me because, having turned a numeric column into a factor, DESeq2 unsurprisingly reports that differential expression analysis cannot be performed because there are as many factor levels as samples.

I have found a temporary workaround: if I reduce the data frame to just the 'ovaries' column, DESeq2 no longer converts the numeric data to factor levels and I'm able to perform differential expression analysis as normal. The problem is that I'd eventually like to perform multivariate analyses, e.g. with a design like:

~ ovaries + elo + treatment

However, doing so isn't going to be possible if DESeq2 keeps converting the numeric columns to factors, which it persistently does unless the numeric columns are supplied alone (which of course they can't be for a multivariate design).

What's going on here? Why does DESeq2 convert the numeric columns to factors, seemingly having misinterpreted those columns as characters?

NB: I have replicated this issue with different data on a different machine and the result is the same, so I don't believe this is an issue with my R setup.

Edit: Solved! Over at stack exchange: https://bioinformatics.stackexchange.com/questions/8808/why-does-deseq2-convert-numeric-columns-to-factor-during-differential-expression
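
For readers who don't follow the link, one explanation consistent with the call above is that as.matrix() on a data frame with mixed column types returns a character matrix, so every column, including 'ovaries', reaches DESeq2 as text. It also fits the workaround: a data frame holding only numeric columns converts to a numeric matrix. A minimal base-R sketch on made-up data (`pheno` is a hypothetical stand-in for the real phenotype table, not the poster's data):

```r
# Hypothetical stand-in for the phenotype table: one factor, two numeric columns.
pheno <- data.frame(
  treatment = factor(c("Qr12", "Qr12", "Qc")),
  ovaries   = c(124.1, 66.8, 316.1),
  elo       = c(1000, 923, 1077)
)

# A matrix holds a single type, so mixing factors and numbers forces character.
mixed <- as.matrix(pheno)
typeof(mixed)                      # "character"

# With only numeric columns the matrix stays numeric, matching the workaround.
numeric_only <- as.matrix(pheno[, c("ovaries", "elo")])
typeof(numeric_only)               # "double"

# The column classes survive if the data frame itself is kept.
sapply(pheno, class)
```

Under this reading, passing the data frame itself as colData (for a tibble, after as.data.frame()) instead of wrapping it in as.matrix() should leave 'ovaries' numeric in a design such as ~ ovaries + elo + treatment.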

17 months ago, Carlo Yague (Canada) wrote:

"some variables in design formula are characters, converting to factors". But there are no character values in the supplied data frame, and certainly not in the column 'ovaries'!

This is probably the key to your problem: there must be hidden character values in 'ovaries' that you are not aware of.

What is the output of str(data.phenotype.test)?

Have you tried forcing the conversion to numeric?

data.phenotype.test$ovaries = as.numeric(data.phenotype.test$ovaries)
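
A quick way to look for hidden non-numeric values is sketched below in base R on made-up data (`x` and `f` are hypothetical, not taken from the thread): entries that cannot be parsed become NA under as.numeric(). Note also that calling as.numeric() directly on a factor returns the internal level codes, so a factor should go through as.character() first.

```r
# Hypothetical column read in as text, with one stray non-numeric entry.
x <- c("124.1", "66.8", "n/a", "316.1")

# Entries that cannot be parsed become NA (R would normally warn here).
parsed <- suppressWarnings(as.numeric(x))
bad <- x[is.na(parsed) & !is.na(x)]
bad                                # "n/a"

# Pitfall: as.numeric() on a factor gives level codes, not the printed values.
f <- factor(c("66.8", "124.1"))
as.numeric(f)                      # 2 1
as.numeric(as.character(f))        # 66.8 124.1
```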


Hi Carlo, thanks for your reply.

I've tried forcing the conversion to numeric but it hasn't made a difference. Here's the structure of the df:

> str(data.phenotype.test)
Classes ‘tbl_df’, ‘tbl’ and 'data.frame':   64 obs. of  7 variables:
 $ Novogene_ID: Factor w/ 64 levels "W03BLGR","W03GR",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ BORIS_ID   : Factor w/ 64 levels "003W-blu-grn",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ treatment  : Factor w/ 4 levels "Qc","Qr12","Qr3",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ queenness  : num  0.02017 0.0021 0.96853 0.00168 0.17529 ...
 $ ovaries    : num  124.1 66.8 316.1 66.4 209.9 ...
 $ elo        : num  1000 923 1077 876 931 ...
 $ sampletime : Factor w/ 3 levels "0","3","12": 3 3 3 3 3 3 3 3 3 3 ...
 - attr(*, "na.action")= 'omit' Named int  1 2 3 4 5 6 7 8 9 10 ...
  ..- attr(*, "names")= chr  "1" "2" "3" "4" ...

Reply written 17 months ago by benjamin.aaron.taylor