edgeR design matrix error not of full rank
2
0
Entering edit mode
27 days ago
intropin • 0

Hello,

I am currently having an error when I am using the estimateDisp function on edgeR, my pheno table and design matrix looks like this:

       GROUP   AGE  GENDER
1  no_woman  26.00000  woman
2    no_man  22.18962    man
2 yes_woman  27.86650  woman
3  no_woman  21.55446  woman
4  no_woman  24.60362  woman
5  no_woman  26.47429  woman
6  no_woman  24.00000  woman
7  no_woman  29.08834  woman
8  no_woman  20.85291  woman
9   yes_man  26.28884    man
10    no_man 30.54005    man
11    no_man 25.92953    man
[...]


And that is my matrix:

Group <- y_transcripts$samples$group
Age <- y_transcripts$samples$AGE
Gender <- y_transcripts$samples$GENDER
design <- model.matrix(~0+Group+Gender+Age)
Contrasts <- makeContrasts((Groupyes_woman-Groupno_woman)-(Groupyes_man-Groupno_man),levels=colnames(design))

design
Groupno_man Groupno_woman Groupyes_man Groupyes_woman Genderwoman      Age
1            0             1            0              0           1 26.00000
2            1             0            0              0           0 22.18962
3            0             0            0              1           1 27.86650
4            0             1            0              0           1 21.55446
5            0             1            0              0           1 24.60362
6            0             1            0              0           1 26.47429
7            0             1            0              0           1 24.00000
8            0             1            0              0           1 29.08834
9            0             1            0              0           1 20.85291
10           0             0            1              0           0 26.28884
11           1             0            0              0           0 30.54005
[....]
attr(,"assign")
[1] 1 1 1 1 2 3
attr(,"contrasts")
attr(,"contrasts")$Group [1] "contr.treatment" attr(,"contrasts")$Gender
[1] "contr.treatment"


Afterwards I filtered by minimum expression using defaults and normalized the library size

 keep <- filterByExpr(y_transcripts, group=Group)
y_transcripts <- y_transcripts[keep, , keep.lib.sizes=FALSE]

y_transcripts <- calcNormFactors(y_transcripts)


I got an error when using estimateDisp

y_transcripts <- estimateDisp(y_transcripts, design, robust=TRUE)

>     Error in glmFit.default(sely, design, offset = seloffset, dispersion = 0.05,  :
>       Design matrix not of full rank.  The following coefficients not estimable:
>      Genderwoman


When I exclude Gender from the design matrix it works, I don't know what is my mistake.

edgeR contrastmatrix design glmfit • 734 views
1
Entering edit mode

I might be wrong but think this comes from variable Age being numerical and not categorial, so that when you build your design matrix, it doesn't have all possible combination of factors, that is what "not of full rank" is trying to tell you. For example you have a combination: no_woman 26.00000 woman (whatever that is supposed to mean) but you don't have e.g. no_woman 26.00000 man or yes_woman 26.00000 woman. Of course you couldn't, because each age value is float and unique. You could try to leave out Age from the formula like design <- model.matrix(~Group+Gender) and it might work, I am not sure if and how edgeR can deal with non-categorial variables, so you could investigate that further. If it cannot dealt with that, you could try to discretize the age into ranges.

Edit :So I found this: https://support.bioconductor.org/p/109814/ with a similar setting, indicating it should work just fine. Try to follow the work-flow given there.

0
Entering edit mode

I have been trying to solve the issue, even without Age as a covariate I still get the same error. With the design matrix you suggested containing the intercept, its not possible to build up a contrast matrix.

1
Entering edit mode

I think you should follow the advice by Gordon, it seems like gender was encoded twice, if you encode group only with levels yes and no instead of yes_woman, no_woman, etc. it might work.

4
Entering edit mode
26 days ago

You can't do group and gender together, because all your "_man" samples are all men and all your "_woman" are all women.

Pick group or gender, not both Frankly, group will be hard with so few replicates, and only one yes_male and yes_female.

The ages all being unique can be a problem if the software thinks Age is a character string, but if it understands it's a number, I think it will work. Though with so few samples, I'm not sure how good a model of age contributions can be.

0
Entering edit mode

Yes, that must be it. Shouldn't the group also be independent of gender? I guess that group is some sort of treatment or disease status. Now it looks like they are combined. So group should be either yes or no. Still there are too few samples in the yes group.

0
Entering edit mode

OP has only shown you the first 11 rows of the pheno table -- see the [...] entries indicating that there are further rows that have not been shown. So you do not actually know how many samples or replicates there are and there may be plenty.

0
Entering edit mode

Yes, there are 52 samples, I didn't post them all to avoid too much information on my post.

4
Entering edit mode
26 days ago
Gordon Smyth ★ 3.5k

The design matrix is over-parametrized because you have included gender in the model twice, once as part of the group factor and then again a factor in its own right.

You can see the problem by looking at the design matrix. There are only four possible combinations of yes/no and male/female yet you have created a design matrix with 5 columns relating to group and gender. How would you expect to estimate five coefficients from only four groups?

You need to make a decision about what model you are trying to fit. If you want to allow separate yes/no effects for each gender than you should define:

design <- model.matrix(~0+Group+Age)


If you are willing to assume similar yes/no effects for each gender then you should define:

design <- model.matrix(~YesNo+Gender+Age)


where YesNo is a factor taking levels yes or no.

You have to make one of these choices -- you cannot somehow do both!

0
Entering edit mode

I was thinking about the post, now it makes sense, indeed it is over-parametrized. My aim is to see the differential gene expression in sick females compared to sick males using the contrast matrix, corrected for the covariates gender and age.

If I use this design matrix, will I be still correcting for gender as well?:

design <- model.matrix(~0+Group+Age)

1
Entering edit mode

You will be fully accounting for gender, as your proposed contrasts clearly show. The concept of "correcting for gender" does not quite apply here and is not the right way to think about the model.