Question: edgeR DGElist Error: Negative counts not allowed.
0
gravatar for ma23
11 weeks ago by
ma2320
ma2320 wrote:

Hi !

I have a table (.tsv) with data, here are several rows from the top:

GeneID  SRR1177686      SRR1026955      SRR1027004      SRR1026928      SRR1177692      SRR1026905      SRR1026942      SRR1177684      SRR1026984      SRR1026912

ENSG00000223972 0       1       0       1       0       0       0       0       1       0       0       0       1       1       0       0       0       0       1

ENSG00000227232 32      38      34      81      38      15      7       47      93      68      4       17      24      31      46      19      0       69      41

ENSG00000278267 2       6       7       11      7       1       4       10      5       2       1       0       1       1       4       2       0       27      2

I read the file into R using read.delim:

countDF <- read.delim("/storage/DATA/DEE2/combined_DGE_table.tsv", row.names=1, check.names=FALSE)

And then I want to create a DGEList object:

dge <- DGEList(count=countDF)

But it throws an error "Negative counts not allowed".

What should I do to avoid that ? Check if there are any NA or negatives in my data and remove them ?

edger • 333 views
ADD COMMENTlink modified 11 weeks ago by Gordon Smyth990 • written 11 weeks ago by ma2320
2

Try to see how your countDF looks like, and see if there are indeed negatives or something else that is weird.

summary(countDF) # Look by min. of each sample to see if they are 0 or negative
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Benn7.8k
2

If you correct for batch effect via ComBat, then you will likely have negative counts. This is why nobody should be using ComBat for RNA-seq counts.

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kevin Blighe49k
1

I'm confused by your DGEList() call.

What is the object y?

You should be using a command like: dge <- DGEList(countDF)

ADD REPLYlink written 11 weeks ago by benformatics1.1k

May be you could suggest some pipline or something for estimating negative binomial (NB) distribution parameters ? I want to find out whether the data satisfy the law of NB.

ADD REPLYlink written 11 weeks ago by ma2320
2

Did you find any negative counts in your df? Or other oddities? If yes, your pipeline to produce a count table is not correct. It is no use to continue with such a set, so don't bother going into testing NB distribution or not. It would be helpful to show what you did to generate the count table, and how it looks like in your R environment (like I commented before, try to see how your countDF looks like, inspect it for errors, etc.).

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Benn7.8k

There are no negatives in my data, I have checked..

ADD REPLYlink written 11 weeks ago by ma2320

If you run str(MyData[,1:20]), what is the output?

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Kevin Blighe49k
    'data.frame':   58302 obs. of  20 variables:
     $ SRR1177686: int  0 32 2 1 0 0 0 0 0 152 ...
     $ SRR1026955: int  1 38 6 0 0 0 0 0 0 19 ...
     $ SRR1027004: int  0 34 7 0 0 0 0 0 0 32 ...
     $ SRR1026928: int  1 81 11 0 0 0 0 0 0 55 ...
     $ SRR1177692: int  0 38 7 0 0 0 0 2 0 11 ...
     $ SRR1026905: int  0 15 1 0 0 0 0 0 0 12 ...
     $ SRR1026942: int  0 7 4 0 0 0 0 0 0 14 ...
     $ SRR1177684: int  0 47 10 0 0 0 0 0 0 38 ...
     $ SRR1026984: int  1 93 5 0 0 0 0 0 0 33 ...
     $ SRR1026912: int  0 68 2 0 0 0 0 0 0 227 ...
     $ SRR1026987: int  0 4 1 0 0 0 0 0 0 11 ...
     $ SRR1026974: int  0 17 0 0 0 0 0 0 0 8 ...
     $ SRR1026960: int  1 24 1 0 0 0 0 0 0 2 ...
     $ SRR1177678: int  1 31 1 0 0 0 0 0 0 67 ...
     $ SRR1026939: int  0 46 4 0 0 0 0 0 0 43 ...
     $ SRR1177734: int  0 19 2 0 0 1 0 0 0 18 ...
     $ SRR1026963: int  0 0 0 0 0 0 0 0 0 9 ...
     $ SRR1177721: int  0 69 27 0 0 0 0 0 0 57 ...
     $ SRR1026886: int  1 41 2 0 0 0 0 0 0 50 ...
     $ SRR1026897: int  0 0 0 0 0 0 0 0 0 0 ...
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by ma2320

...and what happens when you coerce it to a data matrix?

DGEList(count=data.matrix(countDF))

...or, possibly:

DGEList(count=as.matrix(countDF))

And also the output of:

table(countDF>=0)
ADD REPLYlink written 11 weeks ago by Kevin Blighe49k

The output of DGEList(count=data.matrix(countDF)):

Repeated column names found in count matrix
An object of class "DGEList"
$counts
                SRR1177686 SRR1026955 SRR1027004 ......
                ........
                ........
                ........
                SRR5981512 SRR5980983 SRR5980651 SRR5981590 SRR5980436
 [ reached getOption("max.print") -- omitted 5 rows ]
58297 more rows ...

$samples
  group lib.size norm.factors
1     1  9212384            1
2     1 11409546            1
3     1 12393257            1
4     1 16799268            1
5     1  9694151            1
4131 more rows ...

Warning message:
In DGEList(count = data.matrix(countDF)) : library size of zero detected

The output for DGEList(count=as.matrix(countDF)):

Error: Negative counts not allowed

And the output for table(countDF>=0):

TRUE 
240612354 
Warning messages:
1: In Ops.factor(left, right) : '>=' not meaningful for factors
2: In Ops.factor(left, right) : '>=' not meaningful for factors
3: In Ops.factor(left, right) : '>=' not meaningful for factors
4: In Ops.factor(left, right) : '>=' not meaningful for factors
5: In Ops.factor(left, right) : '>=' not meaningful for factors
6: In Ops.factor(left, right) : '>=' not meaningful for factors
7: In Ops.factor(left, right) : '>=' not meaningful for factors
8: In Ops.factor(left, right) : '>=' not meaningful for factors
9: In Ops.factor(left, right) : '>=' not meaningful for factors
ADD REPLYlink written 11 weeks ago by ma2320

There seems to be something wrong with your data frame (but we can't see what from your output), the last command suggests factors instead of integers. Can you also check the whole data frame (instead of only the first 20 columns)?

sapply(countDF, class)
ADD REPLYlink written 11 weeks ago by Benn7.8k

I can't put here the hole output becouse it is too large. Here is a part of it:

SRR1177686 SRR1026955 SRR1027004 SRR1026928 SRR1177692 SRR1026905 SRR1026942 SRR1177684 
"integer" "integer" "integer" "integer" "integer" "integer" "integer" "integer"
...............................................................................................................
SRR2049767  GeneID SRR3235827 SRR3235971 SRR3235963 SRR3235891 SRR3236033 SRR3235882 
"integer"   "factor"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
...............................................................................................................
SRR3235892 SRR3235793 SRR3235839 SRR3235932 SRR3235878 SRR3236073 SRR3235854 SRR3235818 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
 [ reached getOption("max.print") -- omitted 3136 entries ]
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by ma2320

Yeah, my point was, that you could look for any "factor" columns instead of "integer", the warning seems to find 9 "factors". Trouble shoot your own data is part of bioinformatics, try to figure out how your data frame looks like. I suspect some columns with factors at the end maybe? Maybe try to see the last 9 (how many columns does your data frame has?).

sapply(countDF[, 3000:3009], class) # See the last 9 columns assuming you have 3009 columns
ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Benn7.8k

Here they are:

SRR5981836 SRR5982070 SRR5980672 SRR5981018 SRR5981994 SRR5981512 SRR5980983 SRR5980651 
 "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer"  "integer" 
SRR5981590 SRR5980436 
 "integer"  "integer"
ADD REPLYlink written 11 weeks ago by ma2320

Okay I hope you can trouble shoot it (yourself), check if the dimensions are the same as expected, if the headers and rownames are correct, etc. Good luck!

ADD REPLYlink modified 11 weeks ago • written 11 weeks ago by Benn7.8k
1

BTW, there is GeneID as a column in your dataset, see your two posts up. This is a factor, like expected your data frame is not correct.

ADD REPLYlink written 11 weeks ago by Benn7.8k

Yes, I see. But I don't understand why is it there .. It should only be the first column.

ADD REPLYlink written 11 weeks ago by ma2320
1

I don't understand either, I haven't seen your code. But at least now you know what the problem is so you can fix it. Good luck with it!

ADD REPLYlink written 11 weeks ago by Benn7.8k
1

Ok, thank you for the help !

ADD REPLYlink written 11 weeks ago by ma2320

Both DESeq2 and EdgeR (I believe) model your data as a negative binomial. There is in-built QC in these packages for you to check how good was the model fit

ADD REPLYlink written 11 weeks ago by Kevin Blighe49k
2
gravatar for Friederike
11 weeks ago by
Friederike5.2k
United States
Friederike5.2k wrote:

What should I do to avoid that ? Check if there are any NA or negatives in my data

Yes.

and remove them ?

No. You should find out why these values are in your count table to begin with and re-generate the count table in a way that will not have values that clearly cannot represent read counts.

ADD COMMENTlink written 11 weeks ago by Friederike5.2k
2
gravatar for Gordon Smyth
11 weeks ago by
Gordon Smyth990
Australia
Gordon Smyth990 wrote:

From the information shown in your comments above, it appears that your file combined_DGE_table.tsv actually has 10 columns of GeneIDs instead of just one. It looks like someone has merged together 10 different data.frames of counts and written out the combined data.frame to the tsv file without removing the GeneID columns. The first column of each of the individual data.frames would have contained GeneIDs, so there are 10 columns of GeneIDs in the combined file.

When edgeR tries to interpret the data.frame as a matrix, the GeneID columns will force the matrix to be of character mode instead of numeric, and any GeneID that starts with punctuation (like _ENS) will be interpreted by R as being negative. Hence the error message.

You can remove the GeneID columns by

cl <- sapply(countDF, class)
j <- which(cl == "factor")
countDF <- countDF[, -j]
dge <- DGEList(countDF)

If you upgrade to the latest Bioconductor release, which is Bioconductor 3.9 for R 3.6.0, the DGEList() function now issues a much more informative error message in this type of situation.

ADD COMMENTlink modified 11 weeks ago • written 11 weeks ago by Gordon Smyth990

Ok, I understand it now, thank you!

ADD REPLYlink written 11 weeks ago by ma2320
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1266 users visited in the last hour