Question: logFC in topTable problem- almost identical to average expression
1
gravatar for pennakiza
7 days ago by
pennakiza40
pennakiza40 wrote:

Hi all,

I am trying to analyze a dataset of samples that had early or late progression and I am using limma.

My code is as follows:

design<-model.matrix(~ 0 + df$progression, data = df)
colnames(design)<- levels(df$progression)
contr<-makeContrasts(EvsL=Early-Late, LvsE=Late-Early, levels=design)
y_des_df<-voom(y_df, design = design, normalize.method = "quantile")
y_des_df.fit<-lmFit(y_des_df, design = design)
y_des_df.fit<-contrasts.fit(y_des_df.fit, contrasts=contr)
y_des_df.fit<-eBayes(y_des_df.fit) 
topTable_df<-topTable(y_des_df.fit, p.value = 0.05, coef="LvsE", n=Inf, adjust.method = "BH", sort.by="logFC")
head(topTable_df)

And this is my topTable:

GeneID Chr Length Symbol   type_of_gene    logFC  AveExpr         t
1915    1915   6   4090 EEF1A1 protein-coding 13.91672 13.91672 173.13201
4512    4512   M   1542   COX1 protein-coding 12.91041 12.95493 115.00098
7178    7178  13   4814   TPT1 protein-coding 12.75672 12.61048  86.01986
567      567  15   1588    B2M protein-coding 12.69427 12.52030  91.45237

My problem is that the logFC is very similar to my AveExpr and if I look in my count data I see no real difference between different samples, for example:

    Early_1 Early_2 Early_3 Late_4 Late_5 Late_6 
COX1        13.06414        12.60298      13.06414      13.06414 13.06414      12.87005

I assume that there is something wrong with my contrasts but I can't understand what. Is anyone able to help me please?

Many thanks!

voom rna-seq limma bioconductor R • 216 views
ADD COMMENTlink modified 5 days ago by russhh4.8k • written 7 days ago by pennakiza40
1

That is weird. Maybe it's because your contrast are essentially the same? Early vs Late is the same as Late vs Early (only the logFC direction will change). I think you don't need the contrast at all because your design is simple (AvsB).

ADD REPLYlink written 7 days ago by Amar620

Hi Amar, I've tried it completely without contrasts too and the results are the same. Also, when I use a different contrast (or different coefficient in the case of no contrasts) the results look exactly the same!

ADD REPLYlink written 7 days ago by pennakiza40

...a bizarre coincidence, or, like, winning the national lottery? Can you explain a bit more about the data itself. You have not shown your pre-processing steps.

ADD REPLYlink written 7 days ago by Kevin Blighe52k

Hi Kevin,

Yes, I know, but not as fun as winning the lottery! :/

I've used feature counts and edgeR, as follows:

fc_df <- featureCounts(files=filenames,annot.inbuilt="hg38",isPairedEnd = T)
colnames(fc_df$counts)<-df$Sample

y_df <- DGEList(counts = fc_df$counts, genes = Ann)

isexpr_df <- rowSums(cpm(y_df) > 0.5) >= 2
y_df <- y_df[isexpr_df, keep.lib.sizes=FALSE]
y_df<-calcNormFactors(y_df)

That's pretty much it, I'm not sure if you need anything else?

Thanks very much!

ADD REPLYlink written 6 days ago by pennakiza40

Thanks! How does a MDS plot appear?

plotMDS(y_d)

?

What is the output of

table(isexpr_df)

?

Also, what if you avoid quantile normalisation and instead just use:

y_des_df <- voom(y_df, design = design)
ADD REPLYlink written 6 days ago by Kevin Blighe52k

Thank you Kevin!

Same results without the normalisation!!

isexpr_df
FALSE  TRUE 
13490 14905

And my MDS plot looks like that: (Red-Late, Green-Early)

https://ibb.co/yy0nqF6

ADD REPLYlink modified 6 days ago • written 6 days ago by pennakiza40

The number of genes failing your is expressed filter is quite high, which is interesting in itself. Were the raw counts generally low for all samples?

ADD REPLYlink written 6 days ago by Kevin Blighe52k

Yeah, I can't say that they're massively high...

for example:

WTCHG_444413_216191 WTCHG_444413_217108
100287102                   0                   1
653635                    140                 244
102466751                   0                   6
100302278                   0                   0
645520                      0                   0
79501                       0                   0
729737                     32                 190
102725121                   1                   9
102723897                 127                  33
102465909                   0                   0
729759                      0                   0
100132287                   0                   1
105378947                   0                   0
101928626                   0                   0
102465432                   4                5517
81399                       0                   0
100133331                  34                  33
100288069                  42                  47
400728                     10                   9
ADD REPLYlink written 6 days ago by pennakiza40

Those are genes as rows, right? The number of zeros seems high even after you filtered the data? I have used DESeq2 much more than limma / voom, so, I'm not sure how voom handles zeros.

ADD REPLYlink written 6 days ago by Kevin Blighe52k

Oops, sorry, wrong data set! Here's the correct one, that's before the calcNormFactors:

WTCHG_444413_209107 WTCHG_444413_225109 WTCHG_444413_214167
653635                    334                 270                 406
102466751                   9                   7                   7
729737                    165                 264                 278
102723897                  61                  71                  85
102465432                   3                   4                   9
100133331                   4                  37                  25
100288069                  43                  74                  31
400728                     14                   8                  11
79854                      10                  10                  20
643837                    156                 218                 193

And yes, the rownames are the transcript ids

ADD REPLYlink written 5 days ago by pennakiza40

It has come to the point whereby, at least from my perspective, it would literally require an examination of the input and output of every single command that is being run, i.e., in order to ensure that each is doing what you expect. The fact remains that it could still just be coincidence, I suppose.

Another option: post this on the Bioconductor support forum (and link back to this thread), where Gordon Smyth (limma and EdgeR 'mastermind') will pick it up.

ADD REPLYlink written 5 days ago by Kevin Blighe52k

I just tried everything using your exact code but using random data, and was not able to reproduce the problem. This code is reproducible:

df <- data.frame(
  progression = c('Early', 'Late', 'Late', 'Late', 'Early', 'Early'),
  sample = c('SJA023044', 'SJH027017', 'SJJ027008', 'SJK023006', 'SJR026007', 'SJS025009'),
  row.names = c('WTCHG_444413_209107', 'WTCHG_444413_225109', 'WTCHG_444413_214167', 'WTCHG_444413_226121', 'WTCHG_444413_215179', 'WTCHG_444413_216191'))

fc_df <- data.frame(row.names = paste0('gene', 1:1000))
fc_df$counts <- round(matrix(rexp(6000, rate=.1), ncol=6), 0)
rownames(fc_df$counts) <- rownames(fc_df)
colnames(fc_df$counts) <- df$sample

Ann <- data.frame(
  chr = 1:1000,
  ann1 = paste0('ann', 1:1000),
  row.names = rownames(fc_df))


require(edgeR)

y_df <- DGEList(counts = fc_df$counts, genes = Ann)
isexpr_df <- rowSums(cpm(y_df) > 0.5) >= 2
y_df <- y_df[isexpr_df, keep.lib.sizes=FALSE]
y_df<-calcNormFactors(y_df)

design<-model.matrix(~ 0 + df$progression, data = df)
colnames(design)<- levels(df$progression)
contr<-makeContrasts(EvsL=Early-Late, LvsE=Late-Early, levels=design)
y_des_df<-voom(y_df, design = design, normalize.method = "quantile")
y_des_df.fit<-lmFit(y_des_df, design = design)
y_des_df.fit<-contrasts.fit(y_des_df.fit, contrasts=contr)
y_des_df.fit<-eBayes(y_des_df.fit) 
topTable_df<-topTable(y_des_df.fit, p.value = 1, coef="LvsE", n=Inf, adjust.method = "BH", sort.by="logFC")
head(topTable_df)
ADD REPLYlink written 5 days ago by Kevin Blighe52k
1

Thanks Kevin, I ran it again from scratch, it seems now that my results are not significant, since the toptable is empty.

Thanks for all you help!

ADD REPLYlink written 5 days ago by pennakiza40

I just noticed this thread on Bioconductor, too: https://support.bioconductor.org/p/126919/

ADD REPLYlink written 2 days ago by Kevin Blighe52k
1

can you post both your design matix and your contrasts matrix, please

ADD REPLYlink written 6 days ago by russhh4.8k

That's my contrasts matrix:

 Contrasts
Levels  EvsL LvsE
  Early    1   -1
  Late     -1    1

And that's my design matrix:

   Early Late
7      1   0
23     0   1
29     0   1
34     0   1
43     1   0
47     1   0
attr(,"assign")
[1] 1 1
attr(,"contrasts")
attr(,"contrasts")$`df$progression`
[1] "contr.treatment"
ADD REPLYlink written 5 days ago by pennakiza40

What does df look like? It looks like this is testing the intercept, which shouldn't actually exist.

ADD REPLYlink written 6 days ago by Devon Ryan93k

Hi Devon,

This is how my df looks like:

WTCHG_444413_209107   Early SJA023044   
WTCHG_444413_225109     Late SJH027017   
WTCHG_444413_214167     Late SJJ027008  
WTCHG_444413_226121     Late SJK023006   
WTCHG_444413_215179   Early SJR026007  
WTCHG_444413_216191   Early SJS025009
ADD REPLYlink modified 6 days ago • written 6 days ago by pennakiza40

In the Toptable for COX1, your logFC is equal to the mean of the (Early_1, Early_2, Early_3) values (which is incorrect; it should equal the mean of the 'late's, minus the mean of the 'early's), whereas the AveExpr value is equal to the mean across all samples (which is correct). Very odd. Are there only two levels in your progression column?

ADD REPLYlink written 6 days ago by russhh4.8k

Yes, indeed it's very strange! These are the only too levels in my progression column!

ADD REPLYlink written 5 days ago by pennakiza40
3
gravatar for russhh
5 days ago by
russhh4.8k
UK, U. Glasgow
russhh4.8k wrote:

Can we just rewrite the script so that you're not overwriting variables (which is sometimes a source of errors) and putting an explicit coef for the contrast that you're interested in within the design matrix.

~~~~

# parameterise the model differently (fits Intercept and Late-vs-Early coef)
design <- model.matrix(~ progression, data = df)
colnames(design) <- c("intercept", levels(df$progression)[-1])

# remove the early-vs-late comparison because it duplicates the late-vs-early comparison
contr <- makeContrasts(LvsE = "Late", levels=design)

y_des_df <- voom(y_df, design = design, normalize.method = "quantile")

# initial fit
y_des_df.fit <- lmFit(y_des_df, design = design)

# contrasts fit
y_des_df.contrastfit <- contrasts.fit(y_des_df.fit, contrasts=contr)

# regularised contrasts
y_des_df.eb <- eBayes(y_des_df.contrastfit)

topTable_df <- topTable(y_des_df.eb, p.value = 0.05, coef="LvsE", n=Inf, adjust.method = "BH", sort.by="logFC")
head(topTable_df)

~~~~

ADD COMMENTlink written 5 days ago by russhh4.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 979 users visited in the last hour