Question

How many tests does it take to warrant a microarray/RNA-seq approach to differential gene expression analysis?

1

Entering edit mode

7.4 years ago

lucas.james.wager ▴ 10

Background Information:

I have 53 genes of interest. I have two conditions Tumour and Normal samples at 24, 48 and 72 hours from three mice (i.e. 3xTumour and 3x Normal for each timepoint). The Tumours and Normal skin are sampled repeatedly (i.e. each mouse has Tumor and Normal samples taken at each time point). I have measured the expression of all my genes of interest (53) in each sample using qRT-PCR.

If these data were RNA-seq or microarray I would turn to something like DEseq2 to perform a differential expression analysis. If I only had up to 10 genes of interest I would have no concern to use a repeated measures ANOVA or linear models to investigate differential expression between my two conditions.

However I feel like I am somewhere in between each method, as correcting for multiple testing for 53 ANOVAs is diminishing my Pvalues.

Here are 20 of the genes of interest over the 24, 48, and 72 hours (for now ignore the green 0H data points). I am interested in the difference between conditions.

I am aware that N=3 is a very small sample, but this is the nature of biological sciences (especially during a PhD). With this in mind humour me in your answer to my questions.

Questions:

After how many measurements does one decide that they are performing enough tests to warrant a software package such as DEseq2 (i.e. a microarray/RNA-seq approach) to investigate differential expression and how many tests are not enough?

Are there tests or packages that better deal with low to moderate numbers of statistical tests?

In the case that I am not raising a valid concern (quite possible), what would be the most powerful approach to discerning any difference in gene expression that exists between tumour and normal tissue, at any point in time, in my study?

Thanks in advance

-Lucas

gene R • 2.6k views

ADD COMMENT • link updated 7.4 years ago by Devon Ryan 104k • written 7.4 years ago by lucas.james.wager ▴ 10

2

Entering edit mode

As a remark: the assumption in tools such as DESeq2 and edgeR is that most genes in the dataset are not differentially expressed. I don't think that is the case for your selection? Did you quantify housekeeping genes using qRT-PCR?

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

I did quantify housekeeping genes, however these are not shown in my selection, but I use them to create these graphs. I have plotted the DELTA Ct of my genes of interest calculated by subtracting the average Ct of three housekeeping genes at each time point for each condition, converting to fold change and then normalising to the 0H point. To comment on your remark regarding the presence of many changes in my genes of interest, this difference may not be as prominent as it seems upon first glance. I have allowed a free y scale to exemplify the patterns in the genes individually (patterns I would like to see). However these patterns diminish when the Y scale is kept constant. I will try to test for differential expression with DEseq2. As it will consider all the genes at once perhaps the changes that seem apparent here will not have a detrimental effect on the assumption of consistency for the majority of genes. I have attached the same genes graphed with a consistent Y scale.

Same genes with a consistent Y scale

ADD REPLY • link 7.3 years ago by lucas.james.wager ▴ 10

1

Entering edit mode

Did you try something like bioconductor HTqPCR? I never used it myself, but if I had data like yours I would probably try to find something in bioconductor first.

ADD REPLY • link 7.4 years ago by Benn 8.3k

0

Entering edit mode

No I haven't yet. Now it's installed and I'll give it a go. Thanks for the suggestion.

ADD REPLY • link 7.3 years ago by lucas.james.wager ▴ 10

0

Entering edit mode

So I tried HTqPCR, but found that I was unsure of what was happening behind the scenes, so I resorted to limma - which I understand was implemented in HTqPCR anyway.

I have come to a point of confusion. There is an example of applying limma to differentiate between the expression of 3 replicates forming two groups (dummy data) given by Gordon Smyth in the limma package vignette (bottom of page). In this example there is a comment "Would like to consider original two estimates plus difference between first 3 and last 3 arrays" that I do not understand. This comment and following method of constructing a contrast matrix seems at odds with another BiostarUser for investigating the gene DE between groups that do not "consider original two estimates" and examples I have followed in the limma user guide (section 9.2). I'm sure this comes down to my lack of understanding when it comes to contrast matrices. However I have been unable to resolve this myself.

Help with understanding the difference between the two methods is greatly appreciated.

Smyth method

#  Simulate gene expression data: 6 microarrays and 100 genes
#  with one gene differentially expressed in first 3 arrays
M <- matrix(rnorm(100*6,sd=0.3),100,6)
M[1,1:3] <- M[1,1:3] + 2
#  Design matrix corresponds to oneway layout, columns are orthogonal
design <- cbind(First3Arrays=c(1,1,1,0,0,0),Last3Arrays=c(0,0,0,1,1,1))
fit <- lmFit(M,design=design)
#  Would like to consider original two estimates plus difference between first 3 and last 3 arrays
contrast.matrix <- cbind(First3=c(1,0),Last3=c(0,1),"Last3-First3"=c(-1,1))
fit2 <- contrasts.fit(fit,contrast.matrix)
fit2 <- eBayes(fit2)
#  Large values of eb$t indicate differential expression
results <- classifyTestsF(fit2)
vennCounts(results)

vs.

Biostar user method

library(limma) 
celFiles <- list.celfiles(path_to_cel_files,full.names=TRUE)
affyExpressionFS <- read.celfiles(celFiles)
eset <- rma(affyExpressionFS, background = TRUE, normalize = TRUE)
stress <- c(1,1,1,0,0,0) # first three replicates (case)
control <- c(0,0,0,1,1,1) # last three replicates (control)
design <- cbind(stress, control) # first term is numerator, second denominator of the ratio
fit <- lmFit(eset, design)
cont.matrix <- makeContrasts(STvsCO=stress-control, levels=design)
fit2 <- contrasts.fit(fit, cont.matrix)
fit2 <- eBayes(fit2)
tab <- topTable(fit2, adjust = "BH", confin =T , number = 1000, sort.by="logFC")

ADD REPLY • link 7.3 years ago by lucas.james.wager ▴ 10

2

Entering edit mode

I don't exactly understand your question, but I think you're better off asking Gordon Smyth himself for help. He's active on https://support.bioconductor.org/ and usually answers questions about limma within a day. Be sure to add limma as a tag.

ADD REPLY • link 7.3 years ago by Benn 8.3k

score 2 · Answer 1 · 2017-01-16

2

Entering edit mode

7.4 years ago

Devon Ryan 104k

You start gaining power by using limma/DESeq2/et al. once you have more than one gene.
Yes, limma/DESeq2/etc. (for RNAseq, for qPCR there's a different bioconductor package that can probably do most of the analysis for you).
What WouterDeCoster said.

ADD COMMENT • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

Can we really use DESeq2 with continuous data (vs integer count) ? It's a genuine question, I really don't know.

ADD REPLY • link 7.4 years ago by Carlo Yague 8.7k

0

Entering edit mode

It's not that it's been designed for qPCR, but tximport for kallisto/salmon pseudocounts isn't exclusively for integers.

ADD REPLY • link 7.4 years ago by WouterDeCoster 47k

0

Entering edit mode

It appears that DESeq2 still only accepts integer input (in fact, the DESeqDatasetFromTxImport function rounds them internally (https://github.com/Bioconductor-mirror/DESeq2/blob/810987ffc360847554b8a0b3f2cca430fc834ca9/R/AllClasses.R#L338-L341 ).

ADD REPLY • link 7.4 years ago by Devon Ryan 104k

0

Entering edit mode

Thank you for the response. The idea of rounding these values is of concern, as I am dealing with such small changes. After internal rounding I am likely to end up with a great deal of '1's.

Would a consistent multiplication of the inputs before analysing them with DEseq be a bad idea? For example, if raw inputs to DEseq, 1.48 and 0.84, were tested without multiplication they would both be rounded to 1 and show no difference, but by multiplying by 10 they would round to 15 and 8, making the tests more sensitive(?).

I guess the question would then be one of the relevance of any changes found. I'm interested in your opinion on this.

ADD REPLY • link 7.3 years ago by lucas.james.wager ▴ 10

1

Entering edit mode

I'm not sure how nicely that'd play with a negative binomial fit. In any case, you could always use edgeR or limma/voom instead.

ADD REPLY • link 7.3 years ago by Devon Ryan 104k

0

Entering edit mode

Using edgeR, DESeq or DESeq (or any NB-based software) for PCR data is not at all appropriate, and for reasons that run much deeper than whether the values are integer or not. Use limma or HTqPCR (a limma wrapper) instead (and ordinary limma, not voom).

ADD REPLY • link 5.7 years ago by Gordon Smyth ★ 7.2k