How can you use normalised count matrix for differential gene expression analysis?
1
0
Entering edit mode
10 months ago

Hi,

I tried to perform a differential gene expression analysis using Limma for some data that resemble microarray data. They are obtained from nanostring and when I performed the Limma I get logFC values that are in the 100s like -500 etc. I don't know if I am doing something wrong with the data set. I tend to make an ExpressionSet using the normalised counts matrix and metadata.

Should I be trying to convert the normalised counts matrix into log-ratios or log-expression values in which case how would I fo about doing this?

R RNA-Seq gene • 722 views
1
Entering edit mode
10 months ago

For analysis options with NanoString, see my recent answer, here: A: Make heatmap for RCC files in Rstudio(NanoStringNorm)

If you would like further assistance, then please show all processing steps (i.e., the code) that you have run.

Kevin

0
Entering edit mode

Thank you Kevin, this is the code I am using. I'm not sure if I am missing any steps.

> pacman::p_load(pacman, rio, ggplot2, htmltools, DESeq2, limma, Biobase, GOexpress)

> Colorectal <- read.csv("C:/Users/Bigitha/Desktop/allnoc6.csv", row.names = 1)

> Colorectal <- as.matrix(Colorectal, rownames=TRUE)

> class(Colorectal)

> dim(Colorectal)

> colnames(Colorectal)

> head(Colorectal) #this is the normalised target count matrix

A1       A2       A3       A4       A5       A6       A7       A8
A2M    111.37228 93.43759 96.10967 68.92820 56.74304 33.45115 40.78395 43.74823
ABCB1   32.25881 39.19703 31.37352 25.84559 37.22006 28.23869 42.81308 25.45319
ABCF1   43.50197 39.45032 42.76654 30.98827 45.42456 61.10323 54.30966 41.83873
ABL1    47.23076 64.66280 47.58509 53.44226 61.02018 85.28357 60.83633 40.86467
ACOT12  32.04164 33.79165 29.51108 50.78428 33.70963 25.25458 34.68069 24.68044
ACSF3   58.79355 51.97980 50.39916 45.36814 57.41808 65.82973 56.23274 53.22973

> phenoData <- new("AnnotatedDataFrame", data = metadata)

> eSet <- ExpressionSet(assayData = Colorectal, phenoData = phenoData)

> design <- model.matrix(~kras, data = pData(eSet))

> colSums(design)

> table(pData(eSet)[,"kras"])

> fit <- lmFit(eSet, design)

> fit <- eBayes(fit)

> df <- topTable(fit, coef=2, n=2000)

logFC   AveExpr         t      P.Value    adj.P.Val        B
POLR2A  -23.92406  72.30563 -8.136987 2.084201e-10 3.801583e-07 8.495753
FOXA1    15.08336  35.97589  7.918481 4.343634e-10 3.961394e-07 8.080353
CAPN2   -39.77087  73.21918 -6.983658 1.044160e-08 6.348492e-06 6.219311
CMTM6   -39.12816  76.75838 -6.765166 2.208931e-08 1.007273e-05 5.766454
RAC3     26.40288  75.76756  6.695406 2.806665e-08 1.023871e-05 5.620601
LAMP1   -96.08082 157.26962 -6.320483 1.017535e-07 2.756852e-05 4.827166
CACNG4   10.32207  33.20535  6.309130 1.058002e-07 2.756852e-05 4.802907
SETD2   -13.77444  46.41535 -6.217496 1.449312e-07 3.304431e-05 4.606658
OAZ1   -248.20727 435.55159 -6.078048 2.338887e-07 4.597504e-05 4.306551
STK11   -33.33797  94.85910 -6.056238 2.520561e-07 4.597504e-05 4.259463

0
Entering edit mode

I see, but how was the data originally normalised?

0
Entering edit mode

The file has normalisation method as THIRD_QUARTILE.

0
Entering edit mode

You should check the distribution of this data via a box-and-whiskers plot and histogram.

0
Entering edit mode

The normalised data?

I also recently got the initial raw counts for this data which has the around 5 probes for each gene which I have summed together. Could I use this to perform a differential expression analysis, if so should I be using limma or EdgeR? I have information regarding which genes are control and endogenous if that is also useful.

1
Entering edit mode

It is important to first understand that NanoString is not like microarray. It is a count-based method that produces data that follows a negative binomial / Poisson-like distribution, just like bulk and single-cell RNA-seq.

When you originally used limma, the assumption would have been that your data was already normalised / transformed to follow a normal distribution, which is what limma expects. This is why I asked you to generate a histogram and box-and-whiskers [to check the distribution].

You can use DESeq2 and EdgeR to process NanoString, but not if your data has already been normalised to follow a normal distribution. EdgeR and DESeq2 take raw count data that follows a negative binomial, so, you'd need the raw NanoString counts.

See my other answer: A: Make heatmap for RCC files in Rstudio(NanoStringNorm)

There are also answers on this topic on Bioconductor Support Forum.

0
Entering edit mode

Thank you Kevin, I was able to obtain a bit more background information on my data. It is using Nanostring's RNA assay with next-generation sequencing readout.

My raw counts look like this:

ProbeDisplayName    TargetName  A1  A2  A3
A2M_05  A2M 142 48  79
A2M_04  A2M 91  35  60
A2M_02  A2M 131 55  136
A2M_03  A2M 184 65  125
A2M_01  A2M 73  36  59
ABCB1_01    ABCB1   33  20  27
ABCB1_04    ABCB1   27  21  29


I have summed up this file according to the target name to look like:

    A1  A2  A3
A2M 621 239 459
ABCB1   174 98  142


I think I'm going to leave the normalised data for now as it looks like there are some errors in how the data might be normalised instead I am thinking of using the raw count data. Would it be a good idea to run the above summed counts through EdgeR or DESeq2? If so, are there any particular variables like housekeeping genes, negative probe spike-in I should be including when performing DESeq2 or EdgeR?

Thank you in advance for your help:) I am very new to bioinformatics so every piece of help you can give is much appreciated!

0
Entering edit mode

The housekeeping genes and positive | negative controls should be defined in the accompanying annotation files that you [should have] received (?). The housekeepers can be used with DESeq2 via RUV-seq, as per the other post to which I linked (above), and also here: http://supportupgrade.bioconductor.org/p/109778/#109779

I don't know why you had multiple probes targeting the same gene, and I'm not sure that summing these is the correct procedure. If they are all targeting the same transcript isoform, then the mean may make more sense (?). I would check this with the company / group / collaborator who did the experiment.

1
Entering edit mode

If all else fails, I would just use nSolver on Windows.

0
Entering edit mode

However, if I was to take the mean then I wouldn't be able to use DESeq2 or EdgeR right? As they both only take raw counts?

0
Entering edit mode

It really depends on what are these values that map to the same gene. I would like to understand that before knowing what to do next.

Taking a mean raw count is no major issue to EdgeR or DESeq2. You would have to round the mean to integer value though.

0
Entering edit mode

I did some digging into what the values are and according to a webinar on this particular experiment type done by nanostring the values are:

"Multiple probes targeting the same transcript are molecularly barcoded and quantitated such that there are up to 10 independent counts per transcript which allows robust quantification so that during the analysis if there are outlier probes can be identified and filtered out."

I've also added the link to the webinar at the time point where the speaker was explaining this.

https://youtu.be/wHiMR5ok8kE?t=213

But it seems they average the probes, so if that is the case then like you said it would be to perform EdgeR or DESeq2 using the means that are rounded up to integer value?

1
Entering edit mode

But it seems they average the probes, so if that is the case then like you said it would be to perform EdgeR or DESeq2 using the means that are rounded up to integer value?

Indeed, I suppose so; however, you could find a way to detect outlier probes via a simple metric like standard deviation or variance (checking sdev or variance across each gene's probes)

0
Entering edit mode

Hi Kevin,

Thank you so much for your help so far. I just wanted to clear something up. If I was to perform a normalisation for quality controlled counts, which average type would be the best: geometric mean, median, average, minimum, maximum or third quartile?

1
Entering edit mode

I think that geometric mean or median are two of the main options for NanoString data (and geometric mean is even used during normalisation for RNA-seq). DESeq2 / EdgeR obviously take care of the normalisation for you, provided that you input the data as raw counts to these programs.

0
Entering edit mode

Would it be okay to round up the quality-controlled counts to integer value for the DESeq2 / EdgeR programs? As essentially, it does not include any outlier probes or data that can skew the overall analysis but the data is not normalised either.

Also, I am trying to use EnhancedVolcano package, is there a way I can remove the Log2FC and P value labels from the volcano plot?