Hi,
I tried to run a differential gene expression analysis with limma on data that resemble microarray data. They come from NanoString, and when I run limma I get logFC values in the hundreds, e.g. -500. I don't know if I am doing something wrong with the data set. I build an ExpressionSet from the normalised counts matrix and the metadata.
Should I be converting the normalised counts matrix into log-ratios or log-expression values and, if so, how would I go about doing this?
Thank you Kevin, this is the code I am using. I'm not sure if I am missing any steps.
I see, but how was the data originally normalised?
The file has normalisation method as THIRD_QUARTILE.
You should check the distribution of this data via a box-and-whiskers plot and histogram.
The normalised data?
I also recently got the initial raw counts for this data, which have around 5 probes for each gene; I have summed these together. Could I use this to perform a differential expression analysis and, if so, should I be using limma or edgeR? I also have information on which genes are control and which are endogenous, if that is useful.
It is important to first understand that NanoString is not like microarray. It is a count-based method that produces data that follows a negative binomial / Poisson-like distribution, just like bulk and single-cell RNA-seq.
When you originally used limma, the assumption would have been that your data was already normalised / transformed to follow a normal distribution, which is what limma expects. This is why I asked you to generate a histogram and box-and-whiskers [to check the distribution].
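To illustrate why untransformed values can produce a "logFC" in the hundreds: a linear model reports a difference of group means on whatever scale it is given. The following is a minimal sketch in plain Python (not the R code from this thread); the counts are made up and the pseudocount of 1 is an assumption.

```python
import math
from statistics import mean

# Hypothetical normalised counts for one gene in two groups (made-up values).
control = [1000.0, 1100.0, 900.0]
treated = [500.0, 520.0, 480.0]

def log2_expr(values, pseudocount=1.0):
    """log2-transform counts; the pseudocount (an assumption) avoids log2(0)."""
    return [math.log2(v + pseudocount) for v in values]

# Difference of means on the linear scale: what a linear model reports as
# "logFC" when it is handed untransformed counts.
linear_diff = mean(treated) - mean(control)

# Difference of means on the log2 scale: an actual log2 fold change.
log2_fc = mean(log2_expr(treated)) - mean(log2_expr(control))

print(f"linear-scale difference: {linear_diff:.1f}")
print(f"log2 fold change: {log2_fc:.2f}")
```

On the linear scale the halving of expression shows up as a difference of several hundred counts, whereas on the log2 scale it is close to -1.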
You can use DESeq2 and edgeR to process NanoString data, but not if your data have already been normalised to follow a normal distribution. edgeR and DESeq2 take raw count data that follow a negative binomial distribution, so you'd need the raw NanoString counts.
See my other answer: Make heatmap for RCC files in Rstudio(NanoStringNorm)
There are also answers on this topic on Bioconductor Support Forum.
Thank you Kevin, I was able to obtain a bit more background information on my data. It was generated with NanoString's RNA assay with next-generation sequencing readout.
My raw counts look like this:
I have summed up this file according to the target name to look like:
I think I'm going to leave the normalised data for now, as it looks like there may be some errors in how the data were normalised. Instead, I am thinking of using the raw count data. Would it be a good idea to run the above summed counts through edgeR or DESeq2? If so, are there any particular variables, like housekeeping genes or negative probe spike-ins, that I should include when running DESeq2 or edgeR?
Thank you in advance for your help:) I am very new to bioinformatics so every piece of help you can give is much appreciated!
The housekeeping genes and positive / negative controls should be defined in the accompanying annotation files that you should have received (?). The housekeepers can be used with DESeq2 via RUVSeq, as per the other post to which I linked above, and also here: http://supportupgrade.bioconductor.org/p/109778/#109779
I don't know why you had multiple probes targeting the same gene, and I'm not sure that summing these is the correct procedure. If they are all targeting the same transcript isoform, then the mean may make more sense (?). I would check this with the company / group / collaborator who did the experiment.
If all else fails, I would just use nSolver on Windows.
However, if I were to take the mean, I wouldn't be able to use DESeq2 or edgeR, right? Don't they both only take raw counts?
It really depends on what these values that map to the same gene are. I would like to understand that before deciding what to do next.
Taking a mean raw count is no major issue for edgeR or DESeq2. You would have to round the mean to an integer value, though.
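As a rough illustration of averaging probes per gene and rounding to integers, here is a minimal sketch in plain Python. The gene names, counts and the `mean_count_per_gene` helper are hypothetical, and rounding halves up is an assumption rather than anything prescribed by edgeR or DESeq2.

```python
from collections import defaultdict

# Hypothetical raw probe counts as (gene, probe_count) pairs; made-up values.
probe_counts = [
    ("GENE_A", 12), ("GENE_A", 15), ("GENE_A", 14),
    ("GENE_B", 3), ("GENE_B", 4),
]

def mean_count_per_gene(rows):
    """Average the probe counts of each gene and round to the nearest integer."""
    by_gene = defaultdict(list)
    for gene, count in rows:
        by_gene[gene].append(count)
    # int(x + 0.5) rounds halves up for non-negative values; Python's built-in
    # round() would round halves to the nearest even integer instead.
    return {g: int(sum(c) / len(c) + 0.5) for g, c in by_gene.items()}

gene_counts = mean_count_per_gene(probe_counts)
print(gene_counts)
```

The resulting integer table has one row per gene, which is the shape a count-based tool expects as input.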
I did some digging into what the values are and, according to a webinar by NanoString on this particular experiment type, they are:
"Multiple probes targeting the same transcript are molecularly barcoded and quantitated such that there are up to 10 independent counts per transcript which allows robust quantification so that during the analysis if there are outlier probes can be identified and filtered out."
I've also added the link to the webinar at the time point where the speaker was explaining this.
https://youtu.be/wHiMR5ok8kE?t=213
But it seems they average the probes, so if that is the case then, as you said, the approach would be to run edgeR or DESeq2 on the means rounded to integer values?
Indeed, I suppose so; however, you could find a way to detect outlier probes via a simple metric like the standard deviation or variance across each gene's probes.
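One way to check the spread across each gene's probes could look like the sketch below, in plain Python. The counts are made up, and the coefficient of variation and its cut-off of 0.5 are illustrative assumptions, not a NanoString recommendation.

```python
from statistics import mean, stdev

# Hypothetical probe counts per gene for one sample (made-up values).
probes = {
    "GENE_A": [100, 105, 98, 102, 101],
    "GENE_B": [50, 52, 49, 300, 51],  # one probe disagrees with the rest
}

CV_THRESHOLD = 0.5  # arbitrary cut-off chosen for illustration only

def probe_spread(counts):
    """Return (sample standard deviation, coefficient of variation) of a gene's probes."""
    sd = stdev(counts)
    m = mean(counts)
    cv = sd / m if m > 0 else float("nan")
    return sd, cv

flagged = []
for gene, counts in probes.items():
    sd, cv = probe_spread(counts)
    if cv > CV_THRESHOLD:
        flagged.append(gene)
    print(f"{gene}: sd={sd:.1f}, cv={cv:.2f}")

print("genes with inconsistent probes:", flagged)
```

This only flags genes whose probes disagree. Deciding which individual probe to drop needs a further rule, because with about 5 probes per gene a single probe can never lie more than roughly 1.8 sample standard deviations from the mean.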
Hi Kevin,
Thank you so much for your help so far. I just wanted to clear something up. If I were to normalise the quality-controlled counts, which type of average would be best: geometric mean, median, mean, minimum, maximum or third quartile?
I think that geometric mean and median are two of the main options for NanoString data (and the geometric mean is even used during normalisation for RNA-seq). DESeq2 / edgeR take care of the normalisation for you, provided that you give these programs raw counts.
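For context on how the geometric mean enters RNA-seq normalisation: DESeq2's default size factors follow a median-of-ratios scheme, in which each count is divided by its gene's geometric mean across samples and the per-sample median of those ratios is taken. The sketch below shows that idea in plain Python with made-up counts; it is not DESeq2's actual code, and the function names are hypothetical.

```python
import math
from statistics import median

# Hypothetical raw counts: keys are genes, list entries are samples (made-up values).
counts = {
    "GENE_A": [10, 20],
    "GENE_B": [20, 40],
    "GENE_C": [30, 60],
}

def geometric_mean(values):
    """Geometric mean via logs; only valid when every value is > 0."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

def size_factors(count_table):
    """Median-of-ratios size factor per sample (a sketch of the idea only)."""
    n_samples = len(next(iter(count_table.values())))
    ratios = [[] for _ in range(n_samples)]
    for row in count_table.values():
        if min(row) <= 0:
            continue  # skip genes with a zero count: their geometric mean is 0
        reference = geometric_mean(row)
        for j, value in enumerate(row):
            ratios[j].append(value / reference)
    return [median(r) for r in ratios]

factors = size_factors(counts)
print(factors)
```

Here the second sample has exactly twice the counts of the first, so its size factor comes out twice as large; dividing each sample's counts by its factor would put the two samples on a common scale.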
Would it be okay to round the quality-controlled counts to integer values for DESeq2 / edgeR? They contain no outlier probes or other data that could skew the overall analysis, but they are not normalised either.
Also, I am trying to use the EnhancedVolcano package. Is there a way to remove the Log2FC and P value labels from the volcano plot?