Question

Suggestions for avoiding False Positives while identifying differentially expressed genes with more than 3 samples per group

1

Entering edit mode

4.7 years ago

hd24.bt ▴ 10

Hi,

I have RNA seq data with the samples defined as follows:

ID level    stability  type
A1    med     unstable    exp
A2    med     unstable    exp
A3    med     unstable    exp
B1    med       stable    exp
B2    med       stable    exp
B3    med       stable    exp
C1    low       stable    exp
C2    low       stable    exp
C3    low       stable    exp
D1    low     unstable    exp
D2    low     unstable    exp
D3    low     unstable    exp
E1   high       stable    exp
E2   high       stable    exp
E3   high       stable    exp
F1   high     unstable    exp
F2   high     unstable    exp
F3   high     unstable    exp
G1   host         host  host1
G2   host         host  host1
G3   host         host  host1
H1   host         host  host2
H2   host         host  host2
H3   host         host  host2
I1   host         host  host3
I2   host         host  host3
I3   host         host  host3

ID column here describes the samples in 3 biological replicates. My objective is to identify DEGs while comparing different samples in pair (A vs B, B vs C, A vs D etc) as well as in group-wise comparison (stable vs unstable, host vs expressed). But when I define a contrast such as "contrast = c("stability","unstable","stable")", I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well.

As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG. However, I would like to know if there is some parameter that can be introduced to say that more than 3 samples in a group of 9 (may be at least 6 samples in our case) are required to have concordance to be reported as DEG.

If not, then can someone kindly suggest some other way to do avoid getting DEGs not representative of entire group but just 3 samples of a big group.

Thanks !!

RNA-Seq • 1.4k views

ADD COMMENT • link updated 4.7 years ago by Kristoffer Vitting-Seerup ★ 4.0k • written 4.7 years ago by hd24.bt ▴ 10

0

Entering edit mode

I get some DEGs with just one of the replicates from 3 different stable or unstable samples being high or low in comparison to others as well

How did you check that?

ADD REPLY • link 4.7 years ago by ATpoint 82k

0

Entering edit mode

When I plot heatmap for significantly differentially expressed genes like this:

topVarGenes <- rownames(sig_de_results)
hmcol <- brewer.pal(11,'RdBu')
nCounts <- counts(DDS_batch , normalized=TRUE)
heatmap(as.matrix(nCounts[topVarGenes, ]), Rowv = NA,Colv = NA, col = hmcol, mar = c(8,2), keep.dendro = FALSE)

ADD REPLY • link 4.7 years ago by hd24.bt ▴ 10

0

Entering edit mode

Plotting heatmaps on non-log2 transformed counts is not recommended as high counts will dominate the scaling. Do at least log2(x+1), better z-score the data. This also appears to be an eyeballing method, therefore not recommended as well as not reproducible as it lacks exact cutoffs. When you perform DESeq2 (or other tools) analysis, the significant genes are representative for the group as the group mean is significantly different than the group mean of the other group respecting the dispersion of the data. I think you worry too much. I would perform standard analysis and then take the genes significant at e.g. 5% or 1% FDR. These tools have elaborate ways to ensure that single outliers do not skew the results especially at these low sample sizes.

ADD REPLY • link 4.7 years ago by ATpoint 82k

0

Entering edit mode

I tried plotting z-scores after log normalisation as well. But I still get some such results. However, may be I will trust these as it is and go ahead with same result.

Thanks for your response.

ADD REPLY • link 4.7 years ago by hd24.bt ▴ 10

0

Entering edit mode

Do I understand correctely that you are doing stable vs unstable, but when you look at the DEGs you see genes that are, for example, consistently high in B1, B2 and B3, but not in C1, C2, C3,E1,E2,E3?

ADD REPLY • link 4.7 years ago by i.sudbery 19k

0

Entering edit mode

As I understand that's because when the program identifies a gene in at least 3 samples within a group of the same profile (irrespective of being just 1 of the replicates of 3 different samples) with significant difference from the rest, it reports it as a DEG.

This is not how DESeq(2)/limma/edgeR work. They will take all the samples that fall within one of your groups (say stable) and calculate the mean expression level. They then calculate a variance on that mean (this where the DESeq/edgeR/limma magic happens). They then ask if the mean of your two categories is sufficiently different compared to the variance using a negative binomial model.

ADD REPLY • link 4.7 years ago by i.sudbery 19k

score 0 · Answer 1 · 2019-08-13

First of all, it is not advisable to check for DEG between the replicates themselves, unless you have a very good reason for it (you know one of them were treated differently, etc.). I would also advice that you limit your efforts to comparisons between the categories, as they will provide more meaningful data.

I would advice you to do a PCA analysis against any of the conditions, as this doesn't matter with the PCA anyways. The PCA should only focus on the top genes. I usually choose the top genes with a p value less than 0.1, so it is not very strict. If on this PCA you see one of the replicates being clearly an outliner, you can try to remove it and redo the PCA and check again for clear outliners. Keep in mind the percentages of variability of each axis.

Now answering to your question, in DESeq2 and in most other programs, you will get the fold change as part of the output. I usually filter away fold changes lower than 1.5, so unless the gene is significantly expressed, it will not appear in my results. The column for that in DESeq2 is log2FoldChange, so you need to filter you the ones lower or higher than log2(1.5).

If you need, I can provide some sample code made with DESeq2.

score 0 · Answer 2 · 2019-08-15

I like restonpiston's PCA on top genes - it's kinda similar to edgeR::plotMDS() which can be interpreted as a PCA plot of the log2FC of each sample against all the others. An alternative is to try an use limmas weighted voom which automatically detect outlier samples and downweight their importance. You can read more about that in this paper or page 72 of the limma vignette.

With regards to filtering out log log2FC most tools ( edgeR/DESeq2/limma) can actually directly test for log2FC larger than a cutoff (so you test for genes with a fold change larger than X instead of filtering afterwards - the results are more trustworthy since there is some uncertainty in the log2FC estimate which could cause the post-test filtering to give wrong results). In DESeq2 this is done with the "lfcThreshold" paramter as described here and in edgeR it is done via glmTreat() as described on page 24 of the vignette.

Wirth regards to robust DE edgeR's quasi-likelihood (QL) approach should also be quite good - especially if the "fit" and "test" are run with robust = TRUE. Both methods are specifically designed to handle outlier.

Good luck

Kristoffer