Massive L2FC in DESeq2
1
1
Entering edit mode
19 months ago
Alex ▴ 20

Answered by Michael Love on bioconductor - Bioconductor link

Hi,

I am using DESeq2 v1.22.2 to test for changes in tumors over time pre and post treatment. However when I use the results function, I get log2FC on the order of +/- 30, which is of course unreasonably large. If I follow up with lfcshrink + ashr shrinkage, these barely change and if I use lfcshrink + normal shrinkage, these effects are shrunken to nearly 0. I understand "ashr" generally performs better than "normal" and that it preserves large LFC, but these LFC are unrealistically large with ashr. I have 2 questions:

  1. Is this expected behavior or does log2FC on this scale imply there is something fundamentally wrong with my design or samples?
  2. If this is expected behavior, should I go with "ashr" or "normal"? I am not using apeglm because I need the contrast argument.

More background:

Not all DEGs have such massive LFC - it seems to almost be bimodal with significant genes either having L2FC > 25 or < 5.

There is significant variability in tumor type (breast, colorectal, melanoma etc), but I am hoping this effect is captured by a paired design. Samples generally cluster on PCA based on their patientID, not time. I have at minimum 2 replicates per timepoint, but usually 3 or more. There is also significant variability in RNA quality, which I have binned into "good" or "bad". My design is thus:

design=~PatientID+time+rbin

I then test the effect of time using the contrast argument. I have noticed that the extent of these massive LFC genes is slightly affected by the threshold I use for RIN, but nothing seems to eliminate the effect entirely. I have tried removing rbin, modeling RIN as a continuous variable, removing all tumors with low RIN, SVA (with 1, 2, and 3 SVs), filtering samples based on tumor type, and removing the PatientID variable with no luck.

I apologize if this question has been asked before and am happy to provide more info as needed. Thanks!

-Alex

Edit - more info

colData:

PatientID disease time rbin
PatientA    A   d1   low
PatientA    A   d100 high
PatientA    A   d200 low
PatientB    A   d1   high
PatientB    A   d100 low
PatientC    A   d1   high
PatientC    A   d100 low
PatientC    A   d200 low
PatientD    B   d1   high
PatientD    B   d100 high
PatientD    B   d200 low
PatientE    A   d1   low
PatientE    A   d100 high
PatientE    A   d200 low
PatientF    C   d1   low
PatientF    C   d200 low
PatientG    D   d1   high
PatientG    D   d100 low
PatientG    D   d200 high

Code:

# Read in metadata
meta <- read.csv("meta.csv")

# Import
txi <- tximport(quantfiles,
                type = "salmon",
                tx2gene = tx2gene,
                importer = read_tsv,
                ignoreTxVersion=TRUE,
                countsFromAbundance="lengthScaledTPM")

# Create DESeqDataSet
dds <- DESeqDataSetFromTximport(txi = txi,
                                colData = meta,
                                design=~PatientID+rbin+time)
# Run DESeq
dds <- DESeq(dds)

# DGE
res <- list()
res$d100vd1_noShrink<-results(dds,contrast=c("time","d100","d1"))
res$d100vd1_normalShrink<-lfcShrink(dds,contrast=c("time","d100","d1"),type="normal")
res$d100vd1_ashrShrink<-lfcShrink(dds,contrast=c("time","d100","d1"),type="ashr")

PCA (color = disease, label = patient):

PCAim

MA plots with various shrinkages:

noShrink

norm-Shrink

ashr-Shrink

deseq2 ashr rna-seq • 1.1k views
ADD COMMENT
1
Entering edit mode

Hello Alex. If you want to investigate "time" you should put "time" at right side of design design=~Patient+rbin+time (I think Treatment maybe better in your case, with "before" and "after" 2 levels.). Second I don't think add "RIN" as a variable is good idea, if you prepared those samples library together. "PatientID" neither should be a variable, you can set "CancerType" variable refers to different cancer sample. Maybe design=~CancerType+Treatment

ADD REPLY
0
Entering edit mode

Thank you for your reply, MatthewP! I will move time/treatment to the right. The libraries were prepared together so I will take RIN out of the design. Could you elaborate a bit on why you think PatientID is a bad idea? I based that off the paired design section in the vignette. Here is another example of someone with a similar design using PatientID as a variable:

https://support.bioconductor.org/p/84241/

ADD REPLY
1
Entering edit mode

Before doing things like RIN modeling I would start with the basics. Does the PCA plot look reasonable and does the MA-plot indicate that normalization worked properly? Are the majority of genes centered at y=0 in a MA-plot? I agree with MatthewP that RIN is not a good variable to model, especially not on continuous scale as there is no guarantee for a linear dependency between RIN and gene expression. I personally would rather exclude bad samples rather than having them corrupting the analysis. Either the RNA is unsable or not (at least this is my imagination). Also agree that patient ID should probably not be in there. Unreasonable FCs can also be product of improper contrasts. Please show all code you used.

Can you also please show the colData, so a data frame that lists patient IDs, treatments, times etc... everything that characterizes the experiment, plus the above mentioned plots.

ADD REPLY
0
Entering edit mode

Thank you very much ATpoint! I have edited the post to include colData, code, and plots. I'd really appreciate any additional insight.

ADD REPLY
1
Entering edit mode

Hi Alex, I have been keeping a close eye on this post as I had a similar problem with inflated LFC values.

Using the apeglm package worked very well for me. I see on Bioconductor that you wanted to stick to using ashr as it accepts the contrast argument, but isnt using the coef argument with apeglm the same thing, you just have to specify the index of the contrast from resultsNames()?

ADD REPLY
1
Entering edit mode

Hi Barry,

Thanks for your comment. Yes you're absolutely right - I thought I couldn't because resultsNames() outputs contrasts only with respect to the reference level and there are a few non-reference contrasts I'd like to make. However I just came across this section from the DGE workflow that describes how I could get around it:

Finally, normal and ashr can be used with arbitrary specified contrast because normal shrinks multiple coefficients simultaneously (apeglm does not), and because ashr does not estimate a vector of coefficients but models estimated coefficients and their standard errors from upstream methods (here, DESeq2’s MLE). Although apeglm cannot be used with contrast, we note that many designs can be easily rearranged such that what was a contrast becomes its own coefficient. In this case, the dispersion does not have to be estimated again, as the designs are equivalent, up to the meaning of the coefficients. Instead, one need only run nbinomWaldTest to re-estimate MLE coefficients – these are necessary for apeglm – and then run lfcShrink specifying the coefficient of interest in resultsNames(dds).

I will try doing this and follow up!

ADD REPLY
2
Entering edit mode
19 months ago
ATpoint 55k

Answered by Mike Love (DESeq2/apeglm developer/senior author): https://support.bioconductor.org/p/130623/

ADD COMMENT
1
Entering edit mode

Ha! I came here to say exactly the same thing.

ADD REPLY
0
Entering edit mode

Thank you both! I will move the Bioconductor link to the top.

ADD REPLY

Login before adding your answer.

Traffic: 1635 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6