Losing a lot of significant genes after removing outliers. Having cell type composition as covariates
1
1
Entering edit mode
10 months ago
Sara ▴ 30

Hi all,

I am working with the RNA-seq data on humans (24patients-20controls). I used DESeq2 to find differentially expressed genes.

here is the code that I used:

It is corrected for cell-type composition (using cibersort and PCA on the estimated cell-type proportions)

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                  directory=folder,
                                  design=~Plate+RIN+Sex+Age+condition+PC2+PC1) #considering PC1,PC2 as covariates, to considering cell type composition as covariates


dds <- estimateSizeFactors(dds)

keep <- rowSums( counts(dds) >= 10 ) >= 20
dds <- dds[keep,]

colData(dds)$condition <- relevel(colData(dds)$condition, ref = "Control")

dds<- DESeq(dds) 

resultsNames(dds) 

resLFC <- lfcShrink(dds, coef="condition_patints_vs_Control", type="apeglm")

vsd <- vst(dds, blind = FALSE)

Here is my PCA plot:

data <- plotPCA(vsd, intgroup= c("condition","Sex"), returnData=TRUE)
percentVar<- round(100 *attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color= condition, shape=Sex))+ geom_point(size=3)+
  labs(x=paste0("PC1: ", percentVar[1], "% variance"),
       y=paste0("PC2: ", percentVar[2], "% variance"))

enter image description here

By considering the following cut-off:

res <- resLFC [abs(resLFC $ log2FoldChange) >= 1 & 
                        (resLFC $padj < 0.05),]

Before removing the outliers:

Number of upregulated genes is: 251

Number of downregulated genes is: 253

But after removing those three outliers in the PCA:

Number of upregulated genes is: 14

Number of downregulated genes is: 9

May you please guide me and give some explanation about whether is it normal to lose a lot of genes after removing outliers? should I keep those samples or remove them?

What is the explanation for that much difference between the number of differentially expressed genes? (those 3samples are coming from the same sites and they are patients)

ps: sample size after removing 3 outliers: 21patints- 20controls

I checked the alignment log files for that three samples. the overall alignment rate for them was above 90% like the rest of the samples. (I used HISAT2 for alignment and HTSeqcount for generating the count files)

Thank you in advance!

DESeq2 RNA-seq • 1.2k views
ADD COMMENT
0
Entering edit mode

Also, there is another thing:

when I don't consider PC1,PC2 as covariates (to don't consider cell type composition as covariates); it doesn't look that much difference between the number of significant genes.

(The cut-off is the same as before)

without considering PC1,PC2 as covariates:

Before removing the outliers - without PC1,PC2 as covariates:

Number of upregulated genes is: 392

Number of downregulated genes is: 254

But after removing those three outliers - without PC1,PC2 as covariates:

Number of upregulated genes is: 524

Number of downregulated genes is: 251

considering PC1,PC2 as covariates:

Before removing the outliers - considering PC1,PC2 as covariates:

Number of upregulated genes is: 345

Number of downregulated genes is: 242

But after removing those three outliers - considering PC1,PC2 as covariates:

Number of upregulated genes is: 14

Number of downregulated genes is: 9

It made me really confused. any ideas, please?

ADD REPLY
0
Entering edit mode

Hi, that may be important, I was also wondering about your formula. Could you explain why you included them? They are not experimental variables, so maybe they should not be included?

ADD REPLY
0
Entering edit mode

Do you mean about the PC1,2 in my design matrix?

dds <- DESeqDataSetFromHTSeqCount(sampleTable=sampleTable,
                                  directory=folder,
                                  design=~Plate+RIN+Sex+Age+condition+PC2+PC1)

Then, how do I have to adjust my differentially expressed genes for cell type proportions? How my design formula should be to consider PC1-2 as a covariate?

ADD REPLY
2
Entering edit mode
10 months ago

First, check the

  • sequencing depth of those 3 outliers
  • whether they have too many ribosome-specific reads or any such artifacts.

Basically the number of reads aligned to the genome should be similar to other samples.

All the outliers are female patients, Is your disease gender specific?

  • I suggest you pick randomly 3 patients (which are not outliers), and samples with the 3 outlier samples, and perform differential expression analysis.

  • After that, perform Gene ontology enrichment analysis on those differentially expressed genes.

  • Check whether you get any relevant GO terms.

If you don't find anything relevant, remove those samples.

ADD COMMENT
1
Entering edit mode

I agree. Also, running MultiQC over the output of all analysis steps (from raw data QC - counting) should help determine whether the outliers are due to technical variation/error.

ADD REPLY

Login before adding your answer.

Traffic: 2041 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6