Question: Detecting outlier DESeq2
2
15 months ago by
maria2019130
maria2019130 wrote:

I have RNA-seq data for 20 samples with 2 condition and 2 sex (male, female : control, treatment). I am very new to RNA-seq analysis and am trying to find the DEGs using DESeq2. Since I want to have the normalization to be calculated based on all samples, I get the rld based on all and then I will use contrast to find the DEG for 1. tratment vs. control male 2. tratment vs. control female.

For the QC, the PCA plot separates the male and female but the control and treatment is not separated very well on the PC2 for male C1 (Figure 1)

Then I decided plot PCA only for male samples. again PC1 does not separate the control and treatment male samples but PC2 is kinda separating them. (Figure2)

My questions are:

1. What numbers on the PCA plot (x and y axis) decide the separation? I read in a website that samples that are at PC1>0 are outlier. is that true?

2. Can I just look at figure 1 and remove male c1 and continue DEG analysis with 9 samples? or I should definately plot figure 2 ?

3. If I need to consider Figure 2, can I rely on only the PC2 which separates control and treatment and continue the DEG analysis or I should remove C1, tr4, tr5 samples and then work on DEG analysis based on remaining 7 samples?

bulk rna-seq deg pca deseq2 • 909 views
modified 15 months ago by Kevin Blighe69k • written 15 months ago by maria2019130
3
15 months ago by
Kevin Blighe69k
Republic of Ireland
Kevin Blighe69k wrote:

Hey,

Regarding the website to which you are referring, that was written by Wolfgang Huber. He is not saying, in his tutorial, that PC1>0 is a global definition of an outlier from a PCA bi-plot. It is just in his example that he has decided to select 2 outlier samples via a vertical cut-off line drawn at PC1>0. In another experiment, the cut-off for outliers may be at PC1 < -500, or PC1 > 42. The units on these axes are, well, essentially unitless.

Looking at your first plot, you cannot really remove male sample C1 just because it groups away from the main group of samples. You should first investigate why it may be grouping in the way that it is [grouping]. Is it a sample mix-up?; is it a male with Klinefelter Syndrome?; is the clinical information for this sample incorrect?

For your third question, you need to consider the fact that `sex` represents a substantial amount of variation in your dataset, i.e., it is greater than the very condition of interest that you are studying - was this expected?; did you include `sex` in your design formula as a covariate? If you include it as a covariate in your design formula, normalise your data again, and then run rld function with `blind=FALSE`, do you see the same PCA bi-plot?

You can also segregate your dataset based on `male` | `female` and then cross-compare results in a sort of meta-analysis, if you wish. `sex` is a clear and major source of variation among your samples. If you proceed with this route (sample segregation based on `sex`), then I see no reason to exclude c1.

Kevin

Hi Kevin,

Thanks for your response. I believe it is not a sample mix up (but I am gonna check it again). I do not know it is a male with Klinefelter Syndrome. How can I check it?

For the normalization, I actually made a 3rd column in my sample table and grouped together the sex and condition.

``````column group of sample table: control_male, treatment_male, control_female, treatment_female
``````

then the dds object as:

``````dds <- DESeqDataSetFromHTSeqCount( sampleTable = sampleTable, directory = directory, design= ~group)

rld <- rlog(dds, blind=TRUE)

data <- plotPCA(rld, intgroup=c("treatment", "sex"), returnData=TRUE )
percentVar <- round(100 * attr(data, "percentVar"))
ggplot(data, aes(PC1, PC2, color=treatment, shape=sex)) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
+coord_fixed()
``````

Then I used contrasts for further analysis:

`````` contrast_male <- c("group", "treatment_male", "control_male")
contrast_female <- c("group", "treatment_female", "control_female")

res_male_unshrunken <- results(dds, contrast = contrast_male, alpha = 0.05)
res_male <- lfcShrink(dds, contrast = contrast_male, res=res_male_unshrunken)
summary_res_male <- summary(res_male, alpha=0.05)
``````

The result of the pca plot is from the above code. I think making the "group" column is including the sex in the design and is similar to ~sex+condition design. is that right? is the code above what you meant by including sex?

By segregating male | female, do you mean to run the analysis for male and female samples separately due to the major difference between the male/female samples?

Ah, I see... `sex` is already included in `group`. When you look at the results of the differential expression comparison, do they make sense to you?

Out of curiosity, take a look at the output of this:

``````rld <- rlog(dds, blind = FALSE)
plotPCA(rld, intgroup=c("treatment", "sex"))
``````

The results actually make sense with the 10 samples but when I remove C1, I get 2times more DEG genes.

Changing to Blind=FALSE, I get the following PCA plot:

C1,tr4,tr5 are kinda different compared to other samples.

1

Thanks - in that case, it just seems like a genuine result with regard to that sample (C1), unless there was indeed a mix-up. As I don't know your area of study, I cannot comment much further; however, occasionally, controls are not quite what they seem!

There is no right or wrong answer here. You can choose to leave it in your dataset or exclude it. Either way, you have to record what you do.

1

Thank you Kevin, I checked and it is not a sample mix up. I will continue with further analysis and see what makes sense more based on the results.