First of all, I apologize for a long post . I am analyzing miRNA sequencing data generated from human samples. I have 10 paired samples - control and treated. These samples were sequenced in two batches - 4 pairs in one batch and 6 pairs in another batch. The objective is to identify differentially expressed miRNAs between control and treated groups. The experimental design is as follows:
Sample_ID Condition Patient Batch IonXpress_RNA_001 Healthy A 1 IonXpress_RNA_002 Treated A 1 IonXpress_RNA_003 Healthy B 1 IonXpress_RNA_004 Treated B 1 IonXpress_RNA_005 Healthy C 1 IonXpress_RNA_006 Treated C 1 IonXpress_RNA_007 Healthy D 1 IonXpress_RNA_008 Treated D 1 IonXpress_RNA_011 Healthy E 2 IonXpress_RNA_012 Treated E 2 IonXpress_RNA_017 Healthy F 2 IonXpress_RNA_018 Treated F 2 IonXpress_RNA_021 Healthy G 2 IonXpress_RNA_022 Treated G 2 IonXpress_RNA_023 Healthy H 2 IonXpress_RNA_024 Treated H 2 IonXpress_RNA_025 Healthy I 2 IonXpress_RNA_026 Treated I 2 IonXpress_RNA_027 Healthy J 2 IonXpress_RNA_028 Treated J 2
I tried to analyze the data using DESeq2. I checked to see if the samples cluster based on batch and yes, the PCA plot clearly showed two distinct clusters - batch 1 and batch 2. I used variance stabilized counts for this purpose. Then, I removed batch effects using
assay(vsd)<-limma::removeBatchEffect(assay(vsd), vsd$Batch) . PCA plot now showed as a single cluster and not as two distinct clusters.
To identify differentially expressed miRNAs, I used
<-DESeqDataSetFromMatrix(countData=cts, colData=coldata, design=~Patient + Condition). Since this is a paired sample analysis, I used the variable 'patient' in my design. I did get a list of differentially exprssed miRNAs. However, when I tried to include "batch" in the design, I am getting the following error message -
invalid class “DESeqDataSet” object: the model matrix is not full rank, i.e. one or more variables in the design formula are linear combinations of the others
I am trying to figure out the reason for this error - is this because the conditions (healthy and treated) are overlapping between both the batches? Is there a way that I can rectify this ? Should I change my design ? If I do not correct for batch effects or remove the batch effects, will it drastically change my output? I understand that batch effects can lead to spurious results. However, I am asking this because ~ 75% of the differentially expressed miRNAs obtained from my analysis have got validated in RT-PCR. I understand that the input data has to be raw counts for DESeq2 but is there a way that I can do a differential expression analysis for batch corrected dataset and compare the results with the uncorrected dataset? This is for learning purpose.
Thank you very much Best regards, Preethi