I have a proteomics dataset in which 36 healthy and 36 diseased samples were analyzed in 8 batches. I have the data matrix with both normalized and non-normalized values. The dataset for the final analysis was filtered to retain proteins with at least 70% valid values in each group, so my data still contain missing values (501 rows). I am using BatchQC and following the steps from its example on a real protein expression dataset to correct the batch effect, but it only considered the rows that had all expression values.
I have the following concerns:
Should I use normalized or non-normalized values, and should they be log2-transformed?
How should I handle the missing values? (I don't want to impute.) I tried replacing missing values with zero, but it did not help. BatchQC only corrected the rows that had all expression values.
Do I have to account for biological variables such as age, cancer stage and marker values?
Should I use normalized or non-normalized values, and should they be log2-transformed?
That depends on the batch-correction method; check the relevant documentation. If using limma::removeBatchEffect(), please use log2-transformed values.
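As a minimal sketch of the log2 step (written in Python purely for illustration; the function name `log2_keep_missing` and the toy values are hypothetical, and missing values are assumed to be encoded as `None`), missing entries can simply be passed through. Zeros are also treated as missing here, which is an assumption, because log2(0) is undefined:

```python
import math

def log2_keep_missing(matrix):
    """Log2-transform intensities; None or non-positive values stay missing (None)."""
    out = []
    for row in matrix:
        out.append([
            math.log2(v) if v is not None and v > 0 else None
            for v in row
        ])
    return out

# Toy protein-by-sample matrix (hypothetical values, not from the real dataset).
intensities = [
    [1024.0, 2048.0, None],
    [8.0, 0.0, 16.0],
]
logged = log2_keep_missing(intensities)
print(logged)
```

This also shows why replacing missing values with zero before a log2 transform is problematic: the zeros cannot be transformed and end up missing again.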
How should I handle the missing values? (I don't want to impute). I
tried replacing missing values with zero but it did not help. The
BatchQC only corrected the rows having all the expression values.
Can you define "did not help"? There are other things to try:
impute as half of the lowest non-zero value in the dataset
impute as the protein-wise median (each missing value is replaced by the median of that protein's observed values), if using univariate statistical tests
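The two options above can be sketched as follows. This is an illustration under stated assumptions, not any package's implementation: the function names and toy values are hypothetical, rows are proteins, columns are samples, missing values are `None`, and the half-minimum variant assumes positive (non-log) intensities.

```python
import statistics

def impute_half_min(matrix):
    """Replace missing values with half of the smallest positive value in the whole matrix."""
    observed = [v for row in matrix for v in row if v is not None and v > 0]
    fill = min(observed) / 2
    return [[fill if v is None else v for v in row] for row in matrix]

def impute_protein_median(matrix):
    """Replace missing values in each row (protein) with the median of that row's observed values."""
    result = []
    for row in matrix:
        observed = [v for v in row if v is not None]
        # A protein with no observed values at all cannot be imputed this way.
        fill = statistics.median(observed) if observed else None
        result.append([fill if v is None else v for v in row])
    return result

# Toy matrix: 2 proteins x 4 samples (hypothetical values).
data = [
    [4.0, None, 6.0, 10.0],
    [2.0, 3.0, None, 5.0],
]
half_min = impute_half_min(data)
by_median = impute_protein_median(data)
print(half_min)
print(by_median)
```

In the toy matrix, the half-minimum fill is one value for the whole dataset (1.0), whereas the median fill differs per protein (6.0 and 3.0).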
Do I have to worry about taking into account the biological
variables such as age, cancer stage and marker values?
This is for you to decide after analysing the data. For example, check these variables via ANOVA or the Kruskal-Wallis test (non-parametric ANOVA), or via PCA.
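To make the Kruskal-Wallis suggestion concrete, here is a small sketch that computes the H statistic for a continuous variable such as age across groups. The grouping by batch, the function names, and the ages are all assumptions for illustration; the same function would work with condition as the grouping. In practice an existing implementation such as R's `kruskal.test()` or `scipy.stats.kruskal` would also give the p-value.

```python
from collections import Counter, defaultdict

def average_ranks(values):
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_h(values, groups):
    """Kruskal-Wallis H statistic with the standard correction for ties."""
    n = len(values)
    ranks = average_ranks(values)
    rank_sums = defaultdict(float)
    counts = defaultdict(int)
    for r, g in zip(ranks, groups):
        rank_sums[g] += r
        counts[g] += 1
    h = 12 / (n * (n + 1)) * sum(rank_sums[g] ** 2 / counts[g] for g in counts) - 3 * (n + 1)
    ties = sum(t ** 3 - t for t in Counter(values).values())
    correction = 1 - ties / (n ** 3 - n)
    if correction == 0:
        raise ValueError("all values are identical; H is undefined")
    return h / correction

# Hypothetical ages for 9 samples spread over 3 batches.
ages = [41, 45, 50, 52, 55, 58, 60, 63, 70]
batch = ["A", "A", "A", "B", "B", "B", "C", "C", "C"]
h_stat = kruskal_wallis_h(ages, batch)
print(round(h_stat, 3))
```

H is compared against a chi-square distribution with (number of groups - 1) degrees of freedom. In this toy example the ages are perfectly ordered by batch, so H is large (about 7.2, above the 5% critical value of 5.991 for 2 degrees of freedom), which would suggest that age and batch are entangled.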
Hi Kevin, Thank you for the quick response.
I did try limma::removeBatchEffect(), and the batch effect was still visible when I checked the PCA. For the missing values, I replaced them with half of the lowest non-zero value in the dataset and then ran ComBat, but it did not give good separation of conditions in the PCA. I did not understand the median-wise imputation.
For ComBat, I took the normalized intensities and converted them to log2 before correction. I also tried ComBat-seq on the complete expression data (rows with no missing values; raw, non-log-transformed data), and it did not help me get rid of the batch effect.
Thanks, Santosh
Hi, no, ComBat-seq is just for applying a correction to bulk RNA-seq raw counts. ComBat (the original) is for correcting log2-transformed data, or other data that follows a normal distribution.
Why do you expect separation of conditions on a PC bi-plot? There does not always have to be separation.
I cannot see what you are looking at, so, cannot comment further. Be wary that, if your experimental groups are unbalanced, or if batch confounds condition, then there is minimal that you can do with the data. Perhaps you can share a table that illustrates the relationship between condition and batch?
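A condition-by-batch table of the kind requested here can be sketched as below. The function names and the sample annotation are hypothetical; the confounding check only flags the extreme case in which every batch contains a single condition.

```python
from collections import Counter

def condition_by_batch(conditions, batches):
    """Count samples for every (condition, batch) combination."""
    counts = Counter(zip(conditions, batches))
    cond_levels = sorted(set(conditions))
    batch_levels = sorted(set(batches))
    return {c: {b: counts[(c, b)] for b in batch_levels} for c in cond_levels}

def fully_confounded(table):
    """True if every batch contains samples from at most one condition."""
    batch_levels = list(next(iter(table.values())).keys())
    return all(
        sum(1 for c in table if table[c][b] > 0) <= 1
        for b in batch_levels
    )

# Hypothetical annotation for 8 samples in 2 batches.
conditions = ["healthy", "healthy", "diseased", "diseased",
              "healthy", "diseased", "healthy", "diseased"]
balanced_batches = [1, 1, 1, 1, 2, 2, 2, 2]
confounded_batches = [1, 1, 2, 2, 1, 2, 1, 2]

balanced = condition_by_batch(conditions, balanced_batches)
confounded = condition_by_batch(conditions, confounded_batches)
print(balanced, fully_confounded(balanced))
print(confounded, fully_confounded(confounded))
```

In the balanced layout each batch holds two healthy and two diseased samples, so a batch term can be separated from the condition effect. In the confounded layout all healthy samples sit in batch 1 and all diseased samples in batch 2, so no correction method can tell batch and condition apart.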
Hi Kevin,
Thank you for the comments. Please refer to the related tables in the data linked here.
Thanks, Santosh