I have a proteomics dataset with 36 healthy and 36 diseased samples, analyzed in 8 batches. I have the data matrix with both normalized and non-normalized values. The dataset for the final analysis was filtered to retain rows with 70% valid values in each group, so there is still missingness in my data (501 rows). I am using BatchQC and following the steps from its example on a real protein expression dataset to correct the batch effect, but it only considered the rows that had all expression values.
I have the following concerns:
- Should I use normalized or non-normalized values, and should they be log2 transformed?
- How should I handle the missing values? (I don't want to impute.) I tried replacing missing values with zero, but it did not help: BatchQC only corrected the rows that had all expression values.
- Do I need to account for biological variables such as age, cancer stage, and marker values?
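
To make the source of the missingness concrete, here is a minimal Python sketch of a per-group valid-value filter. It is only an illustration, not the code used on the dataset: the function names, the toy intensities, and the reading of the threshold as "at least 70% valid in every group" are all assumptions.

```python
import math

def valid_fraction(values):
    """Fraction of entries that are not missing (None or NaN)."""
    present = [v for v in values if v is not None and not math.isnan(v)]
    return len(present) / len(values)

def passes_filter(row, groups, min_valid=0.7):
    """Keep a protein only if every group reaches min_valid valid values.

    Assumption: the threshold is applied as 'at least min_valid' per group.
    """
    for label in set(groups):
        group_values = [v for v, g in zip(row, groups) if g == label]
        if valid_fraction(group_values) < min_valid:
            return False
    return True

# Toy example with hypothetical intensities: 4 healthy and 4 diseased samples.
groups = ["healthy"] * 4 + ["diseased"] * 4
nan = float("nan")
protein_a = [10.0, 12.0, nan, 11.0, 9.0, 8.0, 9.5, 10.5]  # 3/4 and 4/4 valid
protein_b = [10.0, nan, nan, 11.0, 9.0, 8.0, 9.5, 10.5]   # 2/4 and 4/4 valid

kept_a = passes_filter(protein_a, groups)  # passes, but still has a missing value
kept_b = passes_filter(protein_b, groups)  # dropped: healthy group is 50% valid
```

A row like `protein_a` passes the filter yet still contains a missing value, which is the kind of row that was skipped during batch correction.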
Thank you in advance, Santosh