Question

Including an incomplete data set in PCA and gene expression analysis

22 months ago

Sara ▴ 270

I am trying to run PCA on my transcriptomics data, but the experimental design was not perfect: the experiment was done in 2 runs. The 1st run has 22 records from 11 donors (before and after treatment) and the 2nd run has 16 records from 8 donors (again before and after treatment). The problem is that 3 of the donors in the 2nd run also have incomplete data in the 1st run (including them, the 1st run has 25 records). For those donors, the 1st run has only before-treatment data (the after-treatment experiment did not go well), while the 2nd run has complete data (before and after treatment). In other words, for 3 individuals we have 6 records in run 2 (complete set) and 3 records in run 1 (incomplete set).

Now my question is, since I am trying to analyze data from both runs together (I will correct for the batch effect), is it correct to use the incomplete data set from those donors in addition to the complete data sets?

RNA-seq • 702 views

updated 22 months ago by LauferVA 4.7k • written 22 months ago by Sara ▴ 270

score 0 · Answer 1 · 2023-06-27

This is actually not a problem; it's a good thing, and in some cases it is even worth recommending as good experimental design.

Having what should be identical samples in both runs (here, run 1 before treatment vs. run 2 before treatment) is helpful, because these samples can be used to get a good handle on whether batch effects exist and, if so, how severe they are.
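As a rough illustration of that idea, here is a minimal Python sketch that estimates a per-gene batch shift from the donors profiled before treatment in both runs. Everything in it is an assumption made for illustration: the donor IDs, the five genes, the simulated log2 values and the helper name `paired_batch_shift` are hypothetical, and a real analysis would more likely put batch (and donor) into the model of a dedicated RNA-seq tool rather than subtract means by hand.

```python
import random
from statistics import mean

random.seed(0)

N_GENES = 5
# Hypothetical IDs for the 3 donors with pre-treatment samples in both runs.
SHARED_DONORS = ["D1", "D2", "D3"]

# Simulated per-gene batch shift (log2 scale) that run 2 adds relative to run 1.
true_shift = [0.8, -0.5, 0.0, 1.2, -1.0]


def simulate_pre_sample(donor_baseline, shift=None, noise_sd=0.1):
    """Return simulated log2 expression for one pre-treatment sample."""
    shift = shift or [0.0] * N_GENES
    return [b + s + random.gauss(0, noise_sd)
            for b, s in zip(donor_baseline, shift)]


# Same donor baseline in both runs; only run 2 carries the batch shift.
run1_pre, run2_pre = {}, {}
for donor in SHARED_DONORS:
    baseline = [random.gauss(8, 2) for _ in range(N_GENES)]
    run1_pre[donor] = simulate_pre_sample(baseline)
    run2_pre[donor] = simulate_pre_sample(baseline, true_shift)


def paired_batch_shift(run_a, run_b):
    """Per-gene mean of (run_b - run_a) over donors present in both runs."""
    donors = sorted(set(run_a) & set(run_b))
    n_genes = len(run_a[donors[0]])
    return [mean(run_b[d][g] - run_a[d][g] for d in donors)
            for g in range(n_genes)]


estimated_shift = paired_batch_shift(run1_pre, run2_pre)
for g, (est, true) in enumerate(zip(estimated_shift, true_shift)):
    print(f"gene {g}: estimated shift {est:+.2f}, simulated shift {true:+.2f}")
```

Because each difference is taken within the same donor, donor-to-donor biology cancels out and what remains is batch plus noise. With only 3 shared donors the estimate is noisy, so it is better read as a sanity check on the batch correction than as the correction itself.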

Without these samples, you can try to learn something similar by comparing like samples (e.g. all pretreatment run 1 vs. all pretreatment run 2). With good technique (e.g. detection of latent variables and clever use of housekeeping genes) you can do pretty well, but you can never be totally certain whether differences detected between Run1Pre and Run2Pre are due to real biological differences between samples, or to batch...