4 months ago

I'm a statistician who's developed a method for batch effect correction, particularly for cases where the batch variable itself is confounded with the target variable. For example, suppose you have two batches of gene expression data, and you want to predict age from gene expression. So you have the gene expression variables, a batch variable, and an age variable. Besides the batch variable's effect on gene expression, my method addresses the confounding that arises when the two batches have different age distributions (e.g., batch 1 might skew younger than batch 2).
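To make the setup concrete, here is a toy simulation of the confounding problem (not my method). Everything in it is an assumption made up for illustration: a single hypothetical gene whose expression rises linearly with age, an additive batch shift, and two batches with different mean ages. It shows that naively centering each batch separately removes the batch shift but also wipes out the age-driven difference between batches.

```python
import random
import statistics

SLOPE = 0.5  # assumed effect of age on expression (illustrative value)

def simulate_batch(n, age_mean, batch_shift, rng, age_sd=8.0, noise_sd=1.0):
    """Simulate ages and one gene's expression for a single batch."""
    ages = [rng.gauss(age_mean, age_sd) for _ in range(n)]
    expr = [SLOPE * a + batch_shift + rng.gauss(0.0, noise_sd) for a in ages]
    return ages, expr

def center(values):
    """Subtract the mean, i.e. the naive per-batch correction."""
    m = statistics.fmean(values)
    return [v - m for v in values]

rng = random.Random(0)
# Batch 1 skews younger and has no technical shift;
# batch 2 skews older and has a technical shift of 5 units.
ages1, expr1 = simulate_batch(500, age_mean=30.0, batch_shift=0.0, rng=rng)
ages2, expr2 = simulate_batch(500, age_mean=60.0, batch_shift=5.0, rng=rng)

# The raw gap between batch means mixes biology (age) with the batch shift.
raw_gap = statistics.fmean(expr2) - statistics.fmean(expr1)
age_part = SLOPE * (statistics.fmean(ages2) - statistics.fmean(ages1))
batch_part = raw_gap - age_part  # only recoverable here because ages are known

# Per-batch centering forces the gap to zero, discarding the age signal too.
corrected_gap = statistics.fmean(center(expr2)) - statistics.fmean(center(expr1))

print(f"raw gap between batches:   {raw_gap:.2f}")
print(f"  part explained by age:   {age_part:.2f}")
print(f"  part due to batch shift: {batch_part:.2f}")
print(f"gap after naive centering: {corrected_gap:.2f}")
```

In this toy example most of the raw gap comes from age rather than from the batch shift, so a correction that treats the whole gap as a batch effect throws away signal that a downstream age predictor would need.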

Current approaches tend to assume that the batch variable and target variable are independent. They also tend to assume that all confounding variables are available. My method, by contrast, also works when the target variable isn't available for the new samples you want to apply your predictor to. (For example, you may have samples from batch 2 with unknown ages, which you want to predict, while still fitting your predictor on batch-corrected combined data from batch 1 and batch 2.)

What are some good publicly available datasets for me to try my method on? I have a background in working with gene expression data, so I'd be especially interested in such datasets. Alternatively, if you think my method would be useful for your own non-public data and you'd like to collaborate, please feel free to reach out!

