Hello,
I have about 100 paired-end exome sequencing samples, each with its own aligned BAM file. I would like to group them by similarity, or by some other distance metric, using a clustering algorithm. The goal is to reduce the batch effect in downstream structural variant analysis if we were to analyze all samples together. The only plausible way I can see to reduce this batch effect is to divide the samples into groups.
Samples within each group should be highly correlated with one another.
Which parameter from the BAM files should I use to group these samples, say, by k-means clustering?
Any suggestions would be highly appreciated.
What is the "downstream structural variant analysis" you are going to perform? What batch effect do you expect will confound that analysis? What is the biological question you are trying to answer with your analysis? Any details you have will be helpful in figuring out what you want to do.
If your goal is to do structural variant analysis on exome data, I think you first have to come up with a strategy for that. I think it is difficult, if not impossible. CNV and LOH analysis is doable but depends on a control sample (you can also do it without one, but having a control is superior).
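On the grouping question itself, one quantity that could be tried is the per-target read depth of each sample, since capture and library batch effects often show up there. Below is a minimal sketch of that idea, not a validated method. It assumes you have already extracted one depth value per exome target per sample (for example with a tool such as `samtools bedcov` or `mosdepth`); the toy numbers, the sample layout, the pseudocount, and the choice of k = 2 are all made up for illustration. It normalizes each sample to a centered log2 profile, reports Pearson correlation between profiles, and runs a small k-means with deterministic farthest-point initialization.

```python
import math

def normalize(depths):
    """Turn raw per-target depths into a centered log2 profile."""
    total = sum(depths)
    # Scale by library size, then log2 with a pseudocount of 1 (an assumption).
    logs = [math.log2(d / total * 1e6 + 1.0) for d in depths]
    mean = sum(logs) / len(logs)
    return [x - mean for x in logs]

def pearson(a, b):
    """Pearson correlation between two equal-length profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)

def sqdist(a, b):
    """Squared Euclidean distance."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(profiles, k, max_iter=50):
    """Plain k-means; centers start from a farthest-point pass (deterministic)."""
    centers = [profiles[0]]
    while len(centers) < k:
        # Pick the profile farthest from its nearest existing center.
        centers.append(max(profiles,
                           key=lambda p: min(sqdist(p, c) for c in centers)))
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(k), key=lambda c: sqdist(p, centers[c]))
                      for p in profiles]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(profiles, labels) if l == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Hypothetical depths: 6 samples x 4 targets, two invented capture batches.
raw_depths = [
    [100, 200, 50, 400],
    [110, 190, 55, 410],
    [95, 210, 48, 390],
    [300, 60, 250, 80],
    [310, 55, 240, 90],
    [290, 65, 260, 75],
]
profiles = [normalize(d) for d in raw_depths]
labels = kmeans(profiles, k=2)
print("cluster labels:", labels)
print("r(sample 1, sample 2):", round(pearson(profiles[0], profiles[1]), 3))
print("r(sample 1, sample 4):", round(pearson(profiles[0], profiles[3]), 3))
```

With real data you would have thousands of targets per sample, and the number of groups would need to be chosen from the data rather than fixed in advance. Whether grouping this way actually helps depends on the downstream analysis, which is why the questions above still matter.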