I have about 100 paired-end NGS exome samples, each with an aligned BAM file. I would like to group them by similarity, or by some other 'distance' metric, using a clustering algorithm. The goal is to reduce batch effects in downstream structural variant analysis if we analyze all samples together. The only plausible way I can see to reduce this batch effect is to divide the samples into groups.
Samples within each group should be highly correlated with one another.
Which parameter from the BAM files should I use to group these samples, say with K-means clustering?
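To make the question concrete, here is a rough sketch of what I have in mind. It assumes the feature is per-target read depth, which is only my guess at a suitable parameter and is exactly what I am asking about. The `depth` matrix (samples x exome targets) is synthetic here; in practice it would have to be computed from the BAM files, for example with something like `samtools bedcov`. The function names `normalize_coverage` and `kmeans` are my own, and the farthest-first initialization is just a simple deterministic choice, not a recommendation.

```python
import numpy as np

def normalize_coverage(depth):
    """Turn a samples x targets depth matrix into comparable profiles."""
    depth = np.asarray(depth, dtype=float)
    # Divide each sample by its own median depth to remove library-size differences.
    scaled = depth / np.median(depth, axis=1, keepdims=True)
    # Log-transform (small pseudo-count avoids log of zero), then centre each target.
    logged = np.log2(scaled + 0.01)
    return logged - logged.mean(axis=0, keepdims=True)

def kmeans(X, k, n_iter=100):
    """Plain K-means with deterministic farthest-first initialization."""
    centers = [X[0]]
    for _ in range(1, k):
        # Squared distance of every sample to its nearest chosen centre.
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)
    for _ in range(n_iter):
        # Assign each sample to the closest centre.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Recompute centres; keep the old centre if a cluster ends up empty.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

# Synthetic example (not real data): 6 samples, 50 targets, two "batches"
# that differ in how efficiently the second half of the targets is captured.
rng = np.random.default_rng(0)
noise = rng.uniform(0.98, 1.02, size=(6, 50))
depth = np.full((6, 50), 100.0)
depth[3:, 25:] = 50.0
depth = depth * noise

profiles = normalize_coverage(depth)
labels, centers = kmeans(profiles, k=2)
print(labels)
```

With real data I would replace the synthetic `depth` matrix with one row per BAM file, and the number of groups `k` would still need to be chosen somehow.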
Any suggestions would be highly appreciated.