I am trying to compare genotypes between two human cohorts: one sequenced by whole-genome sequencing (WGS, 50X) and the other sequenced with a custom targeted panel (250X). I ran a t-distributed stochastic neighbor embedding (t-SNE) analysis, and the two cohorts separate perfectly into two distinct clusters.
I suspect the separation might be due to the different technologies (WGS versus targeted sequencing).
The DNA was sequenced on an Illumina platform, and for both cohorts the SNVs were called with GATK HaplotypeCaller and recalibrated. However, the mean number of variants per sample is higher in the targeted-sequencing cohort.
I created a 0/1 matrix for the absence/presence of a variant at each genomic position reported in the VCF files of the WGS and targeted cohorts, as shown in the example below:

               Sample1  Sample2  Sample3
chr3:37428076  0        1        0
I built the final SNV list as the union of the positions from both cohorts: each cohort-specific coordinate was added to the other cohort so that both matrices have the same set of coordinates.
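To make the construction concrete, here is a minimal Python sketch of the union-of-positions matrix on toy data. It is not my actual pipeline: the function name `build_presence_matrix`, the per-sample sets, and the second coordinate (`chr7:55249071`) are hypothetical and only there for illustration. It assumes the variant positions have already been extracted from the VCFs into one set of `chrom:pos` strings per sample.

```python
from typing import Dict, List, Set, Tuple


def build_presence_matrix(
    calls: Dict[str, Set[str]]
) -> Tuple[List[str], List[str], List[List[int]]]:
    """Build a 0/1 presence matrix (rows = positions, columns = samples).

    `calls` maps a sample name to the set of positions ("chrom:pos")
    where that sample has a called variant.
    """
    samples = sorted(calls)
    # Union of all positions seen in any sample of either cohort.
    positions = sorted(set().union(*calls.values()))
    # 1 if the sample has a variant at the position, otherwise 0.
    matrix = [[1 if pos in calls[s] else 0 for s in samples] for pos in positions]
    return samples, positions, matrix


# Toy input (hypothetical): two WGS samples and one panel sample.
wgs_calls = {"Sample1": set(), "Sample2": {"chr3:37428076"}}
panel_calls = {"Sample3": {"chr7:55249071"}}

# Merging the two cohorts gives every sample the same set of rows.
samples, positions, matrix = build_presence_matrix({**wgs_calls, **panel_calls})

print("position", *samples, sep="\t")
for pos, row in zip(positions, matrix):
    print(pos, *row, sep="\t")
```

Note that with this construction a 0 only means "no variant was reported"; it does not distinguish a reference genotype from a position that was not covered or not called in that cohort.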
Does anyone know how to perform this kind of comparison?