Hello everyone! Let me provide a bit of context first. I'm performing a principal component analysis on a two-group sample (groups 1 and 2). My data are two separate VCF files, one from each group. To do the PCA in Plink, I needed to generate a single VCF file with the individuals from both groups. For merging, I used the vcf-merge command from vcftools, which seemed to run correctly.
The problem: after merging both files, running the PCA and plotting the result in R, I noticed the graph was odd (you can see it below), and a labmate told me "ohh, that's a classic merging error, I've seen it before.. but I don't remember much right now. see if you did something wrong in the merging". I'm new to bioinformatics, so I have looked again and again but I can't find the error. The commands ran smoothly at each step, and I don't have enough knowledge yet to spot the mistake. Since my labmate called it a "classic rookie mistake", I decided to post the question here.
Here is every step of the process, in case you can spot the problem, followed by the final graph.
./bgzip group1.vcf
./tabix group1.vcf.gz
./bgzip group2.vcf
./tabix group2.vcf.gz
vcf-merge group1.vcf.gz group2.vcf.gz | bgzip -c > bothmerge.vcf.gz
./plink --vcf bothmerge.vcf.gz --pca --out bothmergepca
(Then I loaded the bothmergepca.eigenvec file in R and plotted the first principal component against the second.)
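One sanity check that might help narrow down the merge step: as far as I understand, vcf-merge by default writes a missing genotype for every sample whose input file does not contain a given site, so if the two VCFs share few positions, the merged file ends up with missingness that differs by group. The sketch below counts how many sites the two inputs share. It assumes only standard POSIX tools, compares CHROM and POS only (alleles are ignored), and the helper names vcf_sites and site_overlap are hypothetical, not part of vcftools or Plink.

```shell
#!/bin/sh
# Hypothetical helper: print one sorted "CHROM:POS" key per variant site
# in a VCF, whether plain text or gzip/bgzip compressed.
vcf_sites() {
    case "$1" in
        *.gz) gzip -dc "$1" ;;   # bgzip output is gzip-compatible
        *)    cat "$1" ;;
    esac | grep -v '^#' | cut -f1,2 | tr '\t' ':' | LC_ALL=C sort -u
}

# Hypothetical helper: report how many sites are unique to each file
# and how many are present in both.
site_overlap() {
    a=$(mktemp)
    b=$(mktemp)
    vcf_sites "$1" > "$a"
    vcf_sites "$2" > "$b"
    echo "only_in_first=$(LC_ALL=C comm -23 "$a" "$b" | wc -l | tr -d ' ')"
    echo "only_in_second=$(LC_ALL=C comm -13 "$a" "$b" | wc -l | tr -d ' ')"
    echo "shared=$(LC_ALL=C comm -12 "$a" "$b" | wc -l | tr -d ' ')"
    rm -f "$a" "$b"
}
```

With the functions defined, the call would be site_overlap group1.vcf.gz group2.vcf.gz. A small "shared" count relative to the other two would suggest that most genotypes in bothmerge.vcf.gz are missing for one group or the other.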
I expected the graph to look like a cloud of 2000 dots. Note: I have run PCA and plotted the results in R before, so I'm more familiar with those steps and I'm fairly certain the mistake is not there.
You can see the graph here: https://ibb.co/cRMS5k
I hope someone can help me, or at least give me a hint. Thank you for your time!