Dear all:
I have done a PCA using R but I am getting extreme values for the first principal component.
The steps that I performed were:
I had a genotype file coded as: 0, 1, 2 or NA.
I replaced the NAs with 1 (heterozygous), so that after recoding the matrix to -1, 0 and 1 the NAs become 0.
I created the G matrix (VanRaden method) and applied the following command to the G matrix:
mypca = prcomp(G, center=TRUE)
When I plotted the first and second principal components I noticed huge values for PC1. When I plotted PC2 against PC3 I saw the pattern I was expecting.
Do I need to scale the G matrix? What could be causing those huge values for PC1? Could the NA genotypes that I replaced cause this big effect?
Any help would be very much appreciated. Thanks. Paula.
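For readers who want to reproduce the steps described above, here is a minimal sketch in R. It uses a simulated toy genotype matrix (not the poster's data) and assumes VanRaden's first method with allele frequencies estimated from the observed genotypes. The names M, W, Z, eig and pcs are made up for illustration. Note that prcomp(G) treats the rows of G as observations and its columns as variables, which is not the same thing as an eigendecomposition of G itself, so it may be worth comparing the two.

```r
# Toy genotype matrix (simulated): rows = animals, columns = SNPs, coded 0/1/2.
set.seed(1)
n_animals <- 20
n_snps <- 200
p_true <- runif(n_snps, 0.1, 0.9)
M <- sapply(p_true, function(p) rbinom(n_animals, 2, p))
M[sample(length(M), 40)] <- NA  # sprinkle in some missing genotypes

# Allele frequencies from the observed genotypes only (an assumption of this sketch).
p <- colMeans(M, na.rm = TRUE) / 2

# Recode to -1/0/1 and set NAs to 0, the heterozygote code, as in the post.
W <- M - 1
W[is.na(W)] <- 0

# VanRaden method 1: Z = W - (2p - 1), G = ZZ' / (2 * sum(p * (1 - p))).
Z <- sweep(W, 2, 2 * p - 1)
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))

# Option A: the call from the post (PCA with the columns of G as variables).
pca_prcomp <- prcomp(G, center = TRUE)

# Option B: eigendecomposition of G itself.
eig <- eigen(G, symmetric = TRUE)
pcs <- eig$vectors %*% diag(sqrt(pmax(eig$values, 0)))  # scaled PC scores
var_explained <- eig$values / sum(eig$values)
```

Plotting pcs[, 1] against pcs[, 2] next to pca_prcomp$x[, 1:2] shows whether the two approaches agree on a given data set.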
How much missing data do you have?
Hey Sean, I excluded animals with more than 3% missing genotypes and SNPs with more than 5% missing. So I don't have that much missing information, and I performed the PCA on the genomic relationship matrix. I don't understand why my first PC has extreme values. When I plot the second and third PCs I get exactly what I was expecting for the first and second. It looks like the first PC is capturing some error or something else that I have not figured out yet. Thanks.
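The filtering described above, and the remaining missing rate Sean asked about, could be checked with something like the following sketch. The function filter_missing is a hypothetical helper, the data are simulated, and filtering animals before SNPs is an assumption, since the post does not say which order was used.

```r
# Hypothetical helper: drop animals (rows), then SNPs (columns), with too many NAs.
filter_missing <- function(M, max_animal_miss = 0.03, max_snp_miss = 0.05) {
  keep_animals <- rowMeans(is.na(M)) <= max_animal_miss
  M <- M[keep_animals, , drop = FALSE]
  keep_snps <- colMeans(is.na(M)) <= max_snp_miss
  M[, keep_snps, drop = FALSE]
}

# Simulated genotypes: 50 animals by 100 SNPs, coded 0/1/2.
set.seed(2)
M <- matrix(rbinom(50 * 100, 2, 0.5), nrow = 50)
M[1, 1:10] <- NA  # animal 1 has 10% missing genotypes, above the 3% limit
M[2:10, 5] <- NA  # SNP 5 is missing in 9 animals, above the 5% limit

M_filtered <- filter_missing(M)
dim(M_filtered)          # animals and SNPs kept
mean(is.na(M_filtered))  # overall fraction of genotypes still missing
```

The last line gives a single number to report when someone asks how much missing data is left after filtering.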