Question

Extreme values for first PC - genomic data

8.9 years ago

Paula Sanchez • 0

Dear all:

I have done a PCA using R but I am getting extreme values for the first principal component.

The steps that I performed were:

I had a genotype file coded as: 0, 1, 2 or NA.

I replaced the NAs with 1 (heterozygous), because I then recoded the matrix to -1, 0 and 1, so the NAs become zero.

I created the G matrix (VanRaden method) and applied the following command to the G matrix:

mypca = prcomp(G, center=TRUE)
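For readers who want to reproduce the steps above, here is a minimal sketch of one way they could look in R. It assumes VanRaden's first method (Z = M - P, scaled by 2Σp(1-p)), which the post does not specify, and it uses simulated genotypes; `geno`, `freq`, `M`, `p` and `Z` are hypothetical names, not taken from the original analysis.

```r
set.seed(1)

# Hypothetical toy data: 20 animals x 200 SNPs coded 0/1/2, with some NAs
n_animals <- 20
n_snps <- 200
freq <- runif(n_snps, 0.1, 0.9)
geno <- sapply(freq, function(p) rbinom(n_animals, 2, p))
geno[sample(length(geno), 40)] <- NA

# Replace missing genotypes with 1 (heterozygous), as described in the post
geno[is.na(geno)] <- 1

# Recode 0/1/2 to -1/0/1, so the imputed genotypes become 0
M <- geno - 1

# Assumed VanRaden method 1: Z = M - P, where P_j = 2 * (p_j - 0.5)
p <- colMeans(geno) / 2             # frequency of the counted allele per SNP
Z <- sweep(M, 2, 2 * (p - 0.5))     # subtract P from each SNP column
G <- tcrossprod(Z) / (2 * sum(p * (1 - p)))

# PCA on G, using the same call as in the post
mypca <- prcomp(G, center = TRUE)
```

Note that `prcomp(G)` treats the rows of G as observations and its columns as variables, which is not the same operation as an eigendecomposition of G itself.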

When I plotted the first and second principal components I noticed huge values for PC1. When I plotted PC2 and PC3 I observed what I was expecting.

Do I need to scale the G matrix? What could be causing those huge values for PC1? Could the NA genotypes that I replaced cause this big effect?
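A rough sketch of how the PC scores described here could be inspected is given below. The `G` matrix in it is a simulated stand-in (an assumption, not the poster's data), and `var_explained` is a hypothetical helper name; only `prcomp`, `plot` and base R functions are used.

```r
set.seed(2)

# Hypothetical stand-in for the genomic relationship matrix G (20 animals)
Z <- matrix(rnorm(20 * 200), nrow = 20)
G <- tcrossprod(Z) / ncol(Z)

mypca <- prcomp(G, center = TRUE)

# Proportion of variance captured by each component
var_explained <- mypca$sdev^2 / sum(mypca$sdev^2)
print(round(head(var_explained, 3), 3))

# Range of the scores on the first three components
print(apply(mypca$x[, 1:3], 2, range))

# The two plots mentioned in the post
plot(mypca$x[, 1], mypca$x[, 2], xlab = "PC1", ylab = "PC2")
plot(mypca$x[, 2], mypca$x[, 3], xlab = "PC2", ylab = "PC3")

# Row indices of the animals with the largest absolute PC1 scores
print(head(order(abs(mypca$x[, 1]), decreasing = TRUE)))
```

Listing the animals with the largest absolute PC1 scores is one simple way to see whether a handful of individuals account for the extreme values.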

Any help would be very much appreciated. Thanks. Paula.

pca SNP R • 1.7k views

updated 16 months ago by Ram 43k • written 8.9 years ago by Paula Sanchez • 0

How much missing data do you have?

updated 16 months ago by Ram 43k • written 8.9 years ago by Sean Davis 26k

Hey Sean, I excluded animals with more than 3% missing genotypes and SNPs with more than 5% missing, so I don't have that much missing information. I also performed the PCA on the genomic relationship matrix. I don't understand why my first PC has extreme values. When I plot the second and third PCs I get exactly what I was expecting for the first and second. It looks like the first PC is capturing some error or something else that I haven't identified yet. Thanks.

8.9 years ago by Paula Sanchez • 0