Question

Principal Component Analysis

0

7.2 years ago

krishnapashu912 ▴ 40

Hi,

Could anyone help me out with the following question?

I want to perform Principal Component Analysis (PCA) on genotype input data for SNPTEST.

I know how to perform PCA on the type of genotype data where SNPs are just the genotypes (coded as 0, 1 or 2).

However, in the file format for SNPTEST, each SNP is represented as a set of three probabilities which correspond to the allele pairs AA, AB and BB. How can I perform PCA on this data?

I was thinking of applying some threshold, for example 0.9, and selecting genotypes that have probability >= 0.9. I would drop the SNPs that do not have any genotype with at least 0.9 probability. I am not sure if this approach is valid!
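
The thresholding idea above could look roughly like this in Python. This is only a sketch: it assumes the five leading columns of the GEN format (SNP ID, rsID, position, allele A, allele B) followed by one AA/AB/BB triple per sample, and the function names `hard_call` and `parse_gen_line` are made up for illustration. Calls below the cut-off are set to missing per sample here, which is one possible reading of the proposal.

```python
def hard_call(probs, threshold=0.9):
    """Return the B-allele count (0, 1 or 2) for one (P_AA, P_AB, P_BB)
    triple, or None if no genotype reaches the threshold."""
    best = max(range(3), key=lambda g: probs[g])
    return best if probs[best] >= threshold else None


def parse_gen_line(line, threshold=0.9):
    """Turn one GEN line into (rs_id, list of hard calls).

    Assumes 5 leading columns; some files carry an extra chromosome
    column, in which case the offset below would need adjusting.
    """
    fields = line.split()
    rs_id = fields[1]
    values = [float(x) for x in fields[5:]]
    if len(values) % 3 != 0:
        raise ValueError("expected three probabilities per sample")
    calls = [hard_call(values[i:i + 3], threshold)
             for i in range(0, len(values), 3)]
    return rs_id, calls


# Made-up line with three samples; the second one is ambiguous.
example = "SNP1 rs1 1000 A G 0.95 0.05 0 0.4 0.5 0.1 0 0.02 0.98"
rs_id, calls = parse_gen_line(example)
print(rs_id, calls)  # rs1 [0, None, 2]
```

The resulting 0/1/2 matrix (with missing values) can then go into whatever PCA workflow is already used for hard-called genotypes.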

I would appreciate any suggestions on this! Thank you!

best regards, Krishna

PCA GWAS SNPTEST • 3.1k views

ADD COMMENT • link updated 7.2 years ago by Vivek ★ 2.7k • written 7.2 years ago by krishnapashu912 ▴ 40

1

I've never tried this and I won't pretend to be a GWAS expert, but I would try to just run the PCA with the data as it is. You might need to "tidy" the data into the following format:

Position/genotype    Sample1    Sample2   ...
pos1_AA              0.9        0.85
pos1_AB              0.05       0.1
pos1_BB              0.05       0.05
...

I would presume that that would produce reasonable PCA results.
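
As a rough illustration of running PCA directly on a table laid out like the one above, here is a minimal sketch using NumPy's SVD. The function name `pca_on_probabilities` and all the numbers are invented for the example; it has not been checked against real SNPTEST data.

```python
import numpy as np


def pca_on_probabilities(prob_matrix, n_components=2):
    """PCA of samples from a (position/genotype x sample) probability table."""
    X = np.asarray(prob_matrix, dtype=float).T   # samples x features
    X = X - X.mean(axis=0)                       # center each feature
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    scores = U[:, :n_components] * S[:n_components]  # sample coordinates
    explained = S**2 / np.sum(S**2)              # variance fraction per PC
    return scores, explained[:n_components]


# Rows: pos1_AA, pos1_AB, pos1_BB, pos2_AA, pos2_AB, pos2_BB
# Columns: Sample1 .. Sample4 (toy values, each triple sums to 1)
probs = [
    [0.90, 0.85, 0.05, 0.02],
    [0.05, 0.10, 0.15, 0.08],
    [0.05, 0.05, 0.80, 0.90],
    [0.10, 0.05, 0.90, 0.85],
    [0.20, 0.15, 0.05, 0.10],
    [0.70, 0.80, 0.05, 0.05],
]
scores, explained = pca_on_probabilities(probs)
print(scores.shape)  # (4, 2): four samples, two components
```

Since the three probabilities for a SNP sum to one, one row per SNP is redundant, so dropping one of the three (or collapsing them to an expected allele count) should carry the same information.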

ADD REPLY • link 7.2 years ago by Devon Ryan 104k

0

Thank you Devon!

I am going to try that!

ADD REPLY • link 7.2 years ago by krishnapashu912 ▴ 40

0

7.2 years ago

GabrielMontenegro ▴ 670

ANGSD, a software package for analyzing NGS data, has an implementation of PCA based on genotype probabilities. You could give that a try: http://www.popgen.dk/angsd/index.php/PCA. It also takes sequencing depth into account, if that applies to your data.

Also, about what you propose: what do these genotype probabilities look like? If, say, one is 0.91, another 0.90 and the third 0.4, choosing the highest based on your cut-off would not be that "reliable" (for lack of a better word). Maybe some ratio test would be better?
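
One way to read the ratio-test suggestion is to call a genotype only when the best value clearly beats the runner-up. The sketch below is just that reading; the function name `ratio_call` and the default ratio of 10 are arbitrary choices for illustration, not a recommended setting.

```python
def ratio_call(probs, min_ratio=10.0):
    """Return the index of the best genotype (0=AA, 1=AB, 2=BB) if it is at
    least min_ratio times more likely than the second best, else None."""
    ranked = sorted(range(3), key=lambda g: probs[g], reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if probs[runner_up] == 0:
        # Avoid dividing by zero: accept only if the best value is non-zero.
        return best if probs[best] > 0 else None
    return best if probs[best] / probs[runner_up] >= min_ratio else None


print(ratio_call([0.91, 0.90, 0.4]))   # None: top two are nearly tied
print(ratio_call([0.95, 0.04, 0.01]))  # 0: AA is clearly favoured
```

With the values from the example above, a plain 0.9 cut-off would accept the first genotype, while the ratio test leaves it uncalled.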

ADD COMMENT • link 7.2 years ago by GabrielMontenegro ▴ 670

0

Thank you! I will give it a try!

ADD REPLY • link 7.2 years ago by krishnapashu912 ▴ 40

score 2 · Accepted Answer · 2017-01-30

2

7.2 years ago

Vivek ★ 2.7k

I think you are describing the Oxford GEN/SAMPLE format here. You can use something like qctool or gtool to convert the files to PLINK binary format and then use the standard PCA tools.