Question: Principal Component Analysis
0
2.2 years ago by
krishnapashu91220 wrote:

Hi,

Could anyone help me with the following question?

I want to perform Principal Component Analysis (PCA) on genotype input data for SNPTEST.

I know how to perform PCA on the type of genotype data where SNPs are just the genotypes (coded as 0, 1 or 2).

However, in the file format for SNPTEST, each SNP is represented per sample as a set of three probabilities, one for each of the genotypes AA, AB and BB. How can I perform PCA on this data?

I was thinking of applying a threshold, for example 0.9, and keeping only genotypes that have a probability >= 0.9. I would drop the SNPs that do not have any genotype with a probability of at least 0.9. I am not sure whether this approach is valid!
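To make the thresholding idea concrete, here is a minimal sketch in Python. The function names `hard_call` and `call_snp` are made up for illustration and are not part of SNPTEST or any other tool. It assumes each sample carries one (AA, AB, BB) probability triple per SNP, and it marks a genotype as missing (`None`) when no probability reaches the threshold, rather than deciding by itself whether the whole SNP should be dropped.

```python
from typing import List, Optional, Sequence


def hard_call(probs: Sequence[float], threshold: float = 0.9) -> Optional[int]:
    """Return the B-allele count (0=AA, 1=AB, 2=BB) of the most probable
    genotype, or None if that probability is below the threshold."""
    if len(probs) != 3:
        raise ValueError("expected three probabilities (AA, AB, BB)")
    best = max(range(3), key=lambda i: probs[i])
    return best if probs[best] >= threshold else None


def call_snp(sample_probs: Sequence[Sequence[float]],
             threshold: float = 0.9) -> List[Optional[int]]:
    """Hard-call one SNP across all samples."""
    return [hard_call(p, threshold) for p in sample_probs]


# Hypothetical probabilities for one SNP in three samples.
snp = [(0.95, 0.04, 0.01), (0.05, 0.93, 0.02), (0.50, 0.45, 0.05)]
calls = call_snp(snp)
print(calls)  # [0, 1, None]
```

The resulting 0/1/2 codes can then go into whatever PCA workflow is already used for hard-called genotypes. How many missing calls to tolerate before dropping a SNP is a separate choice that this sketch does not make.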

I would appreciate any suggestions on this! Thank you!

best regards, Krishna

snptest pca gwas • 1.4k views
ADD COMMENTlink modified 2.2 years ago by Vivek2.2k • written 2.2 years ago by krishnapashu91220
1

I've never tried this and I won't pretend to be a GWAS expert, but I would try to just run the PCA with the data as it is. You might need to "tidy" the data into the following format:

Position/genotype    Sample1    Sample2   ...
pos1_AA              0.9        0.85
pos1_AB              0.05       0.1
pos1_BB              0.05       0.05
...

I would presume that this would produce reasonable PCA results.
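A rough sketch of that reshaping step in Python is below. It assumes the common Oxford GEN layout of five leading columns (SNP ID, rsID, position, allele A, allele B) followed by three probabilities per sample. Some files carry an extra leading chromosome column, in which case the offsets would need adjusting. The helper name `tidy_gen_line` and the example line are hypothetical.

```python
from typing import Dict, List


def tidy_gen_line(line: str) -> Dict[str, List[float]]:
    """Split one GEN line into three rows (AA, AB, BB), one value per sample.

    Assumes 5 leading columns: SNP ID, rsID, position, allele A, allele B.
    """
    fields = line.split()
    position = fields[2]
    probs = [float(x) for x in fields[5:]]
    if len(probs) % 3 != 0:
        raise ValueError("probability columns are not a multiple of three")
    rows = {}
    for offset, genotype in enumerate(("AA", "AB", "BB")):
        # Every third value, starting at this genotype's offset, belongs
        # to the same genotype across consecutive samples.
        rows[f"{position}_{genotype}"] = probs[offset::3]
    return rows


# Made-up example line with two samples.
line = "snp1 rs1 pos1 A G 0.9 0.05 0.05 0.85 0.1 0.05"
for name, values in tidy_gen_line(line).items():
    print(name, *values, sep="\t")
```

Stacking these rows for every SNP gives a matrix with features in rows and samples in columns, which would typically be transposed before being handed to a PCA routine that expects samples in rows.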

ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Devon Ryan89k

Thank you Devon!

I am going to try that!

ADD REPLYlink written 2.2 years ago by krishnapashu91220
2
2.2 years ago by
Vivek2.2k
Denmark
Vivek2.2k wrote:

I think you are describing the Oxford gen/sample format here. You can use something like qctool or gtool to convert them to PLINK binary format and use the standard PCA tools.

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by Vivek2.2k

Thank you! This is exactly what I was looking for!

ADD REPLYlink written 2.2 years ago by krishnapashu91220
0
2.2 years ago by
United Kingdom
GabrielMontenegro520 wrote:

ANGSD, a software package for analyzing NGS data, has an implementation of PCA based on genotype probabilities. You could give that a try: http://www.popgen.dk/angsd/index.php/PCA. It also takes sequencing depth into account, if that applies to your data.

Also, about what you propose: what do these genotype probabilities look like? If, say, one is 0.91, another 0.90 and the third 0.4, choosing the highest based on your cut-off would not be that "reliable" (for lack of a better word). Maybe some kind of ratio test would be better?
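One way such a ratio test could look, as a sketch only: call the top genotype only if it is at least `min_ratio` times more probable than the runner-up. The function name `ratio_call` and the default ratio of 10 are arbitrary illustrative choices, not values taken from ANGSD or SNPTEST.

```python
from typing import Optional, Sequence


def ratio_call(probs: Sequence[float], min_ratio: float = 10.0) -> Optional[int]:
    """Return 0, 1 or 2 (AA, AB, BB) if the best value beats the second
    best by at least min_ratio, otherwise None (treated as missing)."""
    if len(probs) != 3:
        raise ValueError("expected three values (AA, AB, BB)")
    order = sorted(range(3), key=lambda i: probs[i], reverse=True)
    best, second = order[0], order[1]
    if probs[best] <= 0:
        return None  # no support for any genotype
    if probs[second] == 0:
        return best  # runner-up has no support at all
    return best if probs[best] / probs[second] >= min_ratio else None


print(ratio_call((0.91, 0.90, 0.4)))    # None: top two are nearly tied
print(ratio_call((0.95, 0.04, 0.01)))   # 0: AA is clearly ahead
```

Unlike a fixed cut-off, this rejects the 0.91 versus 0.90 case even though both values pass 0.9.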

ADD COMMENTlink modified 2.2 years ago • written 2.2 years ago by GabrielMontenegro520

Thank you! I will give it a try!

ADD REPLYlink written 2.2 years ago by krishnapashu91220