Question: Problems with PCA for genotyped data
19 months ago, doodle (30) wrote:

Hello,

I have two genotyped data sets, say A and B, from two very different populations. I ran PCA on A, and without any pruning it shows population clusters within the data. Pruning removes most of the SNPs.

For the second part, I have to merge A and B and run PCA on the merged data. This does not show any clusters without pruning, and pruning made little difference.

Thirdly, I ran PCA on data set B alone, and this also shows no population clusters, with or without pruning. But from my phenotype data, I know that there is variation.

I ran the PCA on bfiles in PLINK using the --pca flag.

Any suggestions please?

Thank you!

pca genotyped data • 462 views
19 months ago, Kevin Blighe (66k) wrote:

What cut-offs are you using for pruning? That is key. And why are you so convinced that there should be population structure / clusters in B?


I ran --indep-pairwise with the parameters 50 5 0.2. What can be changed here?

Reply written 19 months ago by doodle (30)

There should be clusters in B because the phenotype data show that the samples come from different geographical regions.

Reply written 19 months ago by doodle (30)

Has it been shown already that these geographical regions have distinct genetic profiles?

Reply written 19 months ago by Kevin Blighe (66k)

Do you understand to what each of these numbers relates? You may need to adjust them based on your SNP density.

Reply written 19 months ago by Kevin Blighe (66k)

What I understand is that 50 is the window size within which highly correlated variants are removed, 5 is the step size, and 0.2 is the r² threshold. I don't understand how to adjust them based on SNP density. Can you please help with that?

Reply written 19 months ago by doodle (30)
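
To make the three numbers discussed above concrete, here is a minimal Python sketch of window-based pruning on toy data. It is not PLINK's implementation: the function names `r_squared` and `prune` are hypothetical, the genotypes are made up, and the rule for which variant of a correlated pair gets dropped (always the later one) is a simplification chosen for this illustration. The window and step are counted in variants.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two genotype vectors (0/1/2 allele counts)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # monomorphic variant: correlation undefined, treated as 0 here
    return cov * cov / (var_x * var_y)


def prune(genotypes, window, step, r2_threshold):
    """Return indices of variants kept after a simplified sliding-window prune.

    genotypes: list of variants, each a list of per-sample allele counts.
    window, step: measured in number of variants.
    """
    kept = set(range(len(genotypes)))
    start = 0
    while start < len(genotypes):
        end = min(start + window, len(genotypes))
        in_window = [i for i in range(start, end) if i in kept]
        for i, j in combinations(in_window, 2):
            if i in kept and j in kept:
                if r_squared(genotypes[i], genotypes[j]) > r2_threshold:
                    kept.discard(j)  # simplification: always drop the later variant
        start += step
    return sorted(kept)


# Toy data (made up): variants 0, 1 and 3 are identical, variant 2 is uncorrelated with them.
toy = [
    [0, 1, 2, 0, 1, 2],
    [0, 1, 2, 0, 1, 2],
    [2, 0, 1, 1, 0, 2],
    [0, 1, 2, 0, 1, 2],
]

small_window = prune(toy, window=2, step=1, r2_threshold=0.2)
large_window = prune(toy, window=4, step=1, r2_threshold=0.2)
```

With the small window, variants 0 and 3 never fall inside the same window, so their perfect correlation goes undetected and variant 3 survives. The larger window compares them directly and removes it. The same effect is what makes the window size sensitive to how densely the markers are packed.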

Which array data is it?

Reply written 19 months ago by Kevin Blighe (66k)

Both data sets are imputed. A has 38 million markers and B has 28 million. Together they have about 64 million markers. Both were genotyped on Illumina arrays: A on the Illumina Infinium GSA and B on a slightly older version; I'm not sure which one.

Reply modified 19 months ago • written 19 months ago by doodle (30)
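
As a rough back-of-the-envelope check, the marker counts quoted above can be turned into an average marker spacing, and from there into the approximate physical span of a window that is counted in variants. The sketch below assumes a human genome of roughly 3.1 Gb and perfectly even marker spacing; both are simplifications, and the function names are hypothetical. The 1000-variant window is included only as a larger value for comparison.

```python
GENOME_LENGTH_BP = 3.1e9  # assumption: approximate length of the human genome


def mean_spacing_bp(n_markers, genome_length_bp=GENOME_LENGTH_BP):
    """Average distance between adjacent markers, assuming even spacing."""
    return genome_length_bp / n_markers


def window_span_kb(window_variants, n_markers):
    """Approximate physical span (kb) covered by a window counted in variants."""
    return window_variants * mean_spacing_bp(n_markers) / 1000


# Marker counts as stated in the thread.
for label, n_markers in [("A", 38_000_000), ("B", 28_000_000), ("merged", 64_000_000)]:
    print(
        label,
        f"spacing ~{mean_spacing_bp(n_markers):.0f} bp,",
        f"50-variant window ~{window_span_kb(50, n_markers):.1f} kb,",
        f"1000-variant window ~{window_span_kb(1000, n_markers):.1f} kb",
    )
```

Under these assumptions, a 50-variant window on imputed data of this density covers only a few kilobases, while a 1000-variant window covers tens of kilobases. On a sparse array the same 50-variant window would span a far larger region.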

Illumina has many arrays of differing genotype densities.

Look at it this way: if your SNPs are spaced 100 kilobases apart across the genome, then there is not much utility in using --indep-pairwise because the SNPs are already sparsely distributed. The idea of --indep-pairwise is to prune SNPs that are in linkage disequilibrium with one another.

Another thing that you can look at is the MAF of your variants. You may want to remove rare variants, as these, by definition, will not be present in many samples and thus add minimal information to the type of analysis that you want to do.

Reply written 19 months ago by Kevin Blighe (66k)
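
The MAF suggestion above can be illustrated with a short Python sketch on toy genotypes. The function names are hypothetical, the data are made up, and the 0.1 cut-off is an arbitrary value chosen for the example; the thread does not state which threshold to use.

```python
def minor_allele_frequency(genotype):
    """MAF of one variant from per-sample allele counts (0/1/2); None marks a missing call."""
    called = [g for g in genotype if g is not None]
    if not called:
        return 0.0  # no calls at all: treat as uninformative
    alt_freq = sum(called) / (2 * len(called))
    return min(alt_freq, 1 - alt_freq)


def filter_by_maf(genotypes, min_maf):
    """Return indices of variants whose MAF is at least min_maf."""
    return [
        i for i, genotype in enumerate(genotypes)
        if minor_allele_frequency(genotype) >= min_maf
    ]


# Toy data (made up): one rare variant, one common variant, one monomorphic variant.
variants = [
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 1, 2, 1, 0, 1, 2, 1, 0, 1],
    [2, 2, 2, 2, 2, 2, 2, 2, 2, 2],
]

common = filter_by_maf(variants, min_maf=0.1)  # 0.1 is an assumed, illustrative cut-off
```

Only the second variant passes here. The first is carried by a single sample and the third does not vary at all, so neither helps to separate groups of samples in a PCA.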

Thank you so much Kevin! An --indep-pairwise cutoff of 1000 5 0.2 worked!

Sorry, I couldn't reply yesterday due to the messaging limit for new users.

Reply written 19 months ago by doodle (30)