Find particular SNPs affecting PCA
1
1
Entering edit mode
8.3 years ago
lovestowell ▴ 10

Hey there,

I have a set of ~13K SNPs across 16 individuals, 8 from each of two different species. Individuals were sequenced in two different sequencing runs. I am using the R package SNPRelate to calculate relatedness between pairs and to visualize the individuals in PCA space. The problem is that PCA is separating not only by species but also by sequencing run. This pattern persists even if I remove SNPs that are missing in many individuals.

I would like to find the particular SNPs that are separating the different sequencing runs and remove them from analyses. Any suggestions on finding the SNPs that are contributing to this pattern?

-S

PCA SNPRelate SNP • 2.4k views
ADD COMMENT
0
Entering edit mode

I'm sorry. This is my first time doing this kind of analysis, and I don't understand how the structure of your PCA object looks like. I'm using the SNPRelate package to do my PCAs, and the loading object looks like this:

List of 8
 $ sample.id : chr [1:3893] "AF346967.1" "AF346968.1" "AF346969.1" "AF346976.1" ...
 $ snp.id    : chr [1:3869] "A10005G" "A10018G" "A10032G" "A10039G" ...
 $ eigenval  : num [1:3893] 45.1 42.3 NaN NaN NaN ...
 $ snploading: num [1:2, 1:3869] -0.008308 -0.000411 0.000923 0.001307 0.001924 ...
 $ TraceXTX  : num 29776992
 $ Bayesian  : logi FALSE
 $ avefreq   : num [1:3869] 0.001027 0.000257 0.000771 0.001027 0.001027 ...
 $ scale     : num [1:3869] 44.1 88.2 51 44.1 44.1 ...
 - attr(*, "class")= chr "snpgdsPCASNPLoadingClass"

How can I extract the loadings for the PC that I want with its associated snp.id? And how is this different from the snp correlation analysis from this tutorial? Any help will be welcomed!

ADD REPLY
0
Entering edit mode

Hi rturba07, could you solve it? I'm trying to do the same, and I'm having difficulties. Any advice will help!

ADD REPLY
0
Entering edit mode

well, this project got in the backburner for awhile and i'll soon go back to it. but taking a look at what i posted and the dataset, i think what i would try to do is extract both variables $snploading and $snp.id, combine them in a data.frame and order by value, then extract the top 10 ones or something like that. does that make sense?

ADD REPLY
0
Entering edit mode

Well, that's exactly what I did. As I'm using only PC1 and PC2 to separate my populations I used the code below to create a dataframe with the loadings for PC1 and PC2: write.table(SNPLoad$snploading[1,], file="snploading.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = ",") write.table(SNPLoad$snploading[2,], file="snploading2.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = ",") write.table(SNPLoad$snp.id, file="snpid.txt", quote = FALSE, col.names = TRUE, row.names = FALSE, sep = ",") So I have now a table that looks like: SNPid PC1 PC2 63492565 0,094750109 -0,001474008 49041906 0,082402028 -0,000652574 I think this could be what I needed, but maybe I should test a subset of these SNPs in a new PCA to see if they are useful markers.

ADD REPLY
1
Entering edit mode
8.3 years ago
mkulecka ▴ 360

Your PCA object should contain something like matrix of variable loadings. If you extract it, you could see which variables are strongly correleated with PC1, PC2 etc. For example, for object of class prcomp it can be done like this:

#pcobj is PCA object
v <- pcobj$rotation
v <- as.data.frame(v)
v <- v[with(v,order(PC1)),]
v <- rbind(head(v),tail(v)) #correlation can be either negative or positive
ADD COMMENT

Login before adding your answer.

Traffic: 3000 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6