color coded PCA plot in R for a combined set
1
0
Entering edit mode
2.7 years ago
raalsuwaidi ▴ 100

hi all,

i am trying to perform a PCA on a local data set in comparison to the 1000 genome dataset. for that i merged the two sets using bcftools with the below command:

bcftools merge   --force-samples   -m none  1000genome_normalized.vcf.gz  local_normalized.vcf.gz  -Oz -o Merged.vcf.gz

after that i used plink 1.9 to perform the PCA :

plink --vcf Merged.vcf.gz  --pca var-wts  header --out Merged-plink1.9

i also did the same with plink 2.0 and got the same result.

when i got the eigenvec values and plot them through R and got the below output

enter image description here

now i need to color code the plot to show the 1000 genome samples as opposed to my local samples. how can this be performed, knowing that when i checked the eigenvec file i found that all family ids changed to 0!!! but i still have the sample IDs. so is there a way where i can group the samples and indicate to R to color code it accordingly?

for clarification, here is a row from the eigenvec file representing the 1000genome samples:

0 HG00096 -0.0117915 0.0241081 0.0132739 -0.0150777 -0.0034903 0.00428502 -0.00142438 0.0305039 -0.00026421 0.000715683 -0.000358925 0.00548679 -0.00263568 0.00154371 0.0028039 0.000966683 -0.00173037 0.00517797 -0.00036687 -0.00965849

and here is a row representing the local dataset:

0 13590 -0.0119216 0.0149963 0.00264354 -0.00849061 -0.0044652 0.00106673 -0.00368617 -0.00693559 0.0069702 0.0024315 0.00691328 -0.019868 0.0171684 -0.0075124 -0.0335678 0.0382487 0.0106169 0.0207575 -0.00615446 -0.0858812

it is clear that the 1000genome values start with alphabet while the local doesn't. will this be applicable in R? if so, then how can i use it and plot a colored scatter diagram? i am very sorry for this long question, but i have zero experience in R and wanted to make all the data available

plink pca plot • 2.0k views
ADD COMMENT
2
Entering edit mode
2.7 years ago
Sam ★ 4.7k

You should be able to get the 1000 Genome sample ID directly from here

assuming your file name for the eigen vector is eigenvec with header of FID IID PC1 PC2 ... and you downloaded the ped file as ped which has no header

library(ggplot2)
dat <- read.table("eigenvec")
ped <- read.table("ped")
dat$Group <- "Self"
dat$Group[dat$IID %in% ped$V2] <- "1000G"
ggplot(dat, aes(x=PC1, y=PC2, color=Group))+geom_point()+theme_classic()
ADD COMMENT
0
Entering edit mode

thanks for such an answer. seems very simple. but for some reason its not working for me. i get the below diagram which doesnt represent the PCA values achieved from my previous plot code enter image description here

previously i used the following code:

pcs = read.table( "Merged.eigenvec" )
plot( pcs[,3], pcs[,4], xlab = "PC1", ylab = "PC2" )

any help how i can color code it in R?

i am also putting a sample of the eigenvec file for simplicity:

0 HG00096 -0.0117915 0.0241081 0.0132739 -0.0150777 -0.0034903 0.00428502 -0.00142438 0.0305039 -0.00026421 0.000715683 -0.000358925 0.00548679 -0.00263568 0.00154371 0.0028039 0.000966683 -0.00173037 0.00517797 -0.00036687 -0.00965849
0 HG00097 -0.0113053 0.0251497 0.0152515 -0.015496 -0.004696 0.00193739 -0.00274131 0.0236092 -0.00105063 0.00149943 0.00106529 -0.00479935 0.00189297 -0.000271566 -0.00400817 0.00230891 3.42846e-05 0.00116352 -0.00228412 -0.00209324
0 HG00099 -0.0117451 0.0226343 0.0149918 -0.0138922 -0.00186789 0.00424292 -0.00201276 0.0293248 0.00444327 -0.000205344 0.0031077 0.00051605 0.000578706 -0.000361561 0.000974718 0.000264026 -0.00128277 0.00379693 -0.00110405 -0.00398658
0 HG00100 -0.0116673 0.0238363 0.0128449 -0.0162229 -0.00203551 0.00175405 -0.0050617 0.0202759 -0.000643143 0.00384422 -0.00241591 0.0074136 -0.000811983 0.00487814 -0.00315148 0.00219278 -0.000496814 0.00190259 0.000549687 -0.0020432
0 HG00101 -0.0115503 0.0238441 0.0126025 -0.0128248 -0.0028248 0.0021252 0.000207516 0.0142481 0.00280482 0.000720462 -0.00110958 0.00567276 -0.00122438 0.00180973 0.000708259 0.000350994 -0.00211251 0.00242939 -0.00332881 -0.00644632
0 HG00102 -0.0115556 0.0243618 0.0148178 -0.0124657 -0.00178883 0.003968 -0.00330565 0.0190996 0.00441863 0.00057308 0.000216066 -0.00115824 0.0016823 -0.000461557 -0.00171107 0.00057821 0.00153259 0.00119274 -0.00142554 0.00177594
0 HG00103 -0.0110027 0.0243973 0.0121508 -0.0167823 -0.00245875 0.00331687 0.000878283 0.0240025 0.000747382 -0.00169204 -0.00033244 -0.00376465 0.000356715 0.000584461 -0.00261126 0.00237357 0.00259776 -0.00102373 -0.00366591 -0.00902969
0 HG00105 -0.0110591 0.022931 0.0125505 -0.0145519 -0.00245751 0.0034476 0.000270511 0.0288075 0.0010955 -0.000676657 0.000697252 0.00478386 -0.00536402 -0.000163368 -0.00189925 0.00681636 0.0020999 0.00316904 -0.00158194 -0.00804682
0 HG00106 -0.0112527 0.0245407 0.0121229 -0.0152224 -0.00238048 0.00408118 0.00029196 0.0221795 0.00112468 0.00271623 0.00351889 0.00281814 -0.00196288 0.000316554 -0.00349278 0.00284159 -0.00438661 0.00438993 0.00315211 -0.0134236
ADD REPLY
1
Entering edit mode

It might be that your eigenvec file doesn't contain the header (I assumed you've added the header. In that case, you should do something like:

library(ggplot2)
dat <- read.table("eigenvec")
ped <- read.table("ped")
dat$Group <- "Self"
dat$Group[dat$V2 %in% ped$V2] <- "1000G"
ggplot(dat, aes(x=V3, y=V4, color=Group))+geom_point()+theme_classic()
ADD REPLY
0
Entering edit mode

thanks. it works

ADD REPLY

Login before adding your answer.

Traffic: 2052 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6