Question: How Can I Do Principal Components Analysis ?
gravatar for mary
5.3 years ago by
Bologna university
mary150 wrote:

Dear friend I have SNP data for 15 small sheep populations (20 samples for each population) and they were sequenced by ovine SNP50K, each breed sequenced on one plate. Now, I want to do CNV analysis by PennCNV. I know I should do Principal Components analysis before starting CNV analysis. Does anybody know how i can do that with my data?


cnv • 4.6k views
ADD COMMENTlink modified 5.3 years ago by Ryan D3.2k • written 5.3 years ago by mary150

EIGENSTRAT can do it.

ADD REPLYlink written 5.3 years ago by Maxime Lamontagne1.9k

Nowhere in the PennCNV documentation do I see a requirement to perform PCA before any other type of analysis. What makes you think this is necessary? Do you know exactly what it is that you want to do and which data are required as input?

ADD REPLYlink written 5.3 years ago by Neilfws47k

CNV analysis itself does not require PCA. But it is a good idea, having different populations mixed in the same dataset, to check (e.g. for population structure) with PCA or other exploratory techniques.

ADD REPLYlink written 5.3 years ago by

Sure, PCA is a useful, general method for multivariate data. I just want to check that when the questioner says "I know I should do Principal Components analysis before start to analysis CNV", they really know what they are doing.

ADD REPLYlink written 5.3 years ago by Neilfws47k

I agree with both your comments, my intention was just to clarify that even if not required by pennCNV, PCA is a useful step.

ADD REPLYlink written 5.3 years ago by

Its a minor issue. But in my opinion you should not describe your approach here as sequencing. You are genotyping by SNP array. Sequencing is another approach you could use to determine SNP genotypes. There is a nice explanation of sequencing versus genotyping at 23andMe.

ADD REPLYlink modified 5.3 years ago • written 5.3 years ago by Obi Griffith16k
gravatar for Matt Shirley
5.3 years ago by
Matt Shirley8.0k
Cambridge, MA
Matt Shirley8.0k wrote:

It is a good QC step to perform PCA before CNV calling. You'll want to read the LogR ratio values from your signal files into R:

LRR <- c()
for (file in file_list) {
    x <- read.table(file, header=TRUE, sep="\t")
    LRR <- cbind(LRR, x$Log.R.Ratio)

Then use the builtin princomp function and plot your first and second principle components:

PCA <- princomp(LRR)
biplot(PCA[,1], PCA[,2])

There are probably issues with the R code above, and if I have time I'll actually run it and fix any mistakes, but this should get you started. If you don't see any discreet groupings of subsets or individual samples of your data, then you may keep all of your arrays in the analysis. If you see any discreet groupings, you might inspect the arrays corresponding to those groupings and possibly remove them before CNV calling.

ADD COMMENTlink written 5.3 years ago by Matt Shirley8.0k
gravatar for Ryan D
5.3 years ago by
Ryan D3.2k
Ryan D3.2k wrote:

One great piece of software which does principal components and determines the best fit of your data is CNVtools, by Chris Barnes and Vincent Plagnol. This was software developed to handle batch issues that frequently arise in the course of calling CNVs. The paper describing it is in Nature Genetics.

You can just follow the short vignette given on the above page. As long as you specify your batches, you will have a high chance of being able to call CNVs properly. The software runs you through a nice little pipeline of steps for CNV analysis including plotting the raw signal and 1st principal component of your CNVs, selecting the model with the correct number of principal components, improving the calling of CNVs using this information, and testing the association of your CNVs with qualitative or continuous traits. first principal component from the intensity data from different probes. This, by downweighting probe intensities uncorrelated with the remainder, generally gives a better separation of different copy numbers than the mean or median of all of the probe intensities.

The downside of the code is that it is written to do one CNV at a time. You must know the probes that comprise your CNV. We have modified the file to do batches of CNVs, but you still must know where your CNV starts and ends. This is different than CNV calling of course (which would use software like PennCNV or Birdsuite). It is much more rigorous and reviewers at top journals will ask you for this level of validation.

If you really want to plot other components, you can use this code together with CNVtools as I illustrate below:


data= read.table("gwas_raw_data.txt", header=T, sep="\t")
dim(data); names(data)

raw.signal <- as.matrix(data[, -(1:7)])
dimnames(raw.signal)[[1]] <- data$Sample_ID

mean.signal <- apply(raw.signal, MAR=1, FUN=mean)
pca.signal <- apply.pca(raw.signal)
hist(mean.signal, breaks=50, main='Mean signal', cex.lab=1.3)
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)

aa0 = prcomp(raw.signal)
tmp1 = (aa0$x[,1] - mean(aa0$x[,1])) / sd(aa0$x[,1])
plot(pca.signal, tmp1)

aa = prcomp(raw.signal, center=F, scale=T)
tmp2 = (aa$x[,1] - mean(aa$x[,1])) / sd(aa$x[,1])
plot(pca.signal, tmp2)

PC1 = aa$x[,1]; PC2 = aa$x[,2]; PC3 = aa$x[,3]
idx1 = data$DNA_source=="batch1"
idx2 = data$DNA_source=="batch2" 
idx3 = data$DNA_source=="batch3"

plot(PC1, PC2, type='n', main="All sources")
points(PC1[idx1], PC2[idx1])
points(PC1[idx2], PC2[idx2], col=2)
points(PC1[idx3], PC2[idx3], col=3)
plot(PC1, PC2, type='n', main="batch1")
points(PC1[idx1], PC2[idx1])
plot(PC1, PC2, type='n', main="batch2")
points(PC1[idx2], PC2[idx2], col=2)
plot(PC1, PC2, type='n', main="batch3")
points(PC1[idx3], PC2[idx3], col=3)
ADD COMMENTlink written 5.3 years ago by Ryan D3.2k
gravatar for Damian Kao
5.3 years ago by
Damian Kao14k
Damian Kao14k wrote:

I posted some code on how to do it with python and matplotlib a while back:

ADD COMMENTlink written 5.3 years ago by Damian Kao14k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1278 users visited in the last hour