Question: How Can I Do Principal Components Analysis ?
gravatar for mary
6.2 years ago by
Bologna university
mary190 wrote:

Dear friend I have SNP data for 15 small sheep populations (20 samples for each population) and they were sequenced by ovine SNP50K, each breed sequenced on one plate. Now, I want to do CNV analysis by PennCNV. I know I should do Principal Components analysis before starting CNV analysis. Does anybody know how i can do that with my data?


cnv • 5.6k views
ADD COMMENTlink modified 7 months ago by thaihoabo70 • written 6.2 years ago by mary190

EIGENSTRAT can do it.

ADD REPLYlink written 6.2 years ago by Maxime Lamontagne2.1k

Nowhere in the PennCNV documentation do I see a requirement to perform PCA before any other type of analysis. What makes you think this is necessary? Do you know exactly what it is that you want to do and which data are required as input?

ADD REPLYlink written 6.2 years ago by Neilfws48k

CNV analysis itself does not require PCA. But it is a good idea, having different populations mixed in the same dataset, to check (e.g. for population structure) with PCA or other exploratory techniques.

ADD REPLYlink written 6.2 years ago by

Sure, PCA is a useful, general method for multivariate data. I just want to check that when the questioner says "I know I should do Principal Components analysis before start to analysis CNV", they really know what they are doing.

ADD REPLYlink written 6.2 years ago by Neilfws48k

I agree with both your comments, my intention was just to clarify that even if not required by pennCNV, PCA is a useful step.

ADD REPLYlink written 6.2 years ago by

Its a minor issue. But in my opinion you should not describe your approach here as sequencing. You are genotyping by SNP array. Sequencing is another approach you could use to determine SNP genotypes. There is a nice explanation of sequencing versus genotyping at 23andMe.

ADD REPLYlink modified 6.2 years ago • written 6.2 years ago by Obi Griffith17k
gravatar for Matt Shirley
6.2 years ago by
Matt Shirley8.7k
Cambridge, MA
Matt Shirley8.7k wrote:

It is a good QC step to perform PCA before CNV calling. You'll want to read the LogR ratio values from your signal files into R:

LRR <- c()
for (file in file_list) {
    x <- read.table(file, header=TRUE, sep="\t")
    LRR <- cbind(LRR, x$Log.R.Ratio)

Then use the builtin princomp function and plot your first and second principle components:

PCA <- princomp(LRR)
biplot(PCA[,1], PCA[,2])

There are probably issues with the R code above, and if I have time I'll actually run it and fix any mistakes, but this should get you started. If you don't see any discreet groupings of subsets or individual samples of your data, then you may keep all of your arrays in the analysis. If you see any discreet groupings, you might inspect the arrays corresponding to those groupings and possibly remove them before CNV calling.

ADD COMMENTlink written 6.2 years ago by Matt Shirley8.7k
gravatar for Ryan D
6.2 years ago by
Ryan D3.3k
Ryan D3.3k wrote:

One great piece of software which does principal components and determines the best fit of your data is CNVtools, by Chris Barnes and Vincent Plagnol. This was software developed to handle batch issues that frequently arise in the course of calling CNVs. The paper describing it is in Nature Genetics.

You can just follow the short vignette given on the above page. As long as you specify your batches, you will have a high chance of being able to call CNVs properly. The software runs you through a nice little pipeline of steps for CNV analysis including plotting the raw signal and 1st principal component of your CNVs, selecting the model with the correct number of principal components, improving the calling of CNVs using this information, and testing the association of your CNVs with qualitative or continuous traits. first principal component from the intensity data from different probes. This, by downweighting probe intensities uncorrelated with the remainder, generally gives a better separation of different copy numbers than the mean or median of all of the probe intensities.

The downside of the code is that it is written to do one CNV at a time. You must know the probes that comprise your CNV. We have modified the file to do batches of CNVs, but you still must know where your CNV starts and ends. This is different than CNV calling of course (which would use software like PennCNV or Birdsuite). It is much more rigorous and reviewers at top journals will ask you for this level of validation.

If you really want to plot other components, you can use this code together with CNVtools as I illustrate below:


data= read.table("gwas_raw_data.txt", header=T, sep="\t")
dim(data); names(data)

raw.signal <- as.matrix(data[, -(1:7)])
dimnames(raw.signal)[[1]] <- data$Sample_ID

mean.signal <- apply(raw.signal, MAR=1, FUN=mean)
pca.signal <- apply.pca(raw.signal)
hist(mean.signal, breaks=50, main='Mean signal', cex.lab=1.3)
hist(pca.signal, breaks=50, main='First PCA signal', cex.lab=1.3)

aa0 = prcomp(raw.signal)
tmp1 = (aa0$x[,1] - mean(aa0$x[,1])) / sd(aa0$x[,1])
plot(pca.signal, tmp1)

aa = prcomp(raw.signal, center=F, scale=T)
tmp2 = (aa$x[,1] - mean(aa$x[,1])) / sd(aa$x[,1])
plot(pca.signal, tmp2)

PC1 = aa$x[,1]; PC2 = aa$x[,2]; PC3 = aa$x[,3]
idx1 = data$DNA_source=="batch1"
idx2 = data$DNA_source=="batch2" 
idx3 = data$DNA_source=="batch3"

plot(PC1, PC2, type='n', main="All sources")
points(PC1[idx1], PC2[idx1])
points(PC1[idx2], PC2[idx2], col=2)
points(PC1[idx3], PC2[idx3], col=3)
plot(PC1, PC2, type='n', main="batch1")
points(PC1[idx1], PC2[idx1])
plot(PC1, PC2, type='n', main="batch2")
points(PC1[idx2], PC2[idx2], col=2)
plot(PC1, PC2, type='n', main="batch3")
points(PC1[idx3], PC2[idx3], col=3)
ADD COMMENTlink written 6.2 years ago by Ryan D3.3k
gravatar for Damian Kao
6.2 years ago by
Damian Kao15k
Damian Kao15k wrote:

I posted some code on how to do it with python and matplotlib a while back:

ADD COMMENTlink written 6.2 years ago by Damian Kao15k
gravatar for thaihoabo
7 months ago by
thaihoabo70 wrote:

enter image description here You probably can work with R to do PCA. But be careful whether to use with/without centering, scaling, or log. It depends on the way you calculated CNV. Is it raw count or normalized? But anyway, maybe it'd be less painful if you use a tool, which has all necessary parameters:

ADD COMMENTlink modified 7 months ago • written 7 months ago by thaihoabo70
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1083 users visited in the last hour