Question: MethylKit Error using PCASamples: Error in plot.window(...) : need finite 'xlim' values
0
gravatar for lucas.t
18 months ago by
lucas.t0
lucas.t0 wrote:

Hi,

I am using MethylKit to analyse gene methylation data and I am having trouble getting the PCASamples function to run. Initial sample files come from Bismark Coverage2Cytosine.

Using R 3.3.3, MethylKit 1.0.0

genome_fasta="~/Mus_musculus.GRCm38.fa"
input_files_base_dir="~"

file_list=file.path(input_files_base_dir,c("Control1.txt","Control2.txt","Control3.txt","Control4.txt","Control5.txt","Control6.txt","Test1.txt","Test2.txt","Test3.txt","Test4.txt","Test5.txt","Test6.txt")) 

ids = c("Control1","Control2","Control3","Control4","Control5","Control6","Test1","Test2","Test3","Test4","Test5","Test6")

treatment = c(0,0,0,0,0,0,1,1,1,1,1,1)

methData <- methRead( 
    location   = file_list %>% as.list, 
    sample.id  = ids %>% as.list, 
    assembly   = genome_fasta, 
    dbtype     = NA,
    pipeline   = "bismarkCytosineReport",   
    header   = FALSE,
    context    = "CpG",
    resolution = "base",
    treatment  = treatment,
    dbdir      = getwd(),
    mincov     = 10                      
)

meth     = unite(methData,destrand=FALSE)

PCAplot <- PCASamples(meth)
pdf(file="PCAPlot.pdf")
plot(PCAplot)

Error in plot.window(...) : need finite 'xlim' values
In addition: Warning messages:
1: In min(x) : no non-missing arguments to min; returning Inf
2: In max(x) : no non-missing arguments to max; returning -Inf
3: In min(x) : no non-missing arguments to min; returning Inf
4: In max(x) : no non-missing arguments to max; returning -Inf

I checked if I had any NA values in the columns using the following

M <- sapply(meth, function(x) sumis.na(x)))

However, it did not find any NA values. I'm pretty new to MethylKit and R so I'm looking for any ideas or next steps.

Thanks in advance!

methyl-seq R • 1.3k views
ADD COMMENTlink modified 18 months ago • written 18 months ago by lucas.t0

What are the contents of PCAplot?

ADD REPLYlink written 18 months ago by Kevin Blighe46k

The documentation regarding PCASamples is here, however I'm not sure how to describe it otherwise. str() and head() both return NULL but I assume PCAplot is a plot so that wouldn't work anyways. Due to the size of the initial files I'm working on a server but I don't have access to X11 forwarding so I can't use R Studio and actually see what the plot looks like. Any suggestions on how to get more data on the contents of PCAplot?

ADD REPLYlink written 18 months ago by lucas.t0

Please try base R code for the PCA plot. You can use my code, here: A: PCA plot from read count matrix from RNA-Seq

Just to see if you can actually get the plot generated.

ADD REPLYlink written 18 months ago by Kevin Blighe46k

Using the above process to get the variable "meth"

project.pca <- prcomp(t(meth))
Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric
In addition: Warning message:
In as.matrix.data.frame(x) :
  Setting class(x) to NULL;   result will no longer be an S4 object

Below is what meth looks like: str(meth)

'data.frame':   1072923 obs. of  40 variables:
Formal class 'methylBase' [package "methylKit"] with 13 slots
..@ .Data         :List of 40
.. ..$ : chr  "1" "1" "1" "1" ...
.. ..$ : int  3102686 3102699 3102703 3207635 3216495 3216496 3216614 3216615 3298814 3385053 ...
.. ..$ : int  3102686 3102699 3102703 3207635 3216495 3216496 3216614 3216615 3298814 3385053 ...
.. ..$ : chr  "+" "+" "+" "+" ...
.. ..$ : int  15 15 15 18 19 11 17 15 20 26 ...
.. ..$ : int  15 15 13 18 18 10 17 13 15 24 ...
.. ..$ : int  0 0 2 0 1 1 0 2 5 2 ...
.. ..$ : int  26 26 27 19 29 15 18 17 23 15 ...
.. ..$ : int  25 25 24 14 22 14 18 16 17 11 ...
.. ..$ : int  1 1 3 5 7 1 0 1 6 4 ...
.. ..$ : int  25 25 25 19 24 30 25 22 19 15 ...
.. ..$ : int  25 25 25 16 24 29 23 19 15 11 ...
.. ..$ : int  0 0 0 3 0 1 2 3 4 4 ...
.. ..$ : int  36 36 36 13 21 22 17 21 32 12 ...
.. ..$ : int  35 35 33 10 17 21 17 20 27 11 ...
.. ..$ : int  1 1 3 3 4 1 0 1 5 1 ...
.. ..$ : int  23 25 25 23 32 23 22 14 31 18 ...
.. ..$ : int  23 22 23 23 23 19 20 12 26 14 ...
 .. ..$ : int  0 3 2 0 9 4 2 2 5 4 ...
.. ..$ : int  27 28 28 22 17 11 11 12 32 23 ...
.. ..$ : int  27 26 28 17 16 9 11 9 25 21 ...
.. ..$ : int  0 2 0 5 1 2 0 3 7 2 ...
.. ..$ : int  23 24 23 18 14 18 14 15 23 30 ...
.. ..$ : int  23 24 23 12 13 15 13 15 20 26 ...
.. ..$ : int  0 0 0 6 1 3 1 0 3 4 ...
.. ..$ : int  21 21 23 18 21 18 16 17 22 25 ...
.. ..$ : int  21 21 21 14 17 14 15 16 17 22 ...
.. ..$ : int  0 0 2 4 4 4 1 1 5 3 ...
.. ..$ : int  27 28 28 24 29 30 21 22 15 11 ...
.. ..$ : int  26 27 27 16 27 28 21 20 13 7 ...
.. ..$ : int  1 1 1 8 2 2 0 2 2 4 ...
.. ..$ : int  20 20 20 23 21 17 13 21 19 19 ...
.. ..$ : int  20 19 19 22 16 16 12 20 14 16 ...
.. ..$ : int  0 1 1 1 5 1 1 1 5 3 ...
.. ..$ : int  19 21 21 22 31 21 20 20 25 16 ...
.. ..$ : int  16 20 19 20 27 20 20 18 18 11 ...
.. ..$ : int  3 1 2 2 4 1 0 2 7 5 ...
.. ..$ : int  13 13 13 23 14 29 15 23 18 21 ...
.. ..$ : int  13 13 12 21 14 24 14 21 17 15 ...
.. ..$ : int  0 0 1 2 0 5 1 2 1 6 ...
..@ sample.ids    : chr  "Control1" "Control2" "Control3" "Control4" ...
..@ assembly      : chr "~/Mus_musculus.GRCm38.fa"
..@ context       : chr "CpG"
..@ treatment     : num  0 0 0 0 0 0 1 1 1 1 ...
..@ coverage.index: num  5 8 11 14 17 20 23 26 29 32 ...
..@ numCs.index   : num  6 9 12 15 18 21 24 27 30 33 ...
..@ numTs.index   : num  7 10 13 16 19 22 25 28 31 34 ...
..@ destranded    : logi FALSE
..@ resolution    : chr "base"
..@ names         : chr  "chr" "start" "end" "strand" ...
..@ row.names     : int  1 2 3 4 5 6 7 8 9 10 ...
..@ .S3Class      : chr "data.frame"
ADD REPLYlink modified 18 months ago • written 18 months ago by lucas.t0

So, there's the issue: PCA can only be performed on numerical data with absolutely no missing values. You need to remove all textual data / characters. In this case, it looks like the first four columns are just genomic co-ordinates - you should remove these and then try again.

Hope that this makes sense - it should work fine when you remove those.

ADD REPLYlink written 18 months ago by Kevin Blighe46k

What type of data does PCA analysis require? The integer columns are actually in groups of three per animal as the following data points: Read Coverage - # of T - # of C. The interesting information is in the ratio of C to T at the different chromosomal coordinates across animals, so would I first calculate that ratio and input that?

ADD REPLYlink written 18 months ago by lucas.t0

PCA is a mathematical transformation that is fundamentally based on variance. Therefore, it just requires a matrix of numbers (no missing values permitted) of greater than 2 X 2 dimensions. Please take a look here for an expanded explanation: A: PCA in a RNA seq analysis

So, it can be performed on anything. Performing it on read coverage and ratios does not make much sense to me. It is typically applied to, for example:

  • Unfiltered gene expression counts (microarray or RNA-seq)
  • Genotype data (encoded numerically) in order to determine genetic relationships between samples

Not sure if that helps?

ADD REPLYlink written 18 months ago by Kevin Blighe46k

Yes it does help! And now I have two follow up questions:

  1. Does PCA determine variance by row? When PCA is collecting the data on the initial variance between samples that it will reduce into the set of principal components, this initial variance is calculated by looking at the variance between values that are contained within a row? So not having the gene coordinate information doesn't really matter, PCA will still match the appropriate values to each other to calculate the variance between the samples (columns)

  2. Would PCA work with ratios? In methyl-seq the important information is calculated by the number of cytosines that do not convert into thymine, which is calculated by the number of cytosines / the read count. To use the raw counts of cytosine doesn't really make sense as it wouldn't account for different numbers of reads per gene coordinate.

Thanks for all your help Kevin!

ADD REPLYlink written 18 months ago by lucas.t0

Yes, if we have samples as rows and genes as columns:

In a nutshell, it is summarising the variance (to be more precise, covariance) of the genes across each sample. The resulting principal components (PCs) are vectors that are uncorrelated and contain a single value for each sample. The values for the samples along PC1 will be the strongest; thus, PC1 will explain the largest amount of variance in the entire dataset. PC2 is second-greatest, PC3, third, et cetera.

It is possible to extract the component loadings from each PC, which show the strength of association of each gene to each PC. Genes of high variance across your samples will have high component loadings. If you do a RNA-seq of blood versus some other tissue, then the highest variance genes would most likely be HB (haemoglobin) genes.

If you've got a matrix of numbers, then you can do PCA. I'm just not sure of the standards in methyl-seq to comment further, i.e., I don't know what the performing of PCA on these ratios would mean. If they are the standard measures in methyl-seq, then fair enough!

Most people only use PCA to check for outliers. Those who understand it better know how to dig deeper.

ADD REPLYlink written 18 months ago by Kevin Blighe46k

So I've managed to run the pca analysis using your suggested code and it returns the following:

str(project.pca)
List of 5
$ sdev    : num [1:12] 22.3 21.8 21.7 21.4 21.3 ...
$ rotation: num [1:1072923, 1:12] 0.000324 -0.000354 0.000399 -0.000328 -0.001728 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : NULL
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
$ center  : num [1:1072923] 0.978 0.967 0.937 0.834 0.873 ...
$ scale   : logi FALSE
$ x       : num [1:12, 1:12] -32.323 -7.952 -4.466 -5.436 0.188 ...
..- attr(*, "dimnames")=List of 2
.. ..$ : chr [1:12] "Control1" "Control2" "Control3" "Control4" ...
.. ..$ : chr [1:12] "PC1" "PC2" "PC3" "PC4" ...
- attr(*, "class")= chr "prcomp"

One thing I wanted to check is if I've applied your code properly. I noticed there was a transpose function in your suggested code and the presence of 12 PCAs when I have 12 samples makes little alarms go off.

ADD REPLYlink written 18 months ago by lucas.t0

Yep, you're digging a little too deep here ;)

The maximum number of principal components (PCs, also called eigenvectors) that can be produced is limited by the dimensions of your input matrix. So, the transformation just produces 12 because it's just performing matrix multiplications on your 12-sample data matrix in the background. Not sure if this helps. If you had 45 samples, you'd get 45 eigenvectors.

A little unknown fact: the prcomp() function in R, which advertises itself as performing principal components analysis (PCA), is actually performing a data transformation called single value decomposition (SVD). Coincidentally, they end up producing the same values. Other quirks like this exist in statistics, such as conditional logistic regression and Cox proportional hazards, I believe.

If you're interested in looking at the formulae for PCA, you could look at the Appendix of my white paper: Haplotype Classification Using Copy Number Variation and Principal Components Analysis (I'm long gone from that institution though)

ADD REPLYlink written 18 months ago by Kevin Blighe46k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 565 users visited in the last hour