Question: Compute correlation for depth of coverage between 2 datasets with different length in R
0
gravatar for Ana
15 months ago by
Ana170
Ana170 wrote:

Hello everyone, I am trying to compute correlation between depth of coverage for individuals with homozygote genotypes vs. individuals with heterozygote genotypes.

here is the fist few lines of my datasets (note, I have only one position across all individuals):

head(homozygotes)
   chrom       pos dp  ind_id genotype_id
1      1 115258827 12 HG00099         0|0
3      1 115258827  8 HG00101         0|0
4      1 115258827  6 HG00103         0|0
8      1 115258827  2 HG00114         0|0
9      1 115258827  8 HG00115         0|0
12     1 115258827  8 HG00128         0|0

head(heterozygotes)
   chrom       pos dp  ind_id genotype_id
2      1 115258827  5 HG00100         0|1
14     1 115258827  5 HG00133         0|1
16     1 115258827  5 HG00138         0|1
19     1 115258827  2 HG00160         1|0
27     1 115258827  4 HG00232         1|0
33     1 115258827  9 HG00251         1|0

these 2 datasets differ in length. Therefore, when I simply try cor.test(homozygotes$dp,heterozygotes$dp), I get an error message:

"Error in cor : incompatible dimensions.

I have searched to find a solution, but have not been able to find a solid solution. I am now quite stuck, does anyone have any idea how can I proceed and figure this out? I would sincerely appreciate it.

Thanks a lot

ADD COMMENTlink modified 15 months ago by zx87547.9k • written 15 months ago by Ana170

you need to match the columns. Down sample larger dataset.

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k
1
gravatar for Nicolas Rosewick
15 months ago by
Belgium, Brussels
Nicolas Rosewick8.0k wrote:

As you have multiple value per each genomic position you might try to use the median or the mean per position. And then perfom the correlation. Here I will use the mean but you could use median or max depending on the question you want to answer.

library(tidyverse)

homozygotes.mean <- homozygotes %>%
  group_by(chrom,pos) %>%
  summarise(dp=mean(dp))

heterozygotes.mean <- heterozygotes %>%
  group_by(chrom,pos) %>%
  summarise(dp=mean(dp))

Then only take genomic positions found in both dataframes. For this you can use GenomicRanges :

library(GenomicRanges)

homozygotes.mean.gr <- GRanges(homozygotes.mean$chrom,IRanges(homozygotes.mean$pos,homozygotes.mean$pos))
heterozygotes.mean.gr <- GRanges(heterozygotes.mean$chrom,IRanges(heterozygotes.mean$pos,heterozygotes.mean$pos))
ov <- findOverlaps( homozygotes.mean.gr, heterozygotes.mean.gr)
homozygotes.mean.common <- homozygotes.mean[queryHits(ov),]
heterozygotes.mean.common <- heterozygotes.mean[subjectHits(ov),]

Then you can perform the correlation

cor.test(homozygotes.mean.common$dp,heterozygotes.mean.common$dp)
ADD COMMENTlink modified 15 months ago • written 15 months ago by Nicolas Rosewick8.0k

i guess this is good only if genomic positions are equal in number in both the data (hetero and homozygotes).

ADD REPLYlink written 15 months ago by cpad011211k

No it works for different number of homo and heterozygote genotyopes. findOverlaps only reports positions that are in both dataframes and filter out the others.

ADD REPLYlink modified 15 months ago • written 15 months ago by Nicolas Rosewick8.0k

okay. got it. code is taking the mean of multiple DP values (if there are any) for any given coordinates and then considers only the shared genomic coordinates between the two frames, leaving out unmatched genomic coordinates. correct?

ADD REPLYlink written 15 months ago by cpad011211k

Yes that's it. In this example I used the mean but I could also use max or median

ADD REPLYlink written 15 months ago by Nicolas Rosewick8.0k

I guess there is a typo in these lines:

IRanges(homozygotes.mean$pos,homozygotes$pos) vs IRanges(heterozygotes.mean$pos,heterozygotes.mean$pos)

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k

Yes thanks I edited now.

ADD REPLYlink written 15 months ago by Nicolas Rosewick8.0k

I actually have only one potion in each data-set. Eventually I need to fix the script and run to for 1800 files. Each file only has one position. So I am guessing I do not need the second part of your code. However, when I tried cor.test(homozygotes.mean$dp,heterozygotes.mean$dp), I get an error message :

Error in cor.test.default(homozygotes.mean$dp, heterozygotes.mean$dp) : not enough finite observations

Any idea how to solve it? thanks

ADD REPLYlink written 15 months ago by Ana170

for detailed explanation: https://stackoverflow.com/questions/24659183/r-cor-test-not-enough-finite-observations copy/pasted:

If your vectors do not contain enough non-NA values (less than 3), the function will return the error.

up vote the OP in SO. Also check for NA values in your dataset.

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k

Yes, already found this. But, my data-set does not have any NA. I am not quite how this going to work. By calculating the mean I get one mean value for 0|0 genotype and one vale for heterozygote genotype. So, I will have only 2 values for the co.test!

ADD REPLYlink modified 15 months ago • written 15 months ago by Ana170

Please look into failing data frame dimensions or the length of dp vector post calculating mean/median/max/min.

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k

Not quite sure if I get your point. The error says I need to have at least 3 non-NA values for the correlation test. cor.test(homozygotes.mean.common$dp,heterozygotes.mean.common$dp) this computes correlation between only 2 values, which I guess is the reason of error.

ADD REPLYlink written 15 months ago by Ana170

minimum of 3 non-NA values in data (each), as I understand.

ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k

No, I do not think so. I get this error from running cor.test and I run the cor.test between homozygotes.mean.common$dp (=7.61) and heterozygotes.mean.common$dp (=6.5). So only 2 values. I am not sure of this is the best approach to proceed! I am trying to find if there is any way to compute correlation for non-equal data frames, not taking the mean of each data frame and estimate correlation from the means.

ADD REPLYlink modified 15 months ago • written 15 months ago by Ana170
> hom
  chrom       pos dp  ind_id genotype_id
1     1 115258827 12 HG00099         0|0
2     1 115258827  8 HG00101         0|0
3     1 115258827  6 HG00103         0|0
4     1 115258827  2 HG00114         0|0
5     1 115258827  8 HG00115         0|0
6     1 115258827  8 HG00128         0|0
> het
  chrom       pos dp  ind_id genotype_id
1     1 115258827  5 HG00100         0|1
2     1 115258827  5 HG00133         0|1
3     1 115258827  5 HG00138         0|1
4     1 115258827  2 HG00160         1|0
5     1 115258827  4 HG00232         1|0
6     1 115258827  9 HG00251         1|0

> cor.test(hom$dp[1],het$dp[1])
Error in cor.test.default(hom$dp[1], het$dp[1]) : 
  not enough finite observations

> cor.test(hom$dp[1:2],het$dp[1:2])
Error in cor.test.default(hom$dp[1:2], het$dp[1:2]) : 
  not enough finite observations

> cor.test(hom$dp[1:3],het$dp[1:3])

    Pearson's product-moment correlation

data:  hom$dp[1:3] and het$dp[1:3]
t = NA, df = 1, p-value = NA
alternative hypothesis: true correlation is not equal to 0
sample estimates:
cor 
 NA 

Warning message:
In cor(x, y) : the standard deviation is zero
ADD REPLYlink modified 15 months ago • written 15 months ago by cpad011211k

Yes, I see you point. When to increase the dimension to 4 you will not get the error. But first, please note that the solution that Nicolas suggested computes correlation between the mean of DP for hom and het. So, it end up having 2 values and cor.test does not work for it. Second, the length of hom and het is not equal, so I will get an error message if I want to do cor.test(hom$dp,het$dp)

ADD REPLYlink written 15 months ago by Ana170

I compute the mean value per genomic position. So you will end with two dataframes with the genomic position common to both original dataframes. Each position containing the mean of the DP values for this position. Thus not one value...

ADD REPLYlink written 15 months ago by Nicolas Rosewick8.0k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1689 users visited in the last hour