Question

copynumber package in R - basic R question

3

Entering edit mode

6.5 years ago

lien ▴ 90

Hi,

I would like to use the plotFreq function that is included in the copynumber package( available via BioConductor in R). I have sequencing data that I converted to a format which I think is the right one for the plotFreq function. The package itself contains a test-dataset called "lymphoma.res" and it looks like this:

   > lymphoma.res[1:10,]
       sampleID chrom arm start.pos   end.pos n.probes    mean
    1    X01.B1     1   p   1082138  65355304       71 -0.0432
    2    X01.B1     1   p  66680940 119515493       57  0.0462
    3    X01.B2     1   p   1082138  50340843       56 -0.0442
    4    X01.B2     1   p  50638118 107342958       58  0.0378
    5    X01.B2     1   p 108332010 119515493       14 -0.0271
    6    X01.B3     1   p   1082138  66680940       72 -0.0628
    7    X01.B3     1   p  67244579  70705534        5 -0.4119
    8    X01.B3     1   p  71950650 119515493       51 -0.0355
    9    X03.B1     1   p   1082138  50340843       56 -0.0376
    10   X03.B1     1   p  50638118 119515493       72  0.0178

This works perfectly when I run the plotFreq function. However, this plotFreq doens't run when I use my own data (a). However, I think they look quite similar:

> a[1:10,]
        sampleID chrom arm start.pos  end.pos n.probes   mean
1  GC041544AR001     1   p    887078  1920182       10  3.658
2  GC043725AR013     1   p   1282158  2340523       10  4.884
3  GC041544AR001     1   p   2694253  7633559       10  5.635
4  GC041544AR001     1   p   7633560 10932033       10 12.726
5  GC041544AR001     1   p  11507192 14281220       10 24.825
6  GC041544AR001     1   p  15339635 16417326       10 14.902
7  GC041544AR001     1   p  17084389 18335685       10  8.939
8  GC041544AR001     1   p  18335686 21220367       10 25.313
9  GC041544AR001     1   p  21220368 25194985       10 16.479
10 GC041544AR001     1   p  25194986 26836784       10 14.579

When I compare both formats:

> mode(lymphoma.res)
[1] "list"
> mode(a)
[1] "list"

But when I try to run the plotFreq function with my data, I get the following error:

> plotFreq(segments=lymphoma.res, thres.gain=0.2, thres.loss=-0.1)
Hit <Return> to see next plot: 
> plotFreq(segments=a, thres.gain=0.2, thres.loss=-0.1)
Error in `[<-.data.frame`(`*tmp*`, z$chr == chrom[j], 2 + i, value = numeric(0)) : 
  replacement has length zero

I only have limited knowledge of R, so I might be missing some basic things here. But does anyone have a suggestion I could try?

Thanks a lot

copynumber R plotFreq data input • 3.9k views

ADD COMMENT • link updated 6.5 years ago by cpad0112 21k • written 6.5 years ago by lien ▴ 90

0

Entering edit mode

Are you sure that you have numeric, character/text, and factor columns where they should be?

Can you post the output of the str() function on both of your objects? - there, you may see a difference

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Thanks. Great suggestion. Indeed, some columns were not correct (for example factors instead of characters and integer instead of numeric). With the help of 'Google' I managed to convert my input files such that it contains the same columns as the test-dataset. Test-dataset (lymphoma.res):

> str(lymphoma.res)
'data.frame':   1529 obs. of  7 variables:
 $ sampleID : chr  "X01.B1" "X01.B1" "X01.B2" "X01.B2" ...
 $ chrom    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arm      : chr  "p" "p" "p" "p" ...
 $ start.pos: int  1082138 66680940 1082138 50638118 108332010 1082138 67244579 71950650 1082138 50638118 ...
 $ end.pos  : int  65355304 119515493 50340843 107342958 119515493 66680940 70705534 119515493 50340843 119515493 ...
 $ n.probes : num  71 57 56 58 14 72 5 51 56 72 ...
 $ mean     : num  -0.0432 0.0462 -0.0442 0.0378 -0.0271 ...

Which corresponds to my own dataset amut:

> str(amut)
'data.frame':   168 obs. of  7 variables:
 $ sampleID : chr  "GC041544AR001" "GC043725AR013" "GC041544AR001" "GC041544AR001" ...
 $ chrom    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ arm      : chr  "p" "p" "p" "p" ...
 $ start.pos: int  887078 1282158 2694253 7633560 11507192 15339635 17084389 18335686 21220368 25194986 ...
 $ end.pos  : int  1920182 2340523 7633559 10932033 14281220 16417326 18335685 21220367 25194985 26836784 ...
 $ n.probes : num  10 10 10 10 10 10 10 10 10 10 ...
 $ mean     : num  3.66 4.88 5.63 12.73 24.82 ...

However, when I try running the plotFreq, the same error remains as before:

> plotFreq(segments=amut, thres.gain=0.2, thres.loss=-0.1)
Error in `[<-.data.frame`(`*tmp*`, z$chr == chrom[j], 2 + i, value = numeric(0)) : 
  replacement has length zero

Do you perhaps have some other excellent suggestions? Thanks!

ADD REPLY • link 6.5 years ago by lien ▴ 90

0

Entering edit mode

Yes, these issues appear sometimes, and they can be very annoying.

What I believe is the next logical step is to look at the code for the plotFreq function, and to see where (and why) it throws that particular error. To view the code for any function in R, just type the name at the command line (e.g. type plotFreq).

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

The code doesn't seem too difficult, but still too difficult for me. It seems I still don't understand why it is throwing me this error...

> plotFreq
function (segments, thres.gain, thres.loss = -thres.gain, pos.unit = "bp", 
    chrom = NULL, layout = c(1, 1), ...) 
{
    data = segments
    if (is.data.frame(data)) {
        data = getFreqData(data)
    }
    else {
        data <- pullOutContent(res = data, what = "estimates")
    }
    stopifnot(ncol(data) >= 3)
    data[, 1] <- numericChrom(data[, 1])
    type <- ifelse(is.null(chrom), "genome", "bychrom")
    chrom <- checkChrom(data = data, segments = NULL, chrom = chrom)
    if (!pos.unit %in% c("bp", "kbp", "mbp")) {
        stop("pos.unit must be one of bp, kbp and mbp", call. = FALSE)
    }
    nT <- min(length(thres.gain), length(thres.loss))
    thres.gain <- thres.gain[1:nT]
    thres.loss <- thres.loss[1:nT]
    switch(type, genome = genomeFreq(data, thres.gain, thres.loss, 
        pos.unit, layout, ...), bychrom = chromosomeFreq(data, 
        thres.gain, thres.loss, pos.unit, chrom, layout, ...))
}
<environment: namespace:copynumber>

Should I assume that there is still something missing/wrong in my input file as it says that there is an error in the data.frame?

ADD REPLY • link 6.5 years ago by lien ▴ 90

0

Entering edit mode

It is impossible to solve without having a look at your whole dataframe, I fear. If you can, try uploading it somewhere and sharing the link here. I think it could be an issue with your chromosome names: check if you have some observations for each chromosome, or if some are missing (sort(unique(a$chr)) == sort(unique(lymphoma.res$chr)) should tell you if you have the same chromosome IDs in the 2 datasets, if not you can look at the elements which differ).

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

Entering edit mode

Selenocysteine's ideas will help. You should follow up on those too.

I believe that the error lies in the code:

data = segments
    if (is.data.frame(data)) {
        data = getFreqData(data)
    }
    else {
        data <- pullOutContent(res = data, what = "estimates")
    }

getFreqData and pullOutContent are two inner functions of the package, which will also have their own code.

In order to know which one is being called, run is.data.frame(amut)

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

We actually know that amut is a dataframe from the str() output they posted above. I think it might also be something here:

chrom <- checkChrom(data = data, segments = NULL, chrom = chrom)

But the only way to know is to have access to the actual table I guess.

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

Entering edit mode

I checked the chromosome names and these should be ok: the output was the same when using sort(unique(amut$chr)) and sort(unique(lymphoma.res$chr)). Indeed amut is a data.frame, as well as lymphoma.res.

You can find the link to my test file here: https://github.com/Lien-NGSdummie/R-copynumber/issues/1

ADD REPLY • link 6.5 years ago by lien ▴ 90

1

Entering edit mode

I have isolated the error to the interpolate.pcf() function call, which is called from getFreqData(data) in your above code.

Here is the code for the getFreqData() function: https://rdrr.io/github/mathieu-lemire/copynumber38/src/R/getFreqData.r

I am now trying to understand why it throws the error.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

You'd be surprised. The name-spacing for variable types in R needs to improve. I've seen data-frames that fail the is.data.frame() function call, even when I produce them with data.frame()

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes indeed :)

Anyway I think the issue here is that lien's dataframe does not have data for each sample for every chromosome (while lymphoma.res does). For this reason, some combinations of sampleID and chr give an empty result, and this gets the function stuck. I think this should be fixable with an if-else inside interpolate.pcf()?

(see: ggplot(ares) + geom_point(aes(x=sampleID, y=as.factor(chrom))) vs ggplot(lymphoma.res) + geom_point(aes(x=sampleID, y=as.factor(chrom))) )

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

Entering edit mode

Could be! - thanks for your contribution.

Just for the OP: I have been able to successfully plot each chromosome with the following code:

plotFreq(segments=a[a$chrom=="1",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="2",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="3",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="4",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="5",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="6",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="7",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="8",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="9",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="10",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="11",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="12",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="13",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="14",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="15",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="16",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="17",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="18",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="19",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="20",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="21",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="22",], thres.gain=0.2, thres.loss=-0.1)
plotFreq(segments=a[a$chrom=="23",], thres.gain=0.2, thres.loss=-0.1)

However, when you try to run it all together, the error is thrown in interpolate.pcf() in the loop:

for (i in 1:nsamp)
{
        for (j in 1:length(chrom))
    {
            fitij = segments[segments$sampleID == usamp[i] & 
                segments$chrom == chrom[j], ]
            v = (c(fitij$start.pos[-1], 10^9) + fitij$end.pos)/2
            xj = bpts[bpts[, 1] == chrom[j], 2]
            kj = rep(0, length(xj))
            for (k in rev(1:length(v))) {
                kj[xj <= v[k]] = k
            }
            z[z$chr == chrom[j], 2 + i] = fitij$mean[kj]
        }
    }

It stops at i=1 and j=15, i.e., chromosome 15. I don't know why.

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

It is exactly for the reason I stated above. The function does:

for (i in 1:nsamp) {
    for (j in 1:length(chrom)) {
      fitij = segments[segments$sampleID==usamp[i] & segments$chrom==chrom[j],]
}

The first sampleID (i = 1) is "GC041544.AR001". The first chromosome for which you don't have observations for GC041544.AR001 is chromosome 15 (j = 15). When you do it chr by chr it works because if you subset by taking a single chromosome you will have data for all the sampleID from that chromosome. :)

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

0

Entering edit mode

Good spot. interpolate.pcf() should have been able to deal with such a simple error like that, though. Like you said, a simple if / ifelse

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

0

Entering edit mode

I am not sure, I just tried adding the if and it is still throwing an error at the end:

Error in `[.getAnywhere`(cyto.data, , 1) : unused argument (1)

:/

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

1

Entering edit mode

Ok this can be fixed by changing the interpolate.pcf() function and adding this if clause:

  for (i in 1:nsamp) {
for (j in 1:length(chrom)) {
  fitij = segments[segments$sampleID==usamp[i] & segments$chrom==chrom[j],]
  if(nrow(fitij) > 0){ # added line to ignore cases in which we have no observations
    v = (c(fitij$start.pos[-1],10^9)+fitij$end.pos)/2
    xj = x[x[,1]==chrom[j],2]
    kj = rep(0,length(xj))
    for (k in rev(1:length(v))) {
      kj[xj <= v[k]] = k
    }
    z[z$chr==chrom[j],2+i] = fitij$mean[kj]
  }
}

}

This is the resulting plot. Plot

Does this look as it should be to you lien?

I will try to add the bugfix on their github. [edit] this is the edited version: https://github.com/selenocysteine/copynumber

Thanks Kevin for spotting the exact line which was throwing the error!!

ADD REPLY • link updated 6.5 years ago by GenoMax 141k • written 6.5 years ago by Selenocysteine ▴ 620

0

Entering edit mode

Hi I cannot see the resulting plot. But I assume it will look ok. Could you help me out a little bit more on how to use your code? How can I integrate it within the copynumber package in R? Sorry for this probably basic question.

ADD REPLY • link 6.5 years ago by lien ▴ 90

2

Entering edit mode

If you are using linux, it is pretty straightforward. There are multiple ways, I don't know all of them because I usually don't do this :D

Anyway, from the terminal:

1) create a directory anywhere on your computer and clone the git repos:

git clone https://github.com/selenocysteine/copynumber.git

(or download the ZIP from the website https://github.com/selenocysteine/copynumber and unzip it somewhere if you don't have git)

2) cd into the parent directory of the unzipped folder of the repository (e.g. /home/lien/Desktop/ if the folder is /home/lien/Desktop/copynumber)

3) type and enter: R CMD build copynumber

it will create a tar.gz file for the packages.

4) type and enter:

R CMD INSTALL "name_of_the_tar.gz_file"

[if it gives you an error, you have to run it with SUDO because your user is likely not to have access to the R packages directory]

5) from R:

library("copynumber")

should run the "updated" version

If you are using windows, try to look at "installing downloaded R packages" I guess?

Don't worry it is not such a basic question :D

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

1

Entering edit mode

You can also over-ride the current package function with your own function, as follows:

interpolate.pcf.new <- function (segments, x) 
{
    segments <- pullOutContent(res = segments, what = "segments")
    usamp = unique(segments$sampleID)
    nsamp = length(usamp)
    chrom = unique(x[, 1])
    z = data.frame(cbind(x[, c(1:2)], matrix(0, nrow(x), nsamp)))
    names(z) = c("chr", "pos", usamp)

    for (i in 1:nsamp)
    {
        for (j in 1:length(chrom))
        {
            fitij = segments[segments$sampleID==usamp[i] & segments$chrom==chrom[j],]

            if(nrow(fitij) > 0)
            { # added line to ignore cases in which we have no observations
                v = (c(fitij$start.pos[-1],10^9)+fitij$end.pos)/2
                xj = x[x[,1]==chrom[j],2]
                kj = rep(0,length(xj))

                for (k in rev(1:length(v)))
                {
                    kj[xj <= v[k]] = k
                }

                z[z$chr==chrom[j],2+i] = fitij$mean[kj]
            }
        }
    }

    return(z)
}

assignInNamespace("interpolate.pcf", interpolate.pcf.new, ns="copynumber")

plotFreq(segments=a, thres.gain=0.2, thres.loss=-0.1)

For this, you need to also load in the pullOutContent function though.

Good teamwork Selenocysteine! ;)

ADD REPLY • link 6.5 years ago by Kevin Blighe 87k

1

Entering edit mode

Irish bioinformatic collaboration at its best :D

ADD REPLY • link 6.5 years ago by Selenocysteine ▴ 620

2

Entering edit mode

Thanks a lot Kevin Blighe and Selenocysteine for all your work! It works perfectly now!

ADD REPLY • link 6.5 years ago by lien ▴ 90

0

Entering edit mode

Click on the "Plot" word above to see the plot. If your local firewall rules block free image hosting sites then you may not be able to see the plot.

ADD REPLY • link 6.5 years ago by GenoMax 141k

score 0 · Answer 1 · 2017-10-25

I see no issue in your code. My input text was copy/pasted from OP.

> test=read.csv("test.txt", sep="", header = T)

.

> test
        sampleID chrom arm start.pos  end.pos n.probes   mean
1  GC041544AR001     1   p    887078  1920182       10  3.658
2  GC043725AR013     1   p   1282158  2340523       10  4.884
3  GC041544AR001     1   p   2694253  7633559       10  5.635
4  GC041544AR001     1   p   7633560 10932033       10 12.726
5  GC041544AR001     1   p  11507192 14281220       10 24.825
6  GC041544AR001     1   p  15339635 16417326       10 14.902
7  GC041544AR001     1   p  17084389 18335685       10  8.939
8  GC041544AR001     1   p  18335686 21220367       10 25.313
9  GC041544AR001     1   p  21220368 25194985       10 16.479
10 GC041544AR001     1   p  25194986 26836784       10 14.579

.

> class(test)
[1] "data.frame"

.

> str(test)
'data.frame':   10 obs. of  7 variables:
 $ sampleID : Factor w/ 2 levels "GC041544AR001",..: 1 2 1 1 1 1 1 1 1 1
 $ chrom    : int  1 1 1 1 1 1 1 1 1 1
 $ arm      : Factor w/ 1 level "p": 1 1 1 1 1 1 1 1 1 1
 $ start.pos: int  887078 1282158 2694253 7633560 11507192 15339635 17084389 18335686 21220368 25194986
 $ end.pos  : int  1920182 2340523 7633559 10932033 14281220 16417326 18335685 21220367 25194985 26836784
 $ n.probes : int  10 10 10 10 10 10 10 10 10 10
 $ mean     : num  3.66 4.88 5.63 12.73 24.82 ...

.

> plotFreq(segments=test, thres.gain=0.2, thres.loss=-0.1)]

.

> sessionInfo()
R version 3.4.2 (2017-09-28)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Linux Mint 18.2

Matrix products: default
BLAS: /usr/lib/openblas-base/libblas.so.3
LAPACK: /usr/lib/libopenblasp-r0.2.18.so

locale:
 [1] LC_CTYPE=en_IN.UTF-8       LC_NUMERIC=C               LC_TIME=en_IN.UTF-8       
 [4] LC_COLLATE=en_IN.UTF-8     LC_MONETARY=en_IN.UTF-8    LC_MESSAGES=en_IN.UTF-8   
 [7] LC_PAPER=en_IN.UTF-8       LC_NAME=C                  LC_ADDRESS=C              
[10] LC_TELEPHONE=C             LC_MEASUREMENT=en_IN.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] copynumber_1.16.0   BiocGenerics_0.22.1

loaded via a namespace (and not attached):
 [1] zlibbioc_1.22.0         compiler_3.4.2          IRanges_2.10.5         
 [4] XVector_0.16.0          tools_3.4.2             GenomicRanges_1.28.6   
 [7] GenomeInfoDbData_0.99.0 RCurl_1.96-0            yaml_2.1.14            
[10] S4Vectors_0.14.7        GenomeInfoDb_1.12.3     bitops_1.0-6           
[13] stats4_3.4.2  
  [1]: https://ibb.co/k5FCFm