Question: GenomicDataCommons request timeouts cases() %>% ... %>% results_all()
mk wrote (28 days ago):

I've been experimenting with the GenomicDataCommons package to query the GDC API. For some reason, requests above a certain size time out, and there doesn't seem to be a direct way around this using the piped syntax. Has anyone else had luck with this?

There is a results(size = n) method, but its syntax seems to allow access to only the first n records.

Here is an example query (should return 400-500 records):

proj <- 'TCGA-COAD'
case_data <- cases() %>%
  GenomicDataCommons::filter(~ project.project_id == proj) %>%
  GenomicDataCommons::expand('diagnoses') %>%
  results_all() %>%
  as_tibble()

Gives, after a few moments:

Error in is.response(x) : Internal Server Error (HTTP 500).
modified 27 days ago by Sean Davis • written 28 days ago by mk

It runs here but there's no output (?):

require(GenomicDataCommons)
require(tibble)
case_data <- cases() %>%
   GenomicDataCommons::filter(~ project.project_id == proj) %>%
   GenomicDataCommons::expand('diagnoses') %>%
   results_all() %>%
   as_tibble()
case_data
# A tibble: 0 x 0


sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.0.1             GenomicDataCommons_1.6.0 magrittr_1.5            
[4] arm_1.10-1               lme4_1.1-19              Matrix_1.2-15           
[7] MASS_7.3-51.1
written 28 days ago by Kevin Blighe

Thanks @Kevin Blighe. That was a sloppy cut/paste job: I forgot to add the project to the filter() method. Edited above.

written 28 days ago by mk
Sean Davis (National Institutes of Health, Bethesda, MD) wrote (27 days ago):

I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart: it simply tries to return all results in one trip to the server. This can fail for several reasons related to the size of the result set. The better approach (and the only one for large result sets) is to page through the results:

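library(GenomicDataCommons)  # cases(), results(), count()
library(magrittr)            # %>%

proj <- 'TCGA-COAD'
query = cases() %>%
    GenomicDataCommons::filter(~ project.project_id == proj) %>%
    GenomicDataCommons::expand('diagnoses')
# Total number of matching records; namespaced so dplyr::count cannot mask it
total = query %>% GenomicDataCommons::count()
size = 50
# Fetch one page of `size` records per request
reslist = lapply(seq(1, total, size), function(start) {
    query %>%
        results(size = size, from = start) %>%
        tibble::as_tibble()
})
case_data = dplyr::bind_rows(reslist)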

Unfortunately, the size parameter really requires trial-and-error to find the largest "working" setting since the results can vary quite significantly in volume. Instead, I usually just choose a smallish number like 50 or so and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity).
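
The trial-and-error over page size can also be automated. What follows is only a minimal sketch in base R, not part of the package: fetch_all_pages() and its fetch_page(from, size) callback are hypothetical names, and fake_fetch() is a stand-in for a server that rejects large pages. The idea is to retry the same offset with half the page size whenever a request fails.

```r
# Page through a result set, halving the page size whenever a request fails.
# `fetch_page(from, size)` is a hypothetical callback supplied by the caller;
# it must return a data frame with up to `size` records starting at `from`.
fetch_all_pages <- function(fetch_page, total, size = 50, min_size = 1) {
  pages <- list()
  from <- 1  # 1-based start index, matching the paging loop above
  while (from <= total) {
    n <- min(size, total - from + 1)
    page <- tryCatch(fetch_page(from, n), error = function(e) NULL)
    if (is.null(page)) {
      if (size <= min_size) stop("request failed even at the minimum page size")
      size <- max(min_size, size %/% 2)  # back off, then retry the same offset
      next
    }
    pages[[length(pages) + 1]] <- page
    from <- from + n
  }
  do.call(rbind, pages)
}

# Stand-in for a server that fails on pages larger than 20 records.
fake_fetch <- function(from, size) {
  if (size > 20) stop("Internal Server Error (HTTP 500)")
  data.frame(id = seq(from, length.out = size))
}

all_rows <- fetch_all_pages(fake_fetch, total = 95, size = 50)
nrow(all_rows)
```

With the real package, the callback would presumably wrap the call from the loop above, for example function(from, size) query %>% results(size = size, from = from) %>% as_tibble(). For results with nested columns, combining pages with bind_rows() as above may behave better than rbind().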

modified 27 days ago • written 27 days ago by Sean Davis

@Sean Davis, thanks a lot for this info. Frankly this package is a life saver for me, since it addresses the need for general purpose GDC usage.

written 14 days ago by mk
mk wrote (28 days ago):

I have a fix for this, but it's not very satisfactory. Based on Sean Davis' blog post, I retrieved sequencing data and extracted the case IDs from that, then looped over the case IDs, fetching diagnosis data 50 records at a time. Why this worked is a mystery to me, since the query against the files() endpoint returns even more records than the query posted above (each case may contain multiple samples).

First get the files, and their associated cases:

proj <- 'TCGA-COAD'
tm_ge_files = files() %>%
  GenomicDataCommons::filter(~ cases.samples.sample_type == 'Primary Tumor' &
                               cases.project.project_id == proj &
                               analysis.workflow_type == "HTSeq - Counts") %>%
  expand(c('cases', 'cases.samples')) %>%
  results_all() %>%
  as_tibble()
tm_cases = bind_rows(tm_ge_files$cases, .id = 'file_id')

Now get the diagnoses:

# De-duplicate first: a case can appear once per file
case_ids <- unique(tm_cases$case_id)
# Split the IDs into chunks of at most 50, so each ID is requested exactly once
# (the earlier while loop fetched the last case a second time)
chunks <- split(case_ids, ceiling(seq_along(case_ids) / 50))
clin <- do.call(rbind, lapply(unname(chunks), function(ids) {
  gdc_clinical(ids)$diagnoses
}))
modified 28 days ago • written 28 days ago by mk

Yes, I was just about to say that I reproduced the error and that you should report on the Bioconductor forum (and link back to this thread), where Sean Davis may pick it up more quickly: https://support.bioconductor.org/t/Latest/

written 28 days ago by Kevin Blighe

Ok, I threw up a link on Bioconductor forum. In case the answer gets posted there I'll update this thread.

written 28 days ago by mk

Thanks - I'm a user there too but much less reputation score: https://support.bioconductor.org/u/16406/

written 28 days ago by Kevin Blighe