Question: GenomicDataCommons request timeouts with cases() %>% ... %>% results_all()
mk (90) wrote:

I've been experimenting with the GenomicDataCommons package to run queries against the GDC API. Requests that return more than a certain number of records seem to time out, and I don't see a direct way around this with the piped syntax. Has anyone else had luck with this?

There is a results(size = n) method, but its syntax seems to allow access to only the first n records.

Here is an example query (should return 400-500 records):

library(GenomicDataCommons)
library(tibble)

proj <- 'TCGA-COAD'
case_data <- cases() %>%
  GenomicDataCommons::filter(~ project.project_id == proj) %>%
  GenomicDataCommons::expand('diagnoses') %>%
  results_all() %>%
  as_tibble()

Gives, after a few moments:

Error in is.response(x) : Internal Server Error (HTTP 500).
modified 5 months ago by Sean Davis (25k) • written 5 months ago by mk (90)

It runs here but there's no output (?):

require(GenomicDataCommons)
require(tibble)
case_data <- cases() %>%
   GenomicDataCommons::filter(~ project.project_id == proj) %>%
   GenomicDataCommons::expand('diagnoses') %>%
   results_all() %>%
   as_tibble()
case_data
# A tibble: 0 x 0


sessionInfo()
R version 3.5.2 (2018-12-20)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 16.04.5 LTS

Matrix products: default
BLAS: /usr/lib/atlas-base/atlas/libblas.so.3.0
LAPACK: /usr/lib/atlas-base/atlas/liblapack.so.3.0

locale:
 [1] LC_CTYPE=pt_BR.UTF-8       LC_NUMERIC=C              
 [3] LC_TIME=en_GB.UTF-8        LC_COLLATE=pt_BR.UTF-8    
 [5] LC_MONETARY=en_GB.UTF-8    LC_MESSAGES=pt_BR.UTF-8   
 [7] LC_PAPER=en_GB.UTF-8       LC_NAME=C                 
 [9] LC_ADDRESS=C               LC_TELEPHONE=C            
[11] LC_MEASUREMENT=en_GB.UTF-8 LC_IDENTIFICATION=C       

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] tibble_2.0.1             GenomicDataCommons_1.6.0 magrittr_1.5            
[4] arm_1.10-1               lme4_1.1-19              Matrix_1.2-15           
[7] MASS_7.3-51.1
written 5 months ago by Kevin Blighe (45k)

Thanks @Kevin Blighe. That was a sloppy cut/paste job on my part: I forgot to add the project to the filter() call. Edited above.

written 5 months ago by mk (90)
Sean Davis (25k), National Institutes of Health, Bethesda, MD, wrote:

I should clean up the documentation, but results_all() is a convenience wrapper that is not too smart: it simply tries to return all results in one trip to the server. This can fail for multiple reasons related to the size of the result set. The better approach (and the only one for large result sets) is to page through the results:

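library(GenomicDataCommons)
library(tibble)

proj <- 'TCGA-COAD'
# Build the query once and reuse it for every page
query = cases() %>%
    GenomicDataCommons::filter(~ project.project_id == proj) %>%
    GenomicDataCommons::expand('diagnoses')
# Total number of matching records (namespaced in case dplyr masks count())
total = query %>% GenomicDataCommons::count()
size = 50
# Fetch one page of `size` records per request
reslist = lapply(seq(1, total, size), function(start) {
    query %>%
        results(size = size, from = start) %>%
        as_tibble()
})
# bind_rows() comes from dplyr
case_data = dplyr::bind_rows(reslist)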

Unfortunately, the size parameter really requires trial-and-error to find the largest "working" setting since the results can vary quite significantly in volume. Instead, I usually just choose a smallish number like 50 or so and wait a few extra seconds. These calls can, in theory, be parallelized using something like BiocParallel to get really fancy (and introduce complexity).
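
To make the paging arithmetic concrete without touching the network, here is a minimal sketch in plain R. fetch_page() is a hypothetical stand-in for query %>% results(size = size, from = start): it slices a local data frame rather than calling the GDC API, and its 1-based from simply mirrors the snippet above. Treat it as an illustration of the loop, not as the package's implementation.

```r
# Hypothetical stand-in for one paged request: returns up to `size` rows
# starting at 1-based row `from`. It slices a local data frame instead of
# calling the GDC API.
fetch_page <- function(data, size, from) {
  last <- min(from + size - 1, nrow(data))
  data[from:last, , drop = FALSE]
}

# Fetch every page in turn and stack the pieces into one data frame.
fetch_all <- function(data, size = 50) {
  # seq(1, 0, by = size) is an error, so handle the empty case up front.
  if (nrow(data) == 0) return(data)
  starts <- seq(1, nrow(data), by = size)
  pages <- lapply(starts, function(start) fetch_page(data, size, start))
  do.call(rbind, pages)
}

# 120 made-up records give pages of 50, 50 and 20 rows.
fake <- data.frame(case_id = sprintf("case-%03d", 1:120),
                   stringsAsFactors = FALSE)
paged <- fetch_all(fake, size = 50)
nrow(paged)  # 120
```

The last page is simply shorter than the others, so no special handling is needed there. The empty-result guard may be worth copying into real code, since seq(1, 0, size) fails when a query matches nothing.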

modified 5 months ago • written 5 months ago by Sean Davis (25k)

@Sean Davis, thanks a lot for this info. Frankly this package is a life saver for me, since it addresses the need for general purpose GDC usage.

written 5 months ago by mk (90)
mk (90) wrote:

I have a fix for this, but it's not very satisfactory. Based on Sean Davis' blog post, I retrieved the sequencing data files and extracted the case IDs from those, then looped over the case IDs, fetching diagnosis data 50 records at a time. Why this worked is a mystery to me, since the query against the files() endpoint returns even more records than the query posted above (each case may contain multiple samples).

First get the files, and their associated cases:

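proj <- 'TCGA-COAD'
# Gene expression count files for primary tumor samples in the project
tm_ge_files = files() %>%
    GenomicDataCommons::filter(~ cases.samples.sample_type == 'Primary Tumor' &
                                 cases.project.project_id == proj &
                                 analysis.workflow_type == "HTSeq - Counts") %>%
    GenomicDataCommons::expand(c('cases', 'cases.samples')) %>%
    results_all() %>%
    as_tibble()
# Stack the per-file case tables, keeping the file id; bind_rows() is from dplyr
tm_cases = dplyr::bind_rows(tm_ge_files$cases, .id = 'file_id')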

Now get the diagnoses:

case_ids <- tm_cases$case_id
# Split the ids into consecutive chunks of at most 50. This avoids the
# off-by-one at the end of a manual left/right loop, which would request
# the last id twice.
chunks <- unname(split(case_ids, ceiling(seq_along(case_ids) / 50)))
# Fetch the diagnoses for each chunk, then stack them
clin <- do.call(rbind, lapply(chunks, function(ids) gdc_clinical(ids)$diagnoses))
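
A cheap sanity check for any batching scheme like this one is to confirm, before making any requests, that the batches cover every ID exactly once. The sketch below does that in plain R with made-up IDs. chunk_ids() is a hypothetical helper, not part of GenomicDataCommons.

```r
# Split a vector of ids into consecutive chunks of at most `size` elements.
chunk_ids <- function(ids, size = 50) {
  unname(split(ids, ceiling(seq_along(ids) / size)))
}

# Made-up ids, for illustration only.
ids <- sprintf("case-%03d", 1:120)
chunks <- chunk_ids(ids, size = 50)

lengths(chunks)  # 50 50 20
# Every id should be requested exactly once, in the original order.
stopifnot(identical(unlist(chunks), ids))
```

If the stopifnot() line fails, some ID would be fetched twice or skipped, which shows up later as duplicated or missing diagnosis rows.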
modified 5 months ago • written 5 months ago by mk (90)

Yes, I was just about to say that I reproduced the error and that you should report it on the Bioconductor forum (and link back to this thread), where Sean Davis may pick it up more quickly: https://support.bioconductor.org/t/Latest/

written 5 months ago by Kevin Blighe (45k)

OK, I posted a link on the Bioconductor forum. If the answer gets posted there, I'll update this thread.

written 5 months ago by mk (90)

Thanks - I'm a user there too, but with a much lower reputation score: https://support.bioconductor.org/u/16406/

written 5 months ago by Kevin Blighe (45k)