Question

How to work with big data from ICGC


3.3 years ago

kinalimeric ▴ 40

Hi all,

I have downloaded 56 GB of data from ICGC and I am having trouble analyzing it in R (although I have access to a server). I know that ICGC data is stored in GDC. However, for analyzing big data in R, I see suggestions like using SQL through dplyr.

Do you know any way to do that? I would also like to know how you work with big data from databases such as ICGC and TCGA when you need to use R.

I downloaded methylation data for 269 donors (meth_array.PACA-AU.tsv.gz) from here using wget:

https://dcc.icgc.org/releases/current/Projects/PACA-AU

I want to compare the mean methylation level of selected cg probes across patients.

I would appreciate any help!

icgc bigdata sql R • 1.2k views

updated 3.3 years ago by zx8754 11k • written 3.3 years ago by kinalimeric ▴ 40

score 3 · Accepted Answer · 2021-01-06


3.3 years ago

zx8754 11k

We don't need the full dataset in R; subset it before importing, for example using data.table::fread + bash (not tested):

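library(data.table)

# run `head meth_array.PACA-AU.tsv` in the shell to find out which columns we need
# here we filter on column 6 and keep columns 4 and 9 (verify against your header)
# 4=icgc_sample_id
# 6=probe_id
# 9=methylation_value

# read only columns 4 and 9 for probe cg00000029
# grep -w matches the probe ID as a whole word anywhere in the line
# if the file is still gzipped, pipe it in: zcat meth_array.PACA-AU.tsv.gz | grep -w ...
x <- fread(cmd = "grep -w 'cg00000029' meth_array.PACA-AU.tsv | cut -f4,9",
           header = FALSE,  # the header row is dropped by grep, so name columns here
           col.names = c("icgc_sample_id", "methylation_value"))

# then aggregate as usual
x[ , .(myMean = mean(methylation_value)), by = icgc_sample_id]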
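
Since the question asks about several selected probes, the same idea can probably be extended by passing grep a file of probe IDs. Below is a minimal sketch (also not tested on the real file): it builds a small toy table so it can run on its own, and the column positions (4, 6, 9), the probe list, and the names `tsv_file`, `probe_file` and `probe_means` are assumptions for illustration only. It needs a Unix-like shell with grep and cut.

```r
library(data.table)

# Toy stand-in for meth_array.PACA-AU.tsv: 9 tab-separated columns with the
# sample ID in column 4, probe ID in column 6 and value in column 9
# (assumed layout; check the real header with head).
toy <- data.table(
  c1 = "D1", c2 = "PACA-AU", c3 = "SP1",
  icgc_sample_id = c("SA1", "SA1", "SA2", "SA2", "SA1", "SA2"),
  c5 = "x",
  probe_id = c("cg00000029", "cg00000108", "cg00000029",
               "cg00000108", "cg00000165", "cg00000165"),
  c7 = "a", c8 = "b",
  methylation_value = c(0.2, 0.8, 0.4, 0.6, 0.5, 0.9)
)
tsv_file <- tempfile(fileext = ".tsv")
fwrite(toy, tsv_file, sep = "\t")

# Hypothetical list of selected probes, written one ID per line
probes <- c("cg00000029", "cg00000108")
probe_file <- tempfile(fileext = ".txt")
writeLines(probes, probe_file)

# grep -F: fixed strings, -w: whole-word match, -f: read patterns from a file
cmd <- sprintf("grep -F -w -f %s %s | cut -f4,6,9",
               shQuote(probe_file), shQuote(tsv_file))
x <- fread(cmd = cmd, header = FALSE,
           col.names = c("icgc_sample_id", "probe_id", "methylation_value"))

# Mean methylation per probe across all samples
probe_means <- x[, .(myMean = mean(methylation_value, na.rm = TRUE)),
                 by = probe_id]
print(probe_means)
```

For the real data, replace `tsv_file` with the path to the downloaded file. Adding `icgc_sample_id` to `by` would give one mean per probe and sample instead.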