How to work with big data from ICGC
9 months ago
kinalimeric ▴ 40

Hi all,

I have downloaded 56 GB of data from ICGC and I am having trouble analyzing it in R (although I have access to a server). I know that ICGC data is stored in GDC. However, for analyzing big data in R, I have seen suggestions such as using SQL through dplyr.

Do you know of any way to do that? I would also like to know how you work with big data from databases such as ICGC and TCGA when you need to use R.

I downloaded the methylation data for 269 donors (meth_array.PACA-AU.tsv.gz) from here using wget:

I want to compare the mean methylation level of selected cg probes across patients.

I would appreciate any help!

9 months ago
zx8754 10k

We don't need the full dataset in R; subset it before importing, for example using data.table::fread + bash (not tested):

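library(data.table)

# use head meth_array.PACA-AU.tsv to find out which columns we need
# here we need columns 4 and 9
# 4=icgc_sample_id
# 6=probe_id
# 9=methylation_value

# read only columns 4 and 9 for probe cg00000029
# grep -w matches the probe ID as a whole word anywhere in the row; the header
# row is filtered out, so the column names are set explicitly
x <- fread(cmd = "grep -w 'cg00000029' meth_array.PACA-AU.tsv | cut -f4,9",
           col.names = c("icgc_sample_id", "methylation_value"))

# then aggregate as usual
x[ , .(myMean = mean(methylation_value)), by = icgc_sample_id]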
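
For several selected probes at once, the same "subset before importing" idea could be sketched in bash roughly as below (not tested on the real file). It assumes the file has a header row containing the names icgc_sample_id, probe_id and methylation_value, and it looks the columns up by name so the positions do not need to be known. The function name subset_probes and the file probes.txt (one probe ID per line) are made up for illustration.

```shell
# Hypothetical helper: stream a gzipped ICGC methylation TSV and keep only
# the rows whose probe_id is listed in a plain-text probe file.
# Assumes a header row with icgc_sample_id, probe_id and methylation_value.
subset_probes() {
  data="$1"    # gzipped, tab-separated methylation file
  probes="$2"  # text file with one probe ID per line (hypothetical)
  gzip -dc "$data" | awk -F'\t' -v OFS='\t' -v probes="$probes" '
    BEGIN {
      # load the wanted probe IDs into a lookup table
      while ((getline line < probes) > 0) keep[line] = 1
      close(probes)
    }
    NR == 1 {
      # header row: find the three columns by name and print a reduced header
      for (i = 1; i <= NF; i++) col[$i] = i
      s = col["icgc_sample_id"]; p = col["probe_id"]; v = col["methylation_value"]
      print $s, $p, $v
      next
    }
    ($p in keep) { print $s, $p, $v }   # data rows for the selected probes only
  '
}

# Example call (file names are placeholders):
# subset_probes meth_array.PACA-AU.tsv.gz probes.txt > meth_subset.tsv
```

The resulting meth_subset.tsv should be small enough to read with a plain fread("meth_subset.tsv"), and because it keeps its header, the aggregation can then group by both probe_id and icgc_sample_id.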

Thank you so much, this is really helpful and clear.


