Duplicates in ICGC expression data
21 months ago
jack.henry ▴ 50

I am trying to read in some expression data from the ICGC, however I am having some trouble with duplicates.

First, I read in the data:

PACACASeq <- read.table("./CountMatrices/PACA_CA/exp_seq.tsv", sep = '\t', header = TRUE, stringsAsFactors = FALSE)

This gives me a long-format table with counts, sample IDs and gene IDs.

I then use reshape2 to try to convert this into a count matrix like so:

PACACASeqCounts <- dcast(PACACASeq, gene_id ~ icgc_sample_id, value.var = "raw_read_count")

But this generates the message:

Aggregation function missing: defaulting to length

This happens because some sample ID/gene ID combinations appear more than once, so dcast falls back to counting rows instead of using the read counts. I end up getting a matrix of 1's.
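For anyone hitting the same message, here is a minimal sketch of one way to inspect the duplicates and then cast with an explicit aggregation function. The toy data frame and its values are made up for illustration; only the column names come from the post. Summing the duplicates is an assumption, not a recommendation: depending on why the rows are duplicated, taking the mean or keeping a single row may be the better choice.

```r
library(reshape2)

# Toy stand-in for the ICGC table (hypothetical values, real column names).
toy <- data.frame(
  icgc_sample_id = c("SA1", "SA1", "SA1", "SA2", "SA2"),
  gene_id        = c("TP53", "TP53", "KRAS", "TP53", "KRAS"),
  raw_read_count = c(10, 5, 7, 3, 8),
  stringsAsFactors = FALSE
)

# Flag every row whose (gene, sample) pair occurs more than once,
# so the duplicates can be inspected before deciding how to combine them.
key <- toy[, c("gene_id", "icgc_sample_id")]
dup_rows <- duplicated(key) | duplicated(key, fromLast = TRUE)
print(toy[dup_rows, ])

# Passing fun.aggregate explicitly stops dcast from defaulting to length.
# sum is an assumption here; swap in mean or another function as appropriate.
counts <- dcast(toy, gene_id ~ icgc_sample_id,
                value.var = "raw_read_count", fun.aggregate = sum)
print(counts)
```

Looking at the flagged rows first (and at any other columns in the file that differ between them) should show whether the duplicates are true repeats or separate measurements that need to be combined.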

I was wondering if anyone has run into the same problem and how they solved it.

Thanks in advance.

RNA-Seq ICGC • 491 views

Hi Jack,

We are working with the same data and have run into exactly the same problem. Did you solve it? If so, could you tell us how?

Thank you very much in advance.

Best regards,


