Duplicates in ICGC expression data


4.8 years ago

jack.henry ▴ 50

I am trying to read in some expression data from the ICGC, but I am having trouble with duplicates.

Firstly I read in the data.

PACACASeq <- read.table("./CountMatrices/PACA_CA/exp_seq.tsv", sep = '\t', header = TRUE, stringsAsFactors = FALSE)

This gives me a long-format table with one row per read count, along with its sample ID and gene ID (the raw_read_count, icgc_sample_id and gene_id columns).

I then use reshape2 to try to convert this into a count matrix like so:

PACACASeqCounts <- dcast(PACACASeq, gene_id ~ icgc_sample_id, value.var = "raw_read_count")

But this generates the notification

Aggregation function missing: defaulting to length

This happens because some gene ID/sample ID combinations appear more than once, so dcast falls back to counting rows. I end up with a matrix of 1's instead of read counts.
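To make the cause concrete, here is a minimal sketch on a toy data frame (hypothetical values, not the real ICGC file) that lists the rows sharing a gene/sample pair and then passes an explicit aggregation function to dcast. Summing the duplicates is only an assumption for illustration; whether sum, mean, or keeping a single row is appropriate depends on why the duplicates exist in the first place.

```r
library(reshape2)

# Toy long-format data (hypothetical values) with one duplicated gene/sample pair
toy <- data.frame(
  gene_id        = c("G1", "G1", "G2", "G2", "G1"),
  icgc_sample_id = c("S1", "S2", "S1", "S2", "S1"),
  raw_read_count = c(10, 20, 30, 40, 5),
  stringsAsFactors = FALSE
)

# Flag every row whose gene/sample pair occurs more than once
key <- toy[, c("gene_id", "icgc_sample_id")]
is_dup <- duplicated(key) | duplicated(key, fromLast = TRUE)
dup_rows <- toy[is_dup, ]
print(dup_rows)

# Give dcast an explicit aggregation function so it does not default to length.
# Summing is an assumption here; pick whatever matches the source of the duplicates.
counts <- dcast(toy, gene_id ~ icgc_sample_id,
                value.var = "raw_read_count", fun.aggregate = sum)
print(counts)
```

Looking at dup_rows on the real table (including any columns that differ between the duplicated rows) should show where the duplicates come from before deciding how to combine them.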

I was wondering if anyone has run into the same problem and, if so, how they solved it.

Thanks in advance.

RNA-Seq ICGC • 968 views

ADD COMMENT • link updated 3.9 years ago by ssabroso • 0 • written 4.8 years ago by jack.henry ▴ 50


Hi Jack,

We are working with the same data and we have found exactly the same problem. Did you solve it? If so, could you tell us how?