Hi community,
I have a problem creating a matrix from downloaded expression data.
My data (xx) looks like this:
sample_id raw_read_count gene_id normalized_read_count
CPCG0402-F1 2 "DDX11L1" 0.00680125380953093
CPCG0402-F1 157 "WASH7P" 1.50386916339037
CPCG0402-F1 0 "RP11-34P13.3" 0
CPCG0402-F1 0 "FAM138A" 0
CPCG0402-F1 0 "OR4G4P" 0
CPCG0402-F1 0 "OR4G11P" 0
Someone suggested converting my table into a matrix using this code:
    mat <- xx %>%
      select(!normalized_read_count) %>%
      pivot_wider(names_from = sample_id, values_from = raw_read_count) %>%
      column_to_rownames("gene_id") %>%
      as.matrix()
This works perfectly for other data sets, but when I run it on this data I get the following warning message:
Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates
and, of course, the output contains list-cols.
I've tried unique() and distinct(), but neither works.
I also tried converting hgnc_symbol to ensembl_gene_id using biomaRt, but it doesn't make any difference. Any suggestions? Thanks!
Test the dplyr pipeline step by step. Where do the duplicates lie?
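One way to see where the duplicates lie is to count rows per `gene_id`/`sample_id` pair before pivoting. The sketch below uses a small toy table, not the real data: it keeps only the columns needed and takes the two `WASH7P` values for `CPCG0402-F1` from later in this thread. Note that `distinct()` only drops rows that are identical in every column, so it cannot remove pairs whose counts differ.

```r
library(dplyr)
library(tibble)

# Toy stand-in for xx (assumption: same column names as in the question,
# normalized_read_count omitted because it is not needed here).
xx <- tribble(
  ~sample_id,    ~raw_read_count, ~gene_id,
  "CPCG0402-F1", 2,               "DDX11L1",
  "CPCG0402-F1", 157,             "WASH7P",
  "CPCG0402-F1", 207,             "WASH7P",
  "CPCG0402-F1", 0,               "FAM138A"
)

# Count how many rows share each gene/sample combination and keep
# only the combinations that occur more than once.
dups <- xx %>%
  count(gene_id, sample_id, name = "n_rows") %>%
  filter(n_rows > 1)

dups
```

Any row returned in `dups` is a gene/sample pair that `pivot_wider()` cannot place in a single cell, which is what triggers the list-cols warning.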
In the `pivot_wider` step.
Try adding `id_cols = gene_id` to the function.
It does not work, but thank you for the suggestion.
Try adding `names_repair = "unique"` to `pivot_wider` and then compare `xx$sample_id` with `colnames(mat)` to see what's going on. Maybe multiple different delimiters are being cleaned to `.` and get treated as duplicates.
I found the problem, but I don't know how to solve it. Some `gene_id` values appear more than once for the same patient. For example, `WASH7P` takes the values 157 and 207 for patient `CPCG0402-F1`. How can I solve this? Any clue?
You'll need to go back to where this data came from, because this sounds like an identifier mapping problem. Maybe these entries had different ENSG identifiers, one of which is in a canonical chromosome and the other(s) in patches/alt contigs.
Yes, I know what the problem is, but there is nothing I can do because this is the only data for the study.
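If the source data really cannot be fixed, the warning message itself points at a fallback: pass a summary function through `values_fn` so that duplicated gene/sample pairs collapse to one value. The sketch below sums the duplicates; that choice is an assumption, not a recommendation from this thread, and whether summing, taking the maximum, or dropping the non-canonical entries is appropriate depends on how the identifiers were mapped upstream. The table is toy data, and the sample `CPCG0402-F2` and its counts are hypothetical.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy stand-in for xx; CPCG0402-F2 and its counts are made up for illustration.
xx <- tribble(
  ~sample_id,    ~raw_read_count, ~gene_id,
  "CPCG0402-F1", 2,               "DDX11L1",
  "CPCG0402-F1", 157,             "WASH7P",
  "CPCG0402-F1", 207,             "WASH7P",
  "CPCG0402-F2", 5,               "DDX11L1",
  "CPCG0402-F2", 90,              "WASH7P"
)

mat <- xx %>%
  pivot_wider(
    id_cols     = gene_id,
    names_from  = sample_id,
    values_from = raw_read_count,
    values_fn   = sum,  # assumption: duplicated entries are summed
    values_fill = 0     # genes missing from a sample get a count of 0
  ) %>%
  column_to_rownames("gene_id") %>%
  as.matrix()

mat
```

With a summary function in place every cell holds a single number, so `as.matrix()` returns a plain numeric matrix instead of one containing list-cols.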