Question

Problem creating an expression matrix from a table

3.3 years ago

Lila M ★ 1.2k

Hi community, I have a problem creating a matrix from downloaded expression data. My data (xx) looks like this:

    sample_id      raw_read_count   gene_id          normalized_read_count
    CPCG0402-F1    2                "DDX11L1"        0.00680125380953093
    CPCG0402-F1    157              "WASH7P"         1.50386916339037
    CPCG0402-F1    0                "RP11-34P13.3"   0
    CPCG0402-F1    0                "FAM138A"        0
    CPCG0402-F1    0                "OR4G4P"         0
    CPCG0402-F     10               "OR4G11P"        0

Someone suggested converting my table into a matrix using this code:

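library(dplyr)   # select() and %>%
library(tidyr)   # pivot_wider()
library(tibble)  # column_to_rownames()

mat <- xx %>%
  select(!normalized_read_count) %>%
  pivot_wider(names_from = sample_id, values_from = raw_read_count) %>%
  column_to_rownames("gene_id") %>%
  as.matrix()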

which works perfectly for other data sets, but when I run it on this data I get the warning message:

Warning message:
Values are not uniquely identified; output will contain list-cols.
* Use `values_fn = list` to suppress this warning.
* Use `values_fn = length` to identify where the duplicates arise
* Use `values_fn = {summary_fun}` to summarise duplicates

and, of course, the output contains list-cols. I've tried unique() and distinct(), but they don't help. I'm also trying to convert "hgnc_symbol" to ensembl_gene_id using biomart, but it doesn't make any difference. Any suggestions? Thanks!!

RNA-Seq software error R expression • 1.3k views

ADD COMMENT • link updated 3.3 years ago by rpolicastro 13k • written 3.3 years ago by Lila M ★ 1.2k

Test the dplyr pipeline step by step. Where do the duplicates lie?

ADD REPLY • link 3.3 years ago by Ram 43k
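One way to locate the duplicates before pivoting is to count rows per (sample_id, gene_id) pair. The following is only a minimal sketch: the toy table and the names dup_pairs and n_rows are illustrative assumptions, not part of the original data.

```r
library(dplyr)

# Toy long-format table with the same column names as in the question.
# The values are made up for illustration.
xx <- tibble::tibble(
  sample_id      = c("S1", "S1", "S1", "S2", "S2"),
  gene_id        = c("WASH7P", "WASH7P", "DDX11L1", "WASH7P", "DDX11L1"),
  raw_read_count = c(157, 207, 2, 90, 0)
)

# Count rows per (sample, gene) pair. Any pair with more than one row
# is what makes pivot_wider() emit list-cols.
dup_pairs <- xx %>%
  count(sample_id, gene_id, name = "n_rows") %>%
  filter(n_rows > 1)

print(dup_pairs)
```

If dup_pairs is empty, the values are uniquely identified and the warning should have a different cause.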

In the pivot_wider step:

mat <- xx %>% pivot_wider(names_from=sample_id, values_from=raw_read_count)

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

Try adding id_cols=gene_id to the function.

ADD REPLY • link 3.3 years ago by Ram 43k

It does not work, but thank you for the suggestion.

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

Try adding names_repair="unique" to pivot_wider and then compare xx$sample_id with colnames(mat) to see what's going on. Maybe multiple different delimiters are being cleaned to . and get treated as duplicates.

ADD REPLY • link 3.3 years ago by Ram 43k

I found the problem, but I don't know how to solve it. Some gene_id values are duplicated within the same patient. For example, WASH7P has the values 157 and 207 for patient CPCG0402-F1. How can I solve this? Any clue?

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k
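If the source of the duplicates cannot be traced, one option hinted at by the warning itself is to collapse them with values_fn. The sketch below sums duplicated counts, which is purely an assumption. Whether sum, max, or something else is appropriate depends on why the duplicates exist. The toy table and the name mat_sum are illustrative, and passing a bare function to values_fn assumes tidyr 1.1.0 or later.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy table: WASH7P appears twice in sample S1 (values are made up).
xx <- tibble(
  sample_id      = c("S1", "S1", "S1", "S2", "S2"),
  gene_id        = c("WASH7P", "WASH7P", "DDX11L1", "WASH7P", "DDX11L1"),
  raw_read_count = c(157, 207, 2, 90, 0)
)

# Collapse duplicated (gene, sample) cells by summing them while pivoting.
mat_sum <- xx %>%
  pivot_wider(
    id_cols     = gene_id,
    names_from  = sample_id,
    values_from = raw_read_count,
    values_fn   = sum   # assumption: summing is acceptable for these duplicates
  ) %>%
  column_to_rownames("gene_id") %>%
  as.matrix()

print(mat_sum)
```

Using id_cols = gene_id also keeps other columns, such as normalized_read_count, from being treated as identifiers.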

You'll need to go back to where this data came from, because this sounds like an identifier mapping problem. Maybe these entries had different ENSG identifiers, one of which is in a canonical chromosome and the other(s) in patches/alt contigs.

ADD REPLY • link 3.3 years ago by Ram 43k

Yes, I know what the problem is, but there is nothing I can do because this is the only data for the study.

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

Ram · Answer 1 · 2021-02-05

3.3 years ago

rpolicastro 13k

There could be duplicates because of errors in the analysis, or because of a mapping problem such as multiple gene IDs mapping to the same gene name/symbol. If I were to guess, it looks like your gene IDs are actually gene names, so it is probably the latter problem.

If you can't figure out why there are duplicates, and if it's not possible to reanalyze the data with unique gene_ids as @Ram recommended, you can filter out any genes that appear more than once in your table. This is only a last resort effort though.

mat <- xx %>%
  select(!normalized_read_count) %>%
  group_by(gene_id) %>%
  filter(n() == 1) %>%
  ungroup %>%
  pivot_wider(names_from=sample_id, values_from=raw_read_count) %>%
  column_to_rownames("gene_id") %>%
  as.matrix

ADD COMMENT • link 3.3 years ago by rpolicastro 13k
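Note that in a long table with several samples, grouping by gene_id alone counts rows across all samples, so n() == 1 is false for every gene and the filter can return nothing. A possible per-sample variant is sketched below. It keeps only genes that have exactly one row in each sample they appear in. The toy table and the names by_gene and clean are illustrative assumptions.

```r
library(dplyr)
library(tidyr)
library(tibble)

# Toy table: two samples, WASH7P duplicated in S1 (values are made up).
xx <- tibble(
  sample_id      = c("S1", "S1", "S1", "S2", "S2"),
  gene_id        = c("WASH7P", "WASH7P", "DDX11L1", "WASH7P", "DDX11L1"),
  raw_read_count = c(157, 207, 2, 90, 0)
)

# Grouping by gene_id only: every gene occurs once per sample,
# so with two or more samples no group has exactly one row.
by_gene <- xx %>%
  group_by(gene_id) %>%
  filter(n() == 1) %>%
  ungroup()

# Keep genes whose row count equals their number of distinct samples,
# i.e. genes with no within-sample duplicates.
clean <- xx %>%
  group_by(gene_id) %>%
  filter(n() == n_distinct(sample_id)) %>%
  ungroup()

mat <- clean %>%
  select(sample_id, gene_id, raw_read_count) %>%
  pivot_wider(names_from = sample_id, values_from = raw_read_count) %>%
  column_to_rownames("gene_id") %>%
  as.matrix()

print(mat)
```

This still discards every gene with a within-sample duplicate, so it remains a last resort.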

Thank you for your comment. Reanalysis is not possible because this is downloaded data and I don't have the raw sequences. Unfortunately, your answer doesn't work this time :(

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

What do you mean by "it doesn't work"? It removes duplicates by the gene_id column so it should give you a result with no warnings. Do you mean to say that removing duplicates is not an acceptable trade-off?

ADD REPLY • link 3.3 years ago by Ram 43k

It creates an empty table.

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

Is there a duplicate for every gene? This will count the number of genes that are unique, duplicated, triplicated, etc.

gene_counts <- mat %>%
  add_count(gene_id) %>%
  count(n)

ADD REPLY • link updated 3.3 years ago by Ram 43k • written 3.3 years ago by rpolicastro 13k

Do you mean to operate on xx or mat?

ADD REPLY • link 3.3 years ago by Ram 43k

My bad, the original table is xx, and mat is the matrix that I can't get.

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

No, there is not a duplicate for every gene.

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k

I edited the code in my post, can you try again?

ADD REPLY • link 3.3 years ago by rpolicastro 13k

Yes, but I get the same result :(

ADD REPLY • link 3.3 years ago by Lila M ★ 1.2k