Question

duplicates issues when trying to convert long to wide in R

0

Entering edit mode

5 months ago

happyday1661 • 0

library(dplyr)
library(tibble)
library(tidyr)

df <- test %>%
  mutate(row_id = model_name) %>%
  pivot_wider(names_from = gene_symbol, values_from = fpkm) ### 
Warning message:
Values from `fpkm` are not uniquely identified; output will contain list-cols.
• Use `values_fn = list` to suppress this warning.
• Use `values_fn = {summary_fun}` to summarise duplicates.
• Use the following dplyr code to identify duplicates.
  {data} %>%
  dplyr::group_by(model_name, row_id, gene_symbol) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

The above works, but df$gene is a list ; not as the normal data frame;

df$geneA

[[911]]
[1] 0.32

[[912]]
[1] 0.3

[[913]]
[1] 0.14

[[914]]
[1] 0.22

[[915]]
[1] 0.31

[[916]]
[1] 0

[[917]]
[1] 0.04

enter image description here

Thank you very much for all your guidance!

input data download the board cell line RNAseq data https://cellmodelpassports.sanger.ac.uk/downloads

R • 879 views

ADD COMMENT • link 5 months ago by happyday1661 • 0

1

Entering edit mode

There are likely duplicate values of gene_symbol. Run the code provided in the warning message to check if this is the case.

df %>%
  dplyr::group_by(model_name, row_id, gene_symbol) %>%
  dplyr::summarise(n = dplyr::n(), .groups = "drop") %>%
  dplyr::filter(n > 1L)

If there are duplicates you need to decide whether duplicates should be removed or made unique (by for example adding a suffix to each one).

ADD REPLY • link 5 months ago by rpolicastro 13k

0

Entering edit mode

Thank you very much for all your guidance!

It is indeed likely to be the gene_symbol "duplicates" issues. Say model_name1 have 20,000 genes detected, model_name2 have 22,000 genes detected, when transform the data frame, it may not automatically use the union of 20,000 genes + 22,000 genes...

All the best!

ADD REPLY • link 5 months ago by happyday1661 • 0

1

Entering edit mode

Gene names are a mess. My advise is to use geneID_geneName so Ensembl gene ID then then an underscore followed by the gene name. That is guaranteed to be unique, and if for plotting you need gene name you just do a quick regex to take the part trailing the underscore.

ADD REPLY • link 5 months ago by ATpoint 82k

0

Entering edit mode

Thank you very much! Very helpful guidance!Appreciate!

ADD REPLY • link 5 months ago by happyday1661 • 0

1

Entering edit mode

How many models do you have in this data? Creating wider table for each model separately could be another option to not to deal with the duplicate gene IDs.

ADD REPLY • link 5 months ago by bk11 ★ 2.4k

0

Entering edit mode

Thanks a lot!

It is 925 model_name, each with near 32,000 genes. I used the unique() to get the unique set, even there are warnings about the duplicates for fpkm values, I convert the list to normal values use unname(unlist(x)) function to df$genes, and it seems works...

Thank you very much! Appreciate!

ADD REPLY • link 5 months ago by happyday1661 • 0

score 1 · Answer 1 · 2023-11-09

1

Entering edit mode

5 months ago

bk11 ★ 2.4k

You can use reshape package to do it easily.

library(reshape2)

test_data=read.table("test_data.txt", h=T, sep="\t")
head(test_data)
  model_name gene_symbol fpkm
1      22RV1        A1BG 0.12
2      22RV1    A1BG-AS1 1.40
3      22RV1        A1CF 5.28
4      22RV1         A2M 1.57
5      22RV1     A2M-AS1 1.29
6      22RV1       A2ML1 0.11

reshape(test_data, idvar = "model_name", timevar = "gene_symbol", direction = "wide")

model_name fpkm.A1BG fpkm.A1BG-AS1 fpkm.A1CF fpkm.A2M fpkm.A2M-AS1 fpkm.A2ML1
1      22RV1      0.12           1.4      5.28     1.57         1.29       0.11

ADD COMMENT • link 5 months ago by bk11 ★ 2.4k

1

Entering edit mode

This doesn't address the original question about duplicate values, and the reshape2 library is generally deprecated in favor of pivot_wider and pivot_longer from tidyr.

ADD REPLY • link 5 months ago by rpolicastro 13k