Finding number of duplicates in R
Hello, I have a list of gene names with several features; each row represents a gene and its properties. There are approximately 15000 rows and 11 columns. Some of the genes appear more than once (for example, there are 4 TP53 rows), and I want to find out how many times each gene name is duplicated and use that value. Duplicated gene names are listed one under the other. As an example:

Gene name   rs_id   aa change
CASP7       xx      yy
TP53        zz      hh
TP53        ff      cc
TP53        bb      gg
WNT         aa      dd
WNT         qq      kk

I want to find the number of duplicates for each gene (4 for TP53 and 2 for WNT), and I also want to check the aa change for each duplicate. Is there a way to do it in R? Thanks in advance.

You can try the dplyr package; see my post on the Bioconductor support site:

https://support.bioconductor.org/p/71837/#71839

library(dplyr)
newdf <- df %>%
  group_by(ID) %>%
  mutate(replicate = seq(n()))

However, I want a single number per gene (for example, if a gene is repeated 6 times, the column should read 6,6,6,6,6,6, not 1,2,3,4,5,6). Could you suggest a way to do it?
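One way to get the group size repeated on every row is base R's ave() with FUN = length. This is only a minimal sketch: the data frame df and its column names gene and aa_change are made up here to mirror the example in the question, and n_dup is an arbitrary name for the new column.

```r
# Hypothetical data mirroring the example in the question
df <- data.frame(
  gene      = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# ave() applies FUN within each group of df$gene and recycles the
# single value returned by length() over every row of that group
df$n_dup <- ave(seq_along(df$gene), df$gene, FUN = length)

print(df)
```

With dplyr, replacing seq(n()) by n() in the mutate() call should give the same kind of column.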

Try the count function from plyr.

?count

You can use the table function in R to get the count of each duplicated gene.

For example, if the gene IDs are stored in a column gene_id, you could do:

> dat <- data.frame(gene_id=sample(1:3, 20, replace=TRUE), other_col='foo')
> table(dat$gene_id)

1 2 3
5 6 9
> as.data.frame(table(dat$gene_id))
  Var1 Freq
1    1    5
2    2    6
3    3    9


This gives you a data.frame of the number of duplicates for each ID.

Not sure what you mean by "check the aa change for each duplicate", but presumably you could just get a list of the unique gene IDs, and then use a for-loop to iterate over them, selecting all relevant rows, and performing some operation on each group of duplicates.
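As a rough sketch of that idea, split() can group the aa change values by gene, after which a loop can go over only the genes that occur more than once. The data frame df and the column names gene, rs_id and aa_change are assumptions made up to match the example in the question, and printing the values stands in for whatever check is actually needed.

```r
# Hypothetical data mirroring the example in the question
df <- data.frame(
  gene      = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  rs_id     = c("xx", "zz", "ff", "bb", "aa", "qq"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# One character vector of aa changes per gene
aa_by_gene <- split(df$aa_change, df$gene)

# Keep only the genes that occur more than once
dup_genes <- aa_by_gene[lengths(aa_by_gene) > 1]

# Iterate over the duplicated genes and inspect their aa changes
for (gene in names(dup_genes)) {
  cat(gene, ":", paste(dup_genes[[gene]], collapse = ", "), "\n")
}
```

Splitting the whole data frame with split(df, df$gene) instead would keep the other columns available inside the loop.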

