Question: Finding number of duplicates in R
3.0 years ago by
nkabo10
nkabo10 wrote:

Hello, I have a list of gene names and several features; each row represents a gene and its properties. There are approximately 15000 rows and 11 columns. Some genes occur more than once (for example, there are 4 TP53 rows), and I want to find out how many times each gene name is duplicated so that I can use that value. Duplicated gene names appear on consecutive rows. As an example:

Gene name   rs_id   aa change
CASP7       xx      yy
TP53        zz      hh
TP53        ff      cc
TP53        bb      gg
WNT         aa      dd
WNT         qq      kk

I want to find the number of duplicates for each gene (4 for TP53 and 2 for WNT), and I also want to check the aa change for each duplicate. Is there a way to do this in R? Thanks in advance.

R • 24k views
ADD COMMENTlink modified 3.0 years ago by keith.hughitt260 • written 3.0 years ago by nkabo10

You can try the plyr library; see my post on the Bioconductor support site:

https://support.bioconductor.org/p/71837/#71839

ADD REPLYlink written 3.0 years ago by Benn7.5k

Thank you for your answer. I used the code below:

library(dplyr)
newdf <- df %>% group_by(ID) %>% mutate(replicate = seq(n()))

However, I want a single number per gene (for example, if a gene is repeated 6 times, the column should read 6,6,6,6,6,6, not 1,2,3,4,5,6). Could you suggest a way to do it?

ADD REPLYlink written 3.0 years ago by nkabo10

Try the count function from plyr.

?count
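
To get the group size repeated on every row (6,6,6,6,6,6 rather than 1,2,3,4,5,6), one option in base R is ave() with FUN = length. This is only a minimal sketch: the data frame and its column names (df, ID, rs_id, aa_change) are made up from the example in the question. With dplyr, replacing seq(n()) by n() inside mutate() should give the same column.

```r
# Toy data modelled on the example in the question (names are assumptions)
df <- data.frame(
  ID        = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  rs_id     = c("xx", "zz", "ff", "bb", "aa", "qq"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# For each row, store how many rows share its ID
df$replicate <- ave(seq_along(df$ID), df$ID, FUN = length)

print(df)
```

Each TP53 row gets the same value here, so the column can be used directly for filtering, e.g. df[df$replicate > 1, ].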
ADD REPLYlink written 3.0 years ago by Benn7.5k
3.0 years ago by
keith.hughitt260
United States
keith.hughitt260 wrote:

You can use the table function in R to get the count of each duplicated gene.

For example, if the gene IDs are stored in a column gene_id, you could do:

> dat <- data.frame(gene_id=sample(1:3, 20, replace=TRUE), other_col='foo')
> table(dat$gene_id)

1 2 3 
5 6 9 
> as.data.frame(table(dat$gene_id))
  Var1 Freq
1    1    5
2    2    6
3    3    9

This gives you a data.frame of the number of duplicates for each ID.
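
The same idea applied to gene names might look like the sketch below, which also keeps only the genes that occur more than once. The vector genes and the column names gene and n are placeholders chosen for illustration, not anything from the original data.

```r
# Gene names as in the example from the question (hypothetical vector)
genes <- c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT")

# Count how often each gene name occurs
counts <- table(genes)

# Keep only the genes that appear more than once
dup_counts <- counts[counts > 1]

# Convert the full table to a data.frame with readable column names
counts_df <- as.data.frame(counts, stringsAsFactors = FALSE)
names(counts_df) <- c("gene", "n")

print(dup_counts)
print(counts_df)
```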

Not sure what you mean by "check the aa change for each duplicate", but presumably you could just get a list of the unique gene IDs, and then use a for-loop to iterate over them, selecting all relevant rows, and performing some operation on each group of duplicates.
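
A rough sketch of that loop, assuming "checking" means collecting the aa changes seen for each gene. The data frame df and the columns gene and aa_change are assumed names built from the example in the question.

```r
# Toy data modelled on the example in the question (names are assumptions)
df <- data.frame(
  gene      = c("CASP7", "TP53", "TP53", "TP53", "WNT", "WNT"),
  aa_change = c("yy", "hh", "cc", "gg", "dd", "kk"),
  stringsAsFactors = FALSE
)

# Collect the aa changes for each gene, one list element per gene
aa_by_gene <- list()
for (g in unique(df$gene)) {
  rows <- df[df$gene == g, ]          # all rows for this gene
  aa_by_gene[[g]] <- rows$aa_change   # replace with whatever check is needed
}

# Number of distinct aa changes per gene
n_distinct_aa <- sapply(aa_by_gene, function(x) length(unique(x)))

print(aa_by_gene)
print(n_distinct_aa)
```

For a simple grouping like this, split(df$aa_change, df$gene) gives the same kind of per-gene list without an explicit loop, though the elements come out sorted by gene name rather than in order of first appearance.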

ADD COMMENTlink written 3.0 years ago by keith.hughitt260

Thank you for your answer. I will also try that one.

ADD REPLYlink written 3.0 years ago by nkabo10