Question: How to extract unique values between columns in R ?
0
gravatar for k.kathirvel93
4 months ago by
k.kathirvel93190
India
k.kathirvel93190 wrote:

I want to extract (as a file) only unique values from 6 columns( 1st column vs (2-6) column) in a data frame. The unique value from one column that does not present any other column. Thanks in advance.

    A         B         C         D          E       F

   12         15        18        55         27      13
   15         25        10        21         23      20
   20         18        14        25         15      25
   25         27        30        35         25      10
                        35                           15

The output should looks like this...

   A          B         C         D         E          F
  12          -         14        55        23        13
                        30        21
R gene • 339 views
ADD COMMENTlink modified 4 months ago by jomo018470 • written 4 months ago by k.kathirvel93190

Could you please elaborate the expected output, because it is bit confusing.

None of the record in output is maintaining the order as input, neither rowise nor columnwise.

ADD REPLYlink written 4 months ago by Nitin Narwade380

Sorry for the confusion i have modified the post now. Thanks.

ADD REPLYlink modified 4 months ago • written 4 months ago by k.kathirvel93190

Your output isn't a data-frame, it has different vector lengths for each entry. Could you confirm why 13 is in the output for B given that it was absent from the input; and why 35 was filtered out of column C given that it is absent from column A. You might be better to generate a data.frame with a column containinng the unique numbers, and a column indicating whether the number was observed in A .

ADD REPLYlink written 4 months ago by russhh4.2k
1
gravatar for jomo018
4 months ago by
jomo018470
jomo018470 wrote:

Assuming the input structure is cast-ed into a data frame (df) by filling NA.

uu = unlist(df)[ ! duplicated(unlist(df),fromLast=T)  &  ! duplicated(unlist(df))]

apply(df,2,FUN=function(x){r=rep(NA,nrow(df)) ; d=duplicated(c(x,uu),fromLast=T)[1:nrow(df)] ; r[d]=x[d] ; return( r[order(r)] )})

uu is a vector with unique values (exclusive unique as defined by OP). This requires !duplicated from left-to-right AND right-to-left.

apply returns items from each column which are "legal" (appear in uu). The final order(r) ensures NA are pushed down in each column of the resulting data frame.

ADD COMMENTlink modified 4 months ago • written 4 months ago by jomo018470
0
gravatar for Alex Reynolds
4 months ago by
Alex Reynolds27k
Seattle, WA USA
Alex Reynolds27k wrote:

I think a problem is that the answer to this is not unique. For instance, this is also a "correct" answer based on your criteria:

A       B       C       D       E       F
12      15      10      55      27      20
        25      14      21      23      13
                18      35
                30

What gets filtered can depend on the order in which elements are added to a set (and subsequently tested for membership).

In any case, with that caveat, here is a Python script you could use to possibly generate this kind of result:

#!/usr/bin/env python

import sys

lidx = 0
headers = []
allValues = set()
perColValues = None

# read input into sets
for line in sys.stdin:
    elems = line.rstrip().split('\t')
    if lidx == 0:
        headers = elems
        perColValues = { x: set() for x in headers }
    else:
        for elemIdx, elem in enumerate(elems):
            if len(elem) == 0:
                continue
            header = headers[elemIdx]
            if elem not in allValues:
                perColValues[header].add(elem)
            allValues.add(elem)
    lidx += 1

# pad set with blanks
maxCount = 0
for header in headers:
    l = len(perColValues[header])
    if maxCount < l:
        maxCount = l
paddedPerColValues = { x: [] for x in headers }
for header in headers:
    l = len(perColValues[header])
    paddedPerColValues[header].extend(list(perColValues[header]))
    paddedPerColValues[header].extend([''] * (maxCount - l))

# write output
sys.stdout.write('%s\n' % ('\t'.join(headers)))
for lidx in range(maxCount):
    sys.stdout.write('%s\n' % ('\t'.join([paddedPerColValues[x][lidx] for x in headers])))

Usage:

$ python uniquify.py < in.mtx > out.mtx
ADD COMMENTlink modified 4 months ago • written 4 months ago by Alex Reynolds27k
0
gravatar for cpad0112
4 months ago by
cpad011211k
India
cpad011211k wrote:
df1 %>% 
    gather(k,v) %>%
    mutate(k=as.factor(k))%>%
    na.omit() %>%
    group_by(v) %>%
    filter(n() == 1) %>%    
    group_by(k)%>%
    mutate(g = row_number()) %>%
    spread(k,v,drop = F, fill="") %>%
    select(-g) %>%
    as.data.frame()

   A B  C  D  E  F
1 12   14 55 23 13
2      30 21
ADD COMMENTlink modified 4 months ago • written 4 months ago by cpad011211k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1361 users visited in the last hour