Question: How to remove duplicate rows and keep highest values only for a gene list with scores
lessismore740 wrote:

Dear all,

I have a two-column table of 20k lines. The first column is a list of gene IDs (some IDs are duplicated) and the second column is a score.
I want to rank my list so that each gene ID appears only once. For duplicated gene IDs, I want to keep only the row with the highest score.

Here is an example. Thanks in advance.

TMCS09g1008699  6.4
TMCS09g1008671  6.4
TMCS09g1008672  6.5
TMCS09g1008673  6
TMCS09g1008674  5.4
TMCS09g1008675  5.4
TMCS09g1008676  4.9
TMCS09g1008677  4.6
TMCS09g1008677  4.4
TMCS09g1008679  4.3
TMCS09g1008680  3.9
TMCS09g1008681  3.8
TMCS09g1008682  3.6
TMCS09g1008683  3.5
TMCS09g1008684  3.5
TMCS09g1008685  3.4
TMCS09g1008686  3.4
TMCS09g1008687  3.4
TMCS09g1008688  3
TMCS09g1008689  2.6
TMCS09g1008690  2
TMCS09g1008699  5.9
Tags: bash, R
modified 22 months ago by arta560 • written 22 months ago by lessismore740

arta560 wrote:

Suppose df is your data frame, id is the first column, and value is the second:

df <- df[order(df$id, -abs(df$value)), ]  # sort by id, then by decreasing absolute value
df <- df[!duplicated(df$id), ]            # keep the first row per id (largest absolute value)
modified 22 months ago • written 22 months ago by arta560
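
For comparison, the same sort-then-drop-duplicates idea can be sketched in Python. This is only a minimal sketch and not arta's code: the function name `keep_highest_sorted` and the list-of-tuples input are assumptions made for illustration, and it ranks by the raw score rather than by the absolute value.

```python
def keep_highest_sorted(rows):
    """Sort by gene ID, then by descending score, and keep the first row per ID.

    `rows` is assumed to be a list of (gene_id, score) tuples with numeric scores.
    """
    ordered = sorted(rows, key=lambda r: (r[0], -r[1]))
    seen = set()
    unique = []
    for gene_id, score in ordered:
        if gene_id not in seen:  # first occurrence is the highest score for this ID
            seen.add(gene_id)
            unique.append((gene_id, score))
    return unique


# A few rows taken from the example in the question.
rows = [
    ("TMCS09g1008699", 6.4),
    ("TMCS09g1008677", 4.6),
    ("TMCS09g1008677", 4.4),
    ("TMCS09g1008699", 5.9),
]
print(keep_highest_sorted(rows))
# [('TMCS09g1008677', 4.6), ('TMCS09g1008699', 6.4)]
```

The result comes back ordered by gene ID, so a final sort by score would still be needed to get a ranking.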

There's a mistake here: it keeps the lowest value.

modified 22 months ago • written 22 months ago by lessismore740

Can you try it again?

written 22 months ago by arta560

When I put it into a script like this:

test = commandArgs(trailingOnly=TRUE)

read.delim(test, header = FALSE, sep ="\t")

test[order(test$V1, -abs(test$V2) ), ] ### sort first
test[ !duplicated(test$V1), ]  ### Keep highest

I get partial output and then this error:

Error in test$V1 : $ operator is invalid for atomic vectors
Calls: order
Execution halted

Do you know what that means?

modified 22 months ago • written 22 months ago by lessismore740

Can you use test["V1"] instead of test$V1? You get this error because test is an atomic (non-recursive) vector, and the $ operator does not work on atomic vectors. You can find more info here.

modified 22 months ago • written 22 months ago by arta560

I did, and now I get this:

Error in abs(test["V2"]) : non-numeric argument to mathematical function
Calls: order
Execution halted
written 22 months ago by lessismore740

jared.andrews074.3k wrote:

This can definitely be done in bash/awk, but it would probably take me longer to figure out, and it's easy enough with Python.

import sys

in_file = sys.argv[1]
out_file = sys.argv[2]

genes = {}
with open(in_file) as f:
    for line in f:
        line = line.strip().split()
        gene_id = line[0]
        val = float(line[1])

        # Find and replace dups if necessary.
        if gene_id in genes:
            if val > genes[gene_id]:
                genes[gene_id] = val
        else:
            genes[gene_id] = val

out = open(out_file, "w")

# Actually print to output.
for x in genes:
    output = "\t".join([x, str(genes[x])])
    print(output, file = out)

out.close()
modified 22 months ago • written 22 months ago by jared.andrews074.3k
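
The question also asks for a ranked list, while a plain dict loop writes genes in first-seen order. Below is a minimal sketch of one way to add the ranking, assuming whitespace-separated two-column input as in the example. The function names `best_scores` and `ranked_lines` are hypothetical and not part of the script above.

```python
def best_scores(lines):
    """Return {gene_id: highest score} from an iterable of 'gene_id score' lines."""
    genes = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:  # skip blank or malformed lines
            continue
        gene_id, val = fields[0], float(fields[1])
        if gene_id not in genes or val > genes[gene_id]:
            genes[gene_id] = val
    return genes


def ranked_lines(genes):
    """Format the genes as tab-separated lines, highest score first."""
    ranked = sorted(genes.items(), key=lambda item: item[1], reverse=True)
    return ["\t".join([gene_id, str(val)]) for gene_id, val in ranked]


# A few rows taken from the example in the question.
example = [
    "TMCS09g1008699  6.4",
    "TMCS09g1008672  6.5",
    "TMCS09g1008677  4.6",
    "TMCS09g1008677  4.4",
    "TMCS09g1008699  5.9",
]
for out_line in ranked_lines(best_scores(example)):
    print(out_line)
```

Genes with tied scores stay in the order they were first seen, because Python's sort is stable.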

Sorry, I'm not proficient in Python. Assuming I want to run your script in a bash for loop over a list of files, could you tell me how to complete your script?

written 22 months ago by lessismore740

If you want a single output file for each input file (rather than one for many input files), you can just run the script several times from the command line for your files - which is particularly easy if you segregate them and they have a common file extension:

for x in *.txt; do
    python my_python_script.py "$x" "$x".out
done

I've edited my answer slightly to set up the input and output files.

written 22 months ago by jared.andrews074.3k
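
If you would rather keep the loop inside Python than in the shell, something like the following could work. It is only a sketch: the function names `dedupe_file` and `dedupe_directory` are hypothetical, and the `*.txt` pattern and `.out` suffix simply mirror the bash loop above.

```python
from pathlib import Path


def dedupe_file(in_path, out_path):
    """Write one 'gene_id<TAB>highest score' line per gene from in_path to out_path."""
    genes = {}
    with open(in_path) as f:
        for line in f:
            fields = line.split()
            if len(fields) < 2:  # skip blank or malformed lines
                continue
            gene_id, val = fields[0], float(fields[1])
            if gene_id not in genes or val > genes[gene_id]:
                genes[gene_id] = val
    with open(out_path, "w") as out:
        for gene_id, val in genes.items():
            out.write("{}\t{}\n".format(gene_id, val))


def dedupe_directory(directory, pattern="*.txt"):
    """Run dedupe_file on every matching file, writing <name>.out next to it."""
    written = []
    for in_path in sorted(Path(directory).glob(pattern)):
        out_path = in_path.with_name(in_path.name + ".out")
        dedupe_file(in_path, out_path)
        written.append(out_path)
    return written
```

Calling `dedupe_directory(".")` would process every `.txt` file in the current directory. The `.out` files do not match the `*.txt` pattern, so rerunning it does not pick up earlier outputs as inputs.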

I get this error:

  File "filter_unique_best_score.py", line 25
    print(output, file = out)
                       ^
SyntaxError: invalid syntax
modified 22 months ago • written 22 months ago by lessismore740

Are you using Python 3? If you're using Python 2, try adding this as the first line in the script: from __future__ import print_function.

written 22 months ago by jared.andrews074.3k

Unfortunately yes.

I did, and I get this:

Traceback (most recent call last):
  File "filter_unique_best_score.py", line 26, in <module>
    output = "\t".join(x, string(genes[x]))
NameError: name 'string' is not defined
written 22 months ago by lessismore740

Oh, that's my mistake from mixing languages too frequently: Python uses str for string conversions, not string. I've updated the code.

written 22 months ago by jared.andrews074.3k