Question: how to write a script to grep the lowest e-value for each gene?
0
gravatar for Alexie Li
3.6 years ago by
Alexie Li20
Alexie Li20 wrote:

Hi all,

I have a file containing blast hits for each gene, one hit per line, as following

ID  E-value

g1 0.0

g1 1e-20

g1 1e-5

g2 1e-5

g2 1e-13

g3 0.01

I would like to grep the lowest e-value for each gene, how to write a script to make it?

Thank you!

Alexie

linux script • 1.1k views
ADD COMMENTlink modified 3.6 years ago by venu6.8k • written 3.6 years ago by Alexie Li20
1

Why not just do the blast over with -max_target_seqs 1?

ADD REPLYlink written 3.6 years ago by st.ph.n2.6k

Actually, they are a merged file of results generated by different methods, I want to check which method generally give the best results or more reliable.

ADD REPLYlink written 3.6 years ago by Alexie Li20

my data: blast_scores.txt

ID  E-value
g1  0.0
g1  1e-20
g1  1e-5
g2  1e-5
g2  1e-13
g3  0.01

code:

$ datamash -H -g 1 min 2 < blast_scores.txt

output:

$ datamash -H -g 1 min 2 < blast_scores.txt 
GroupBy(ID) min(E-value)
g1  0
g2  1e-13
g3  0.01

datamash is command line tool from GNU

ADD REPLYlink modified 3.6 years ago • written 3.6 years ago by cpad011215k
1
gravatar for st.ph.n
3.6 years ago by
st.ph.n2.6k
Philadelphia, PA
st.ph.n2.6k wrote:

Assuming input file is tab-delimited. If not, reformat, or change the code below. Save code as sort_eval.py and run as python sort_eval.py file.txt

#!/usr/bin/env python
import sys
from collections import defaultdict

evalue = defaultdict(list)

with open(sys.argv[1], 'r') as f:
        next(f)
        for line in f:
                evalue[line.strip().split('\t')[0]].append(line.strip().split('\t')[1])

for i in sorted(evalue):
        print i, '\t', sorted(evalue[i])[0]

Output:

g1      0.0
g2      1e-13
g3      0.01
ADD COMMENTlink modified 3.6 years ago • written 3.6 years ago by st.ph.n2.6k
1
gravatar for venu
3.6 years ago by
venu6.8k
Germany
venu6.8k wrote:

Something like this

show contents | remove header | sort based on first column followed by second | keep rows based on first occurrence of first column

code

cat test.test | grep -v ID | sort -k1,1 -k2,2n | awk '!a[$1]++'

result

g1 0.0
g2 1e-13
g3 0.01

P.S. You can directly pass input file to grep i.e. grep -v ID text.txt | ...

ADD COMMENTlink written 3.6 years ago by venu6.8k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 2934 users visited in the last hour
_