finding overlapping genes
4.9 years ago
Fatima ▴ 960

I need two lists: the lines that share no genes with any other line, and the lines that share at least one gene with another line. Each line can contain any number of comma-separated genes. I'd like to do this in awk, but I'm not very familiar with all the commands.

L1 ycjM,ycjN,ycjO,ycjP,ycjQ,ycjR,ycjS,ycjT,ycjU,ycjV,ymjB

L2 ydaS,ydaT, ydaU,ydaV,ydaW,rzpR

L3 ompn

L4 ycjX,ycjF

L5 ycjX,ycjF,tyrR

.................

Non-overlapping lines: L1 L2 L3

Overlapping lines: L4 L5

awk overlap • 1.2k views
4.9 years ago

One way to do this in awk is with associative arrays.

Associative arrays are data structures that store key-value pairs. They are called hashes in Perl and dictionaries in Python.
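As a minimal sketch of the idea (not part of the original answer), the snippet below uses one associative array, here called `count`, keyed by gene name. The two input lines are taken from the question's example; the array name and the final `sort` are illustrative choices.

```shell
# Count how often each gene name occurs across all lines.
# "count" is an associative array: the key is the gene name, the value is a tally.
result=$(printf 'ycjX,ycjF\nycjX,ycjF,tyrR\n' |
  awk -F',' '
    { for (i = 1; i <= NF; i++) count[$i]++ }             # one tally per gene
    END { for (gene in count) print gene, count[gene] }   # iteration order is unspecified
  ' | sort)
echo "$result"
```

The `for (gene in count)` loop visits keys in no guaranteed order, which is why the output is piped through `sort`.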

You could keep a pair of arrays: one to store the line number of an element, and another to flag whether or not that element was found on a previous line.

Perhaps you can modify the logic here to get the result you want:

$ awk -F',' 'BEGIN { n = 1; }{ for (i = 1; i <= NF; i++) { if ($i in ht) { print "non-unique element "$i" on line "ht[$i]" also found on line "NR; for (j in f) { if (f[j] == $i) f[j] = 0; } } else { ht[$i] = NR; f[n] = $i; n++; } } } END { for (i = 1; i < n; i++) { if (f[i] != 0) { print "unique element "f[i]" found on line "ht[f[i]]; } } }' genes.txt
non-unique element ycjX on line 4 also found on line 5
non-unique element ycjF on line 4 also found on line 5
unique element ycjM found on line 1
unique element ycjN found on line 1
unique element ycjO found on line 1
unique element ycjP found on line 1
unique element ycjQ found on line 1
unique element ycjR found on line 1
unique element ycjS found on line 1
unique element ycjT found on line 1
unique element ycjU found on line 1
unique element ycjV found on line 1
unique element ymjB found on line 1
unique element ydaS found on line 2
unique element ydaT found on line 2
unique element ydaU found on line 2
unique element ydaV found on line 2
unique element ydaW found on line 2
unique element rzpR found on line 2
unique element ompn found on line 3
unique element tyrR found on line 5
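If the goal is the per-line classification from the question rather than a per-gene report, the same idea can be adapted. The following is only a sketch under some assumptions: every data line starts with a label (such as `L1`) at the first column, genes follow separated by commas, spaces, or tabs, and gene names are compared case-sensitively. The array names `owner`, `shared`, and `order` are hypothetical, and the sample input is a shortened version of the question's data.

```shell
# Sample input in the format from the question: a line label, then genes.
input='L1 ycjM,ycjN,ymjB
L2 ydaS,ydaT, ydaU,rzpR
L3 ompn
L4 ycjX,ycjF
L5 ycjX,ycjF,tyrR'

result=$(printf '%s\n' "$input" | awk -F'[, \t]+' '
  NF < 2 { next }                      # skip blank lines and lines with no genes
  {
    order[++n] = $1                    # remember labels in input order
    for (i = 2; i <= NF; i++) {
      if ($i == "") continue           # ignore an empty field from a trailing separator
      if (!($i in owner)) {
        owner[$i] = $1                 # first line on which this gene appears
      } else if (owner[$i] != $1) {
        shared[owner[$i]] = 1          # the earlier line overlaps
        shared[$1] = 1                 # and so does the current line
      }
    }
  }
  END {
    for (k = 1; k <= n; k++) {
      if (order[k] in shared) over = over " " order[k]
      else clean = clean " " order[k]
    }
    print "Non-overlapping lines:" clean
    print "Overlapping lines:" over
  }')
echo "$result"
```

Because `owner` always points at the first line containing a gene, every later line with that gene flags both itself and that first line, so a gene repeated on three or more lines still marks all of them.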

Thank you :)

I modified it to this code, and it worked :)

awk -F '[,  \t]' 'BEGIN { n = 1; }{ for (i = 2; i <= NF; i++) { if ($i in ht) { print "non-unique element "$i" was found on "ht[$i]" and "$1" "; for (j in f) { if (f[j] == $i) f[j] = 0; } } else { ht[$i] = $1 ; f[n] = $i; n++; } } } END { for (i = 1; i < n; i++) { if (f[i] != 0) { print "unique element "f[i]" found on line "ht[f[i]]; } } }' |\

How can I tell awk to treat tab, comma, and space all as separators? I tried '[, \t]', but I guess it doesn't handle the whitespace correctly.


Perhaps use tr or sed to strip out tabs and spaces, e.g.:

$ sed 's/[[:space:]]//g' genes.txt | awk -F',' ...
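Another option, sketched here as a suggestion rather than something tested on the full data set, is to keep the bracket expression but add `+` so that a run of separators counts as a single delimiter. Without the `+`, a comma followed by a space delimits an empty field, which is probably why '[, \t]' looked as if it ignored whitespace. The sample line is taken from the question; the variable names are illustrative.

```shell
line='L2 ydaS,ydaT, ydaU'

# Without "+", the comma and the space after it enclose an empty field.
single=$(printf '%s\n' "$line" | awk -F'[, \t]' '{ print NF }')

# With "+", consecutive separators are treated as one delimiter.
run=$(printf '%s\n' "$line" | awk -F'[, \t]+' '{ print NF }')

echo "fields without +: $single, with +: $run"
```

This also keeps the line label separate from the first gene, whereas stripping every space would join the two into one field.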