Is there any script that can do this task?
Hi everyone.

I have a protein dataset like the one below.

Column 1 has the protein IDs and column 2 has the domain:

ABCD_peg_0001 wzz
ABCD_peg_0002 no domain
ABCD_peg_0003 wza
ABCD_peg_0004 no domain
PQRS_peg_0012 no domain
PQRS_peg_0013 wca
PQRS_peg_0014 wzc
PQRS_peg_0015 no domain


Each ID begins with the organism name, followed by a peg number that runs sequentially within each organism. The second column is the domain type.

I want the output to look like this:

ABCD_peg_0001 wzz
ABCD_peg_0002 no domain
ABCD_peg_0003 wza
---
PQRS_peg_0013 wca
PQRS_peg_0014 wzc


That is, it should print everything between two known domains and remove the rest.

But only if the known domains fall within the +/-10 range of each other.

If a known domain lies beyond that range, it should not be printed as part of the cluster. There should also be a separator line between clusters.

And if a known domain is alone, with no other known domain in that +/-10 range, only that one domain should be printed.
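
One way to read these rules is sketched below in Python. It is only a sketch under some assumptions that the post does not confirm: "+/-10" is taken to mean that two consecutive known domains of the same organism have peg numbers differing by at most 10, the input is already sorted by organism and peg number, and every ID has the simple `organism_peg_number` layout. The names `parse`, `clusters`, `format_clusters` and `WINDOW` are made up for this illustration.

```python
WINDOW = 10              # assumed meaning of "+/-10": peg numbers differ by <= 10
NO_DOMAIN = "no domain"  # label used for rows without a known domain


def parse(line):
    """Split 'ABCD_peg_0001 wzz' into (organism, peg number, domain)."""
    protein_id, domain = line.split(None, 1)
    organism, peg = protein_id.rsplit("_peg_", 1)
    return organism, int(peg), domain.strip()


def clusters(lines, window=WINDOW):
    """Group rows into clusters; each cluster is a list of the original lines."""
    result, current, pending = [], [], []
    last_org, last_peg = None, None
    for line in lines:
        text = line.rstrip("\n")
        if not text.strip():
            continue
        org, peg, domain = parse(text)
        if domain == NO_DOMAIN:
            # Kept only if another known domain follows within the window.
            pending.append(text)
            continue
        near = bool(current) and org == last_org and abs(peg - last_peg) <= window
        if near:
            current.extend(pending)  # rows lying between two known domains
        else:
            if current:
                result.append(current)
            current = []
        current.append(text)
        pending = []
        last_org, last_peg = org, peg
    if current:
        result.append(current)
    return result


def format_clusters(groups):
    """Join clusters with a '---' separator line."""
    return "\n---\n".join("\n".join(group) for group in groups)


example = [
    "ABCD_peg_0001 wzz",
    "ABCD_peg_0002 no domain",
    "ABCD_peg_0003 wza",
    "ABCD_peg_0004 no domain",
    "PQRS_peg_0012 no domain",
    "PQRS_peg_0013 wca",
    "PQRS_peg_0014 wzc",
    "PQRS_peg_0015 no domain",
]
print(format_clusters(clusters(example)))
```

With the eight example rows this prints the two clusters separated by `---`. Leading and trailing "no domain" rows are dropped because they are only ever added once a second known domain turns up within the window.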

Suppose somewhere the data looks like this:

XYZ_peg_0060 no domain
XYZ_peg_0061 no domain
XYZ_peg_0062 wzz
XYZ_peg_0063 no domain


The output will be:

XYZ_peg_0062 wzz


Sometimes the protein IDs look like this:

WXY_123_peg0012 wxa
WXY_123_peg0013 no domain
WXY_123_peg0014 wzz


In a few cases there will be an extra number before the peg number.

I have tried shell scripting with grep -A 10 -B 10, but it did not work. Please suggest a script that can do this.
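
Both ID layouts could be split with a single regular expression, as in the Python sketch below. It assumes the organism part is everything before the last `_peg`, and that the underscore after `peg` is optional (as in `peg0012` above); `ID_PATTERN` and `split_id` are names invented for this example.

```python
import re

# Organism prefix, then "_peg", an optional underscore, then the peg number.
ID_PATTERN = re.compile(r"^(?P<organism>.+?)_peg_?(?P<peg>\d+)$")


def split_id(protein_id):
    """Return (organism, peg number) for IDs like ABCD_peg_0001 or WXY_123_peg0012."""
    match = ID_PATTERN.match(protein_id)
    if match is None:
        raise ValueError(f"unrecognised protein ID: {protein_id!r}")
    return match.group("organism"), int(match.group("peg"))


print(split_id("ABCD_peg_0001"))    # ('ABCD', 1)
print(split_id("WXY_123_peg0012"))  # ('WXY_123', 12)
```

Keeping the extra number inside the organism part means IDs such as `WXY_123` and `WXY_124` would be treated as different organisms; whether that is the intended grouping is not stated in the post.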

Thank you

Hi, I guess it would be polite and fair to note that you already have a (near) perfect solution, which should be easily modifiable into what you want. Adding a line between different organisms should be trivial and, to be frank, you should at least put in that little effort and make an attempt first. Honestly, as I now notice that you might be mistaking Biostars for a site for free homework help or free freelance programming services: if you need more and want a custom-made script, you may contact me for support at my normal rate of 100 EUR/h. Cheers.