I have a protein dataset like the one below.
Column 1 has the protein ID and column 2 has the domain:
ABCD_peg_0001 wzz
ABCD_peg_0002 no domain
ABCD_peg_0003 wza
ABCD_peg_0004 no domain
PQRS_peg_0012 no domain
PQRS_peg_0013 wca
PQRS_peg_0014 wzc
PQRS_peg_0015 no domain
Each ID starts with the organism name, followed by a peg number that runs sequentially within each organism. The second column is the domain type.
I want the output to look like this:
ABCD_peg_0001 wzz
ABCD_peg_0002 no domain
ABCD_peg_0003 wza
---
PQRS_peg_0013 wca
PQRS_peg_0014 wzc
That is, it should print everything between two known domains (including the "no domain" rows in between) and remove the rest,
but only if the known domains fall within +/-10 peg positions of each other.
A known domain beyond that range should not be printed as part of the cluster, and there should be a separator line between clusters.
If a known domain is alone, with no other known domain within that +/-10 range, only that one row should be printed.
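To make the rule above concrete, here is a rough Python sketch of the logic I have in mind. It is only an illustration, not something I have run on my real data. The names `find_clusters`, `ID_RE`, `window` and `unknown` are made up for this example. It assumes the file is already sorted by organism and peg number, that the ID and the domain are separated by whitespace, and that "+/-10" means the gap in peg number between one known domain and the next known domain of the same organism.

```python
import re

# Assumed ID layout: <organism>_peg_<number> or <organism>_peg<number>.
ID_RE = re.compile(r"^(.*?)_?peg_?(\d+)$")


def find_clusters(lines, window=10, unknown="no domain"):
    """Return output lines: clusters of known domains, separated by '---'."""
    rows = []   # (organism, peg number, original line text)
    known = []  # indices into rows that carry a known domain
    for line in lines:
        parts = line.strip().split(None, 1)  # ID first, rest is the domain
        if len(parts) != 2:
            continue  # skip blank or incomplete lines
        match = ID_RE.match(parts[0])
        if match is None:
            continue  # skip IDs that do not fit the assumed layout
        if parts[1].strip() != unknown:
            known.append(len(rows))
        rows.append((match.group(1), int(match.group(2)), line.strip()))

    clusters = []  # each cluster is [first_index, last_index] into rows
    for i in known:
        org, num, _ = rows[i]
        if clusters:
            prev_org, prev_num, _ = rows[clusters[-1][1]]
            if org == prev_org and 0 <= num - prev_num <= window:
                clusters[-1][1] = i  # close enough: extend current cluster
                continue
        clusters.append([i, i])  # otherwise start a new cluster

    out = []
    for first, last in clusters:
        if out:
            out.append("---")  # separator between clusters
        # Keep every row from the first to the last known domain.
        out.extend(text for _, _, text in rows[first:last + 1])
    return out


sample = """ABCD_peg_0001 wzz
ABCD_peg_0002 no domain
ABCD_peg_0003 wza
ABCD_peg_0004 no domain
PQRS_peg_0012 no domain
PQRS_peg_0013 wca
PQRS_peg_0014 wzc
PQRS_peg_0015 no domain""".splitlines()

print("\n".join(find_clusters(sample)))
```

With the sample rows this prints the ABCD block (0001 to 0003), a `---` line, and then the PQRS block (0013 to 0014). Chaining on the gap to the previous known domain means a long run of known domains can span more than 10 pegs in total; if the range should instead be measured from the first domain of a cluster, the comparison would need to use that row.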
Suppose somewhere it's like
XYZ_peg_0060 no domain
XYZ_peg_0061 no domain
XYZ_peg_0062 wzz
XYZ_peg_0063 no domain
Output will be
XYZ_peg_0062 wzz
Sometimes the protein IDs look like this:
WXY_123_peg0012 wxa
WXY_123_peg0013 no domain
WXY_123_peg0014 wzz
In a few cases there is an extra number before the peg number.
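Because of these two ID styles, the organism part and the peg number cannot be split on a fixed underscore position. A small Python sketch of how I imagine the split could work is below. The function name `split_id` and the pattern are my own guesses, and they assume the peg number is always the trailing run of digits after the literal text "peg".

```python
import re

# Assumption: the organism is everything before "peg", and the peg number
# is the final run of digits, with or without an underscore after "peg".
ID_RE = re.compile(r"^(.*?)_?peg_?(\d+)$")


def split_id(protein_id):
    """Split a protein ID into (organism, peg number as int)."""
    match = ID_RE.match(protein_id)
    if match is None:
        raise ValueError(f"unexpected ID format: {protein_id!r}")
    return match.group(1), int(match.group(2))


for pid in ("ABCD_peg_0001", "WXY_123_peg0012"):
    print(pid, "->", split_id(pid))
# ABCD_peg_0001 -> ('ABCD', 1)
# WXY_123_peg0012 -> ('WXY_123', 12)
```

Keeping the extra number (here `123`) inside the organism part means two IDs are only compared when that number matches too.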
I tried shell scripting with
grep -A 10 -B 10, but it did not work. Please suggest an approach.