Interpro results analysis
5.2 years ago

Hi all,

I searched the net for a solution to my problem, unfortunately with no luck so far. I want to deduplicate a tab-delimited file on more than one column: if the column I deduplicate on has no value, the line should be kept, but lines that do have a value should appear only once per value.

I have tried the following:

sort -u -k 5,5 input file| awk '!seen[$12]++'| grep 'IPR013087'

but here I lose the lines that have nothing in the 12th column. Another attempt:

sort -u -k 5,5 Acropora_digitifera_protein.fasta.tsv| awk -F "\t" '{if ($12=="") print $0; else; !seen[$12]++}'| grep 'IPR013087'

Here it looks like "!seen[$12]++" does nothing and the output is empty. :( I want to keep only the lines that are unique by the 5th column and by the 12th column, except that lines with no value in the 12th column should always be kept.

In more detail:

My data set:

    ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6    427 Pfam    PF00096 Zinc finger, C2H2 type  328 350 3.2E-5  T   14-02-2019  IPR013087   Zinc finger C2H2-type 
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6    427 SMART   SM00355 356 378 5.5E-5  T   14-02-2019  IPR013087   Zinc finger C2H2-type 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    MobiDBLite  mobidb-lite consensus disorder prediction   646 688 -   T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    Gene3D  G3DSA:3.90.70.10    88  176 3.5E-13 T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    Gene3D  G3DSA:3.90.70.10    195 496 2.0E-66 T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    Gene3D  G3DSA:3.10.20.90    964 1044    1.5E-5  T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96  492 1.9E-39 T   14-02-2019  IPR001394   Peptidase C19, ubiquitin carboxyl-terminal hydrolase 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    MobiDBLite  mobidb-lite consensus disorder prediction   130 181 -   T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    CDD cd02668 Peptidase_C19L  97  493 5.37034E-140    T   14-02-2019  IPR033841   Ubiquitin-specific peptidase 48 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    MobiDBLite  mobidb-lite consensus disorder prediction   933 974 -   T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    MobiDBLite  mobidb-lite consensus disorder prediction   944 961 -   T   14-02-2019 
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7    1063    MobiDBLite  mobidb-lite consensus disorder prediction   130 178 -   T   14-02-2019 
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63    227 MobiDBLite  mobidb-lite consensus disorder prediction   1   49  -   T   14-02-2019 
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5    196 Pfam    PF04103 CD20-like family    13  156 7.0E-7  T   14-02-2019  IPR007237   CD20-like family 
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc    211 Pfam    PF03184 DDE superfamily endonuclease    3   140 1.8E-22 T   14-02-2019 IPR004875    DDE superfamily endonuclease domain 
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3    192 CDD cd01670 Death   148 181 1.65022E-6  T   14-02-2019 
ACDI|gi|1005435757|ref|XP_015755391.1|  50dff494b456096e706288e96a1506e0    207 MobiDBLite  mobidb-lite consensus disorder prediction   130 180 -   T   14-02-2019 
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d    266 Pfam    PF14997 CECR6/TMEM121 family    66  244 1.4E-22 T   14-02-2019  IPR032776   CECR6/TMEM121 family 
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc    143 Pfam    PF04752 ChaC-like protein   6   123 1.8E-26 T   14-02-2019  IPR006840   Glutathione-specific gamma-glutamylcyclotransferase 
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49    190 Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal 137 188 5.8E-11 T   14-02-2019  IPR013706   3'5'-cyclic nucleotide phosphodiesterase N-terminal 
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6    234 Pfam    PF15745 AP-1 complex-associated regulatory protein  27  178 4.5E-17 T   14-02-2019  IPR031483   AP-1 complex-associated regulatory protein 
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7    135 Gene3D  G3DSA:1.20.1250.20  1   112 3.0E-10 T   14-02-2019

I want to sort by the 5th and the 12th columns and have no duplicates in either of them. The 5th column is the method hit number (for example cd/G3D/PF etc.) and the 12th is the InterPro hit number (IPR).

So the output should contain lines that are unique by the 5th column and by the 12th column, keeping a line even if the 12th column is empty, like here:

ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    MobiDBLite      mobidb-lite     consensus disorder prediction   646     688     -       T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.90.70.10                88      176     3.5E-13 T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Gene3D  G3DSA:3.10.20.90                964     1044    1.5E-5  T       14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    Pfam    PF00443 Ubiquitin carboxyl-terminal hydrolase   96      492     1.9E-39 T       14-02-2019      IPR001394       Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1|  b9256e77b6b45c2b8267a9b18f5535e7        1063    CDD     cd02668 Peptidase_C19L  97      493     5.37034E-140    T       14-02-2019      IPR033841       Ubiquitin-specific peptidase 48
ACDI|gi|1005433616|ref|XP_015754353.1|  3d8e1345398c9346035f2aaf36a0ba63        227     MobiDBLite      mobidb-lite     consensus disorder prediction   1       49      -       T       14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1|  9649b3b2f3e16813b541cc225f16e7e5        196     Pfam    PF04103 CD20-like family        13      156     7.0E-7  T       14-02-2019      IPR007237       CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1|  1169df0014aa2b06a4e07981d056bbcc        211     Pfam    PF03184 DDE superfamily endonuclease    3       140     1.8E-22 T       14-02-2019      IPR004875       DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1|  801de18fcf5e339f411fe95038ca00f3        192     CDD     cd01670 Death   148     181     1.65022E-6      T       14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1|  c4efb60815fdf57cf0244dacf475f25d        266     Pfam    PF14997 CECR6/TMEM121 family    66      244     1.4E-22 T       14-02-2019      IPR032776       CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1|  4a622b0f2466759e2ab0e050856d6fcc        143     Pfam    PF04752 ChaC-like protein       6       123     1.8E-26 T       14-02-2019      IPR006840       Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1|  5cbfe3f69839493b89232b2be5be6b49        190     Pfam    PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal     137     188     5.8E-11 T       14-02-2019      IPR013706       3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1|  c5c2e6c3d63d0d13b87ad195f58f54e6        234     Pfam    PF15745 AP-1 complex-associated regulatory protein      27      178     4.5E-17 T       14-02-2019      IPR031483       AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1|  4e8c83abd5bd43fcf3d681da11c99ac7        135     Gene3D  G3DSA:1.20.1250.20              1       112     3.0E-10 T       14-02-2019
ACDI|gi|1005438440|ref|XP_015756623.1|  855e9b79f65e051746158c0f63a763a6        427     Pfam    PF00096 Zinc finger, C2H2 type  328     350     3.2E-5  T       14-02-2019      IPR013087       Zinc finger C2H2-type

Thanks for reading this far! I hope someone has a solution for this!

Of course I could do it in more than one line, but a one-line solution would be better...

Thanks a lot! :)


It's quite difficult to understand what you're trying to do. However, I see that your input contains spaces inside fields (e.g. "cyclic nucleotide phosphodiesterase"). The default field separator for awk is not a tab but any run of whitespace, and sort also splits fields on blanks by default, so when I see sort -u -k 5,5 input file| awk '!seen[$12]++' I think you should specify the delimiter in both commands.
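As a minimal sketch of what I mean (the helper names and the three demo rows are made up for illustration, and real InterProScan TSV has more columns), both tools can be told to split on tabs only:

```shell
TAB=$(printf '\t')   # a literal tab character, portable across sh and bash

# Hypothetical helper: keep one line per distinct value of tab-separated column 5.
unique_by_col5() {
    sort -t "$TAB" -u -k5,5
}

# Made-up rows whose first and last fields contain spaces.
demo_rows() {
    printf 'prot A\tmd5a\t100\tPfam\tPF1\tfirst family\n'
    printf 'prot B\tmd5b\t200\tPfam\tPF1\tfirst family\n'
    printf 'prot C\tmd5c\t300\tPfam\tPF2\tsecond family\n'
}

# Print column 5 of the surviving rows, splitting on tabs only.
demo_rows | unique_by_col5 | awk -F '\t' '{ print $5 }'
```

Without -F '\t', awk would count "prot" and "A" as two separate fields, so $5 on the first demo row would be "Pfam" rather than "PF1".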

5.2 years ago

You are right, Pierre Lindenbaum. As you can see, I added it in the second attempt: 'awk -F "\t" '

I got the solution here: https://www.unix.com/unix-for-beginners-questions-and-answers/281202-awk-seen-else-loop.html

In case someone else finds it useful. :)
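For anyone who does not want to follow the link, here is a minimal sketch of one way to do it. It is not necessarily the answer given in the linked thread, just one reading of the requirement: keep the first line for each value of column 5, keep the first line for each non-empty value of column 12, and never drop a line merely because column 12 is empty. My second attempt printed nothing useful because, inside an awk action block, a bare expression such as !seen[$12]++ is only evaluated; it selects lines for printing only when used as a pattern (or with an explicit print). The names dedup_ipr and sample_rows and the abbreviated rows are made up for illustration, and the sketch assumes a missing description is an empty tab-separated column.

```shell
# Hypothetical helper (name is illustrative): reads InterProScan-style TSV on stdin.
dedup_ipr() {
    awk -F '\t' '
        seen5[$5]++ { next }           # column 5 value was seen on an earlier line: skip
        $12 == "" || !seen12[$12]++    # print if column 12 is empty, or this IPR is new
    '
}

# Made-up, abbreviated sample: protein IDs and checksums are placeholders.
sample_rows() {
    printf 'p1\tm1\t427\tPfam\tPF00096\tZinc finger, C2H2 type\t328\t350\t3.2E-5\tT\t14-02-2019\tIPR013087\tZinc finger C2H2-type\n'
    printf 'p1\tm1\t427\tSMART\tSM00355\t\t356\t378\t5.5E-5\tT\t14-02-2019\tIPR013087\tZinc finger C2H2-type\n'
    printf 'p2\tm2\t1063\tGene3D\tG3DSA:3.90.70.10\t\t88\t176\t3.5E-13\tT\t14-02-2019\n'
    printf 'p2\tm2\t1063\tGene3D\tG3DSA:3.90.70.10\t\t195\t496\t2.0E-66\tT\t14-02-2019\n'
    printf 'p2\tm2\t1063\tGene3D\tG3DSA:3.10.20.90\t\t964\t1044\t1.5E-5\tT\t14-02-2019\n'
    printf 'p2\tm2\t1063\tPfam\tPF00443\tUbiquitin carboxyl-terminal hydrolase\t96\t492\t1.9E-39\tT\t14-02-2019\tIPR001394\tPeptidase C19\n'
}

# Keeps rows 1, 3, 5 and 6 of the sample.
sample_rows | dedup_ipr
```

Because the filter keeps the first occurrence in input order, no sort is needed beforehand; sorting first only matters if you care which of the duplicates survives.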
