Hi all,
I searched the net for a solution to my problem, but have found nothing so far. I want to sort a tab-delimited file on more than one column. If a line has no value in the column I sort on, I want to keep it; for lines that do have a value, I want to keep only the unique ones.
I have tried the options: Code:
sort -u -k 5,5 input file| awk '!seen[$12]++'| grep 'IPR013087'
but here I lose the lines that have nothing in the 12th column... another option: Code:
sort -u -k 5,5 Acropora_digitifera_protein.fasta.tsv| awk -F "\t" '{if ($12=="") print $0; else; !seen[$12]++}'| grep 'IPR013087'
Here it looks like "!seen[$12]++}" does nothing, and the output is empty. :( I want to keep all lines but have only the unique ones by the 5th column and by the 12th column; lines that have no value in the 12th column should be kept.
In more detail:
My data set:
ACDI|gi|1005438440|ref|XP_015756623.1| 855e9b79f65e051746158c0f63a763a6 427 Pfam PF00096 Zinc finger, C2H2 type 328 350 3.2E-5 T 14-02-2019 IPR013087 Zinc finger C2H2-type
ACDI|gi|1005438440|ref|XP_015756623.1| 855e9b79f65e051746158c0f63a763a6 427 SMART SM00355 356 378 5.5E-5 T 14-02-2019 IPR013087 Zinc finger C2H2-type
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 646 688 - T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Gene3D G3DSA:3.90.70.10 88 176 3.5E-13 T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Gene3D G3DSA:3.90.70.10 195 496 2.0E-66 T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Gene3D G3DSA:3.10.20.90 964 1044 1.5E-5 T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Pfam PF00443 Ubiquitin carboxyl-terminal hydrolase 96 492 1.9E-39 T 14-02-2019 IPR001394 Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 130 181 - T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 CDD cd02668 Peptidase_C19L 97 493 5.37034E-140 T 14-02-2019 IPR033841 Ubiquitin-specific peptidase 48
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 933 974 - T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 944 961 - T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 130 178 - T 14-02-2019
ACDI|gi|1005433616|ref|XP_015754353.1| 3d8e1345398c9346035f2aaf36a0ba63 227 MobiDBLite mobidb-lite consensus disorder prediction 1 49 - T 14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1| 9649b3b2f3e16813b541cc225f16e7e5 196 Pfam PF04103 CD20-like family 13 156 7.0E-7 T 14-02-2019 IPR007237 CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1| 1169df0014aa2b06a4e07981d056bbcc 211 Pfam PF03184 DDE superfamily endonuclease 3 140 1.8E-22 T 14-02-2019 IPR004875 DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1| 801de18fcf5e339f411fe95038ca00f3 192 CDD cd01670 Death 148 181 1.65022E-6 T 14-02-2019
ACDI|gi|1005435757|ref|XP_015755391.1| 50dff494b456096e706288e96a1506e0 207 MobiDBLite mobidb-lite consensus disorder prediction 130 180 - T 14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1| c4efb60815fdf57cf0244dacf475f25d 266 Pfam PF14997 CECR6/TMEM121 family 66 244 1.4E-22 T 14-02-2019 IPR032776 CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1| 4a622b0f2466759e2ab0e050856d6fcc 143 Pfam PF04752 ChaC-like protein 6 123 1.8E-26 T 14-02-2019 IPR006840 Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1| 5cbfe3f69839493b89232b2be5be6b49 190 Pfam PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal 137 188 5.8E-11 T 14-02-2019 IPR013706 3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1| c5c2e6c3d63d0d13b87ad195f58f54e6 234 Pfam PF15745 AP-1 complex-associated regulatory protein 27 178 4.5E-17 T 14-02-2019 IPR031483 AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1| 4e8c83abd5bd43fcf3d681da11c99ac7 135 Gene3D G3DSA:1.20.1250.20 1 112 3.0E-10 T 14-02-2019
I want to sort by the 5th and the 12th columns and have no duplicates in either of them. The 5th is the method hit number (for example cd/G3D/PF etc.) and the 12th is the InterPro hit number (IPR).
So the output should contain lines that are unique by the 5th column and by the 12th column, keeping a line even if there is nothing in the 12th, like here:
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 MobiDBLite mobidb-lite consensus disorder prediction 646 688 - T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Gene3D G3DSA:3.90.70.10 88 176 3.5E-13 T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Gene3D G3DSA:3.10.20.90 964 1044 1.5E-5 T 14-02-2019
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 Pfam PF00443 Ubiquitin carboxyl-terminal hydrolase 96 492 1.9E-39 T 14-02-2019 IPR001394 Peptidase C19, ubiquitin carboxyl-terminal hydrolase
ACDI|gi|1005424999|ref|XP_015778892.1| b9256e77b6b45c2b8267a9b18f5535e7 1063 CDD cd02668 Peptidase_C19L 97 493 5.37034E-140 T 14-02-2019 IPR033841 Ubiquitin-specific peptidase 48
ACDI|gi|1005433616|ref|XP_015754353.1| 3d8e1345398c9346035f2aaf36a0ba63 227 MobiDBLite mobidb-lite consensus disorder prediction 1 49 - T 14-02-2019
ACDI|gi|1005492169|ref|XP_015748752.1| 9649b3b2f3e16813b541cc225f16e7e5 196 Pfam PF04103 CD20-like family 13 156 7.0E-7 T 14-02-2019 IPR007237 CD20-like family
ACDI|gi|1005474816|ref|XP_015774180.1| 1169df0014aa2b06a4e07981d056bbcc 211 Pfam PF03184 DDE superfamily endonuclease 3 140 1.8E-22 T 14-02-2019 IPR004875 DDE superfamily endonuclease domain
ACDI|gi|1005478159|ref|XP_015775824.1| 801de18fcf5e339f411fe95038ca00f3 192 CDD cd01670 Death 148 181 1.65022E-6 T 14-02-2019
ACDI|gi|1005480051|ref|XP_015776754.1| c4efb60815fdf57cf0244dacf475f25d 266 Pfam PF14997 CECR6/TMEM121 family 66 244 1.4E-22 T 14-02-2019 IPR032776 CECR6/TMEM121 family
ACDI|gi|1005453471|ref|XP_015763894.1| 4a622b0f2466759e2ab0e050856d6fcc 143 Pfam PF04752 ChaC-like protein 6 123 1.8E-26 T 14-02-2019 IPR006840 Glutathione-specific gamma-glutamylcyclotransferase
ACDI|gi|1005420589|ref|XP_015757954.1| 5cbfe3f69839493b89232b2be5be6b49 190 Pfam PF08499 3'5'-cyclic nucleotide phosphodiesterase N-terminal 137 188 5.8E-11 T 14-02-2019 IPR013706 3'5'-cyclic nucleotide phosphodiesterase N-terminal
ACDI|gi|1005471241|ref|XP_015772489.1| c5c2e6c3d63d0d13b87ad195f58f54e6 234 Pfam PF15745 AP-1 complex-associated regulatory protein 27 178 4.5E-17 T 14-02-2019 IPR031483 AP-1 complex-associated regulatory protein
ACDI|gi|1005448265|ref|XP_015761397.1| 4e8c83abd5bd43fcf3d681da11c99ac7 135 Gene3D G3DSA:1.20.1250.20 1 112 3.0E-10 T 14-02-2019
ACDI|gi|1005438440|ref|XP_015756623.1| 855e9b79f65e051746158c0f63a763a6 427 Pfam PF00096 Zinc finger, C2H2 type 328 350 3.2E-5 T 14-02-2019 IPR013087 Zinc finger C2H2-type
Thanks for reading this far! I hope someone has a solution for this.
Of course I could use a solution that takes more than one line, but a one-line solution would be better...
Thanks a lot! :)
It's quite difficult to understand what you're trying to do. However, I see that your input contains some whitespace inside fields (e.g.:
cyclic nucleotide phosphodiesterase
). The default delimiter for awk is not tab but any run of whitespace, so when I see
sort -u -k 5,5 input file| awk '!seen[$12]++'
I think you should specify the delimiter in both commands....
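To illustrate the delimiter point, here is a minimal sketch rather than a definitive answer. It assumes the file really is tab-separated, that field 1 is the protein ID, and that "unique" means unique per protein (your expected output keeps a mobidb-lite line for more than one protein, so I am guessing here). The function name dedupe_hits is made up for the example. Note also that in your second attempt the !seen[$12]++ sits inside the braces, where it is only evaluated and never printed, so only the lines with an empty 12th column reach grep, which then drops them all.

```shell
# Hypothetical helper: keep the first line per (protein, method hit) and,
# when column 12 is non-empty, the first line per (protein, InterPro hit).
# Assumes tab-separated input; reads the named files, or stdin if none given.
dedupe_hits() {
  awk -F '\t' '
    # Skip the line if this protein already produced this column-5 value.
    seen5[$1 FS $5]++ { next }
    # Skip the line if column 12 is filled in and already seen for this protein.
    $12 != "" && seen12[$1 FS $12]++ { next }
    # Everything else is printed, including lines with an empty column 12.
    { print }
  ' "$@"
}
```

Usage would be something like dedupe_hits Acropora_digitifera_protein.fasta.tsv. This keeps the first occurrence in input order and does not sort at all; if you still want to sort first, give sort the same delimiter, e.g. sort -t "$(printf '\t')" -k5,5.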