Question

How to Deduplicate files

5.0 years ago

saadleeshehreen ▴ 140

Hi,

I have a list of 160111 protein files. Some of the files are duplicates, because the GCA and GCF IDs contain the same protein sequences. How can I deduplicate the list on the basis of the assembly name (ASM102201v1 in the example below)?

Enterobacter_hormaechei-158836#GCA_001022015.1/GCA_001022015.1_ASM102201v1_protein.faa
Enterobacter_cloacae-550#GCF_001022015.1/GCF_001022015.1_ASM102201v1_protein.faa

score 0 · Answer 1 · 2019-05-03

5.0 years ago

finswimmer 16k

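Try this (needs GNU awk). It splits each line at "#", extracts the assembly name (ASM...) from the second part, and prints only the first line seen for each name:

$ awk -v FS="#" '{match($2, /(ASM[^_]+)/, asm)} !seen[asm[1]]++' input_file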
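
If gawk is not available, the same idea can be sketched with the standard two-argument match(), which sets RSTART and RLENGTH instead of filling an array. The function name dedup_by_asm is hypothetical, and keeping lines that contain no ASM name unchanged is an assumption about the desired behaviour, not part of the command above.

```shell
# Sketch: keep only the first path seen for each assembly name (ASM...).
# dedup_by_asm is a hypothetical helper name; it reads files given as
# arguments, or standard input when no file is given.
dedup_by_asm() {
    awk -F'#' '
    {
        # Look for the assembly name in the part after the "#".
        if (match($2, /ASM[^_]+/)) {
            key = substr($2, RSTART, RLENGTH)
        } else {
            # Assumption: a line without an ASM name is keyed on the
            # whole line, so it is only dropped if it repeats exactly.
            key = $0
        }
        # Print the line only the first time its key appears.
        if (!seen[key]++) print
    }' "$@"
}
```

Usage would look like `dedup_by_asm input_file > deduplicated.txt` (the output file name is just an example). For the two paths in the question, only the first one (the GCA entry) is kept, because whichever line appears first for an assembly name wins.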
