Hi guys, I am trying to use seqkit rmdup to remove duplicated sequences from my protein FASTA files. However, only the accession numbers are duplicated, not the descriptions or the sequences. See the example below.
Host_331002_c0_seq1 95 1381 2 +
Host_331002_c0_seq1 1873 2112 1 +
So basically I want to set a flag that makes the identifier search stop at the first tab; otherwise no duplicates are found and nothing gets removed from my output file. I think this flag would fix it, but I am not sure what to enter as the regex:
--id-regexp string regular expression for parsing ID (default "^(\\S+)\\s?")
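To see what that default pattern captures, here is a minimal sketch in Python. It assumes Python's `re` module treats this simple pattern the same way as the regex engine seqkit uses, and that the doubled backslashes in the help text are just string escaping. The `parse_id` helper and its fallback to the whole header are hypothetical, for illustration only.

```python
import re

# Default pattern from the --id-regexp help text, with the string escaping
# removed ("\\S" in the help output becomes \S in a raw Python string).
DEFAULT_ID_PATTERN = re.compile(r"^(\S+)\s?")


def parse_id(header: str) -> str:
    """Return the first capture group of the pattern.

    Falling back to the whole header when nothing matches is an assumption
    made for this sketch, not documented seqkit behaviour.
    """
    match = DEFAULT_ID_PATTERN.match(header)
    return match.group(1) if match else header


# The two example headers from the post (without the leading ">").
headers = [
    "Host_331002_c0_seq1 95 1381 2 +",
    "Host_331002_c0_seq1 1873 2112 1 +",
]

ids = [parse_id(h) for h in headers]
print(ids)
```

In this sketch both headers give the same ID, because `\S+` stops at the first whitespace character, whether that is a space or a tab. If seqkit parses IDs the same way, the default pattern may already stop where I want it to.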
I have only just started learning to program and I am not sure how to change the default so that it stops after "Host_331002_c0_seq1". Thanks in advance for your help!