Seqkit - change regex for parsing ID
0
0
Entering edit mode
16 months ago

Hi guys, I am trying to use seqkit rmdup to remove duplicated sequences from my protein fasta files. However, it's only the accession numbers which are duplicated and not the description or sequences. See example below.

Host_331002_c0_seq1 95 1381 2 +
Host_331002_c0_seq1 1873 2112 1 +


So basically I want to set a flag which will stop at the first tab when searching the identifiers otherwise I won't get any duplicates in my output file. I think this flag would fix it but I am not sure what to enter as regex

  --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")


I just started learning all the programming languages and I am not certain how to change the default so it will stop after " Host_331002_c0_seq1" . Thanks is advance for your help!

Cheers!

seqkit database duplicates regex • 501 views
0
Entering edit mode

seems to be working fine for me. Please post specific example. input:

$cat test.fa >Host_331002_c0_seq1 95 1381 2 + atgc >Host_331002_c0_seq1 1873 2112 1 + atgc  output: $ seqkit --quiet rmdup test.fa
>Host_331002_c0_seq1 95 1381 2 +
atgc

0
Entering edit mode

Oh I see. Well, the sequences are different in my case. See example below.

>Host_331002_c0_seq1 95 1381 2 +
MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGR
VIIVTGASSGLGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNA
LATYMHLDLSSLESVRKFVDEFHATGKKLHVIVNNAGLALNFKDIKRQYT
KRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAYELARKLSGSG
VKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ
FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSL
DGYEPLAAPPRPVEEDSKPKEQEEKKVDEDTKAAETAGQEKEEDEQKNEE

>Host_331002_c0_seq1 1873 2112 1 +
MFGAIIKKPIAFPAXXDDDGAEDLGEIASAAAGSGKGSSQAESDYFDIVR


So it is really just the accession number which was used twice in the database I am working with. Since both the whole identifier (accession + description) and the sequences are different seqkit does not identify it as duplicates. My goal is to pull out the sequences with duplicated accessions and rename them manually. My database contains more then 1.5 million sequences so doing it manually is not really an option. Thanks again for your help!

1
Entering edit mode
1. In my first post, de-duplication was based ID, not on sequence.
2. I ran the same function on example fasta you furnished and it is working as expected (keeping the first sequence of the duplicates):

input:

$cat test.fa >Host_331002_c0_seq1 95 1381 2 + MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGR VIIVTGASSGLGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNA LATYMHLDLSSLESVRKFVDEFHATGKKLHVIVNNAGLALNFKDIKRQYT ADGFELTIGTNHLGPFLLTNLLLDDLNKAAENGDARVVVVTSALHDARCC KRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAYELARKLSGSG VKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSL DGYEPLAAPPRPVEEDSKPKEQEEKKVDEDTKAAETAGQEKEEDEQKNEE KDADKSAEEKPKIEDCDKSAEETTKAVD >Host_331002_c0_seq1 1873 2112 1 + MFGAIIKKPIAFPAXXDDDGAEDLGEIASAAAGSGKGSSQAESDYFDIVR VAELACRRSSNTEKTRCIAQLICCFFILL  output $ seqkit --quiet rmdup test.fa
>Host_331002_c0_seq1 95 1381 2 +
MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGRVIIVTGASSG
LGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNALATYMHLDLSSLESVRKFVD
ENGDARVVVVTSALHDARCCKRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAY
ELARKLSGSGVKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ
FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSLDGYEPLAAPP
EETTKAVD


on the other hand, if id is always followed by a space, you can use awk to print id and sequence and then de-duplicate the file. But keep a back up of your data first.

0
Entering edit mode

Interesting! Repeating your commands using a test file worked for me as well so it seems to be an issue with the format of my input database. Thank you!