Question

Seqkit - change regex for parsing ID

0

Entering edit mode

5.1 years ago

marlenejensen ▴ 20

Hi guys, I am trying to use seqkit rmdup to remove duplicated sequences from my protein fasta files. However, it's only the accession numbers which are duplicated and not the description or sequences. See example below.

Host_331002_c0_seq1 95 1381 2 + 
Host_331002_c0_seq1 1873 2112 1 +

So basically I want to set a flag which will stop at the first tab when searching the identifiers otherwise I won't get any duplicates in my output file. I think this flag would fix it but I am not sure what to enter as regex

  --id-regexp string                regular expression for parsing ID (default "^(\\S+)\\s?")

I just started learning all the programming languages and I am not certain how to change the default so it will stop after " Host_331002_c0_seq1" . Thanks is advance for your help!

Cheers!

seqkit database duplicates regex • 1.8k views

ADD COMMENT • link 5.1 years ago by marlenejensen ▴ 20

0

Entering edit mode

seems to be working fine for me. Please post specific example. input:

$ cat test.fa 
>Host_331002_c0_seq1 95 1381 2 + 
atgc
>Host_331002_c0_seq1 1873 2112 1 +
atgc

output:

$ seqkit --quiet rmdup test.fa 
>Host_331002_c0_seq1 95 1381 2 + 
atgc

ADD REPLY • link 5.1 years ago by cpad0112 21k

0

Entering edit mode

Oh I see. Well, the sequences are different in my case. See example below.

>Host_331002_c0_seq1 95 1381 2 +
MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGR
VIIVTGASSGLGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNA
LATYMHLDLSSLESVRKFVDEFHATGKKLHVIVNNAGLALNFKDIKRQYT
ADGFELTIGTNHLGPFLLTNLLLDDLNKAAENGDARVVVVTSALHDARCC
KRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAYELARKLSGSG
VKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ
FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSL
DGYEPLAAPPRPVEEDSKPKEQEEKKVDEDTKAAETAGQEKEEDEQKNEE
KDADKSAEEKPKIEDCDKSAEETTKAVD

>Host_331002_c0_seq1 1873 2112 1 +
MFGAIIKKPIAFPAXXDDDGAEDLGEIASAAAGSGKGSSQAESDYFDIVR
VAELACRRSSNTEKTRCIAQLICCFFILL

So it is really just the accession number which was used twice in the database I am working with. Since both the whole identifier (accession + description) and the sequences are different seqkit does not identify it as duplicates. My goal is to pull out the sequences with duplicated accessions and rename them manually. My database contains more then 1.5 million sequences so doing it manually is not really an option. Thanks again for your help!

ADD REPLY • link updated 5.1 years ago by cpad0112 21k • written 5.1 years ago by marlenejensen ▴ 20

1

Entering edit mode

In my first post, de-duplication was based ID, not on sequence.
I ran the same function on example fasta you furnished and it is working as expected (keeping the first sequence of the duplicates):

input:

$ cat test.fa 
>Host_331002_c0_seq1 95 1381 2 +
MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGR
VIIVTGASSGLGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNA
LATYMHLDLSSLESVRKFVDEFHATGKKLHVIVNNAGLALNFKDIKRQYT
ADGFELTIGTNHLGPFLLTNLLLDDLNKAAENGDARVVVVTSALHDARCC
KRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAYELARKLSGSG
VKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ
FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSL
DGYEPLAAPPRPVEEDSKPKEQEEKKVDEDTKAAETAGQEKEEDEQKNEE
KDADKSAEEKPKIEDCDKSAEETTKAVD

>Host_331002_c0_seq1 1873 2112 1 +
MFGAIIKKPIAFPAXXDDDGAEDLGEIASAAAGSGKGSSQAESDYFDIVR
VAELACRRSSNTEKTRCIAQLICCFFILL

output

$ seqkit --quiet rmdup test.fa 
>Host_331002_c0_seq1 95 1381 2 +
MXETIKQVVVDEVSLLFVKPREXXYNFYCTEKELNKYXRDLCTGETMTGRVIIVTGASSG
LGYETARYLCEGGNDVILACRDEEKAKRAIDRIKQQNPNALATYMHLDLSSLESVRKFVD
EFHATGKKLHVIVNNAGLALNFKDIKRQYTADGFELTIGTNHLGPFLLTNLLLDDLNKAA
ENGDARVVVVTSALHDARCCKRTRNLQPLDTENLFLFDEGSFNGLQAYKNSKAANVMFAY
ELARKLSGSGVKVNAVNPGNVPSTDLMRHAGNAEKLFSRCVLHGVLRFTKMTRTIPQGAQ
FICSVATDDKYKDVTGKYLKEGQEATSSEETQSEELQTKLWEISGRYTSLDGYEPLAAPP
RPVEEDSKPKEQEEKKVDEDTKAAETAGQEKEEDEQKNEEKDADKSAEEKPKIEDCDKSA
EETTKAVD

on the other hand, if id is always followed by a space, you can use awk to print id and sequence and then de-duplicate the file. But keep a back up of your data first.

ADD REPLY • link 5.1 years ago by cpad0112 21k

0

Entering edit mode

Interesting! Repeating your commands using a test file worked for me as well so it seems to be an issue with the format of my input database. Thank you!

ADD REPLY • link 5.1 years ago by marlenejensen ▴ 20