Combine rename and rmdup in SeqKit to remove duplicate sequences and append N in header? Sort by occurrence?

29 days ago

Broccoli • 0

I have a FASTA-file like this:

>seqA 
AAAAAAAAAA
>seqB 
AAAAAAAAAA
>seqC 
TTTTTTTTTT
>seqD 
CCCCCCCCCC
>seqE 
CCCCCCCCCC
>seqF
AAAAAAAAAA

I've recently started learning SeqKit, and I've found that rename can append _N to the header based on the occurrence of the sequence, and that rmdup can remove duplicates. Is it possible to combine these two commands? If not, and I start by appending _N, how do I make sure the record with the highest number is kept when I remove duplicate sequences?

Maybe I'm not explaining myself well, and I'm all new to this, but basically, my end goal is preferably this:

>3 
AAAAAAAAAA
>1 
TTTTTTTTTT
>2 
CCCCCCCCCC

And if it's not possible to completely change the header, can the file be sorted by occurrence? Like this:

>seqA_3
AAAAAAAAAA
>seqD_2
CCCCCCCCCC
>seqC_1 
TTTTTTTTTT

Preferably, the solution would use SeqKit or another tool with relatively low memory usage, because my data set is very large.

seqkit fasta • 145 views
