Question

Append number to the fasta sequence header with the times sequence has been repeated

5.8 years ago

Gene-ticks ▴ 10

Hi all, I need some help programming this little problem (in either Perl or Python), as I am not very familiar with either language. Is there a way to append the number of occurrences (like Xtimes) to the header of the first copy of a duplicated sequence, and discard the other duplicates? Example input:

>name_1_1
AGGGTTT
>name_:2:_X
GTTTGAA
>name_:3:_Y
GTTTGAA

Result I want:

>name_1_1
AGGGTTT
>name_:2:_X_2times
GTTTGAA

perl python fasta • 1.8k views

ADD COMMENT • link updated 5.8 years ago by Pierre Lindenbaum 164k • written 5.8 years ago by Gene-ticks ▴ 10

miRDeep (Perl code) does exactly this. It is quite simple: build a hash with the sequences as keys and the names as values. If a key already exists, increment a counter for it instead of assigning the new sequence name as the value.

ADD REPLY • link 5.8 years ago by h.mon 35k
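
The hash idea described above can be sketched in Python roughly as follows. This is a minimal illustration, not miRDeep's code: the function names `read_fasta` and `collapse_duplicates` are made up for this example, and it assumes that only duplicated sequences should get the `_Ntimes` suffix, as in the desired result shown in the question.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs; a sequence may span several lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def collapse_duplicates(lines):
    """Return FASTA lines with duplicates collapsed onto the first header."""
    seen = {}  # sequence -> [first header, count]; dicts keep insertion order
    for header, seq in read_fasta(lines):
        if seq in seen:
            seen[seq][1] += 1          # key exists: only bump the counter
        else:
            seen[seq] = [header, 1]    # first occurrence: remember its header
    out = []
    for seq, (header, count) in seen.items():
        # Assumption: unique sequences keep their header unchanged.
        out.append(f"{header}_{count}times" if count > 1 else header)
        out.append(seq)
    return out


example = """>name_1_1
AGGGTTT
>name_:2:_X
GTTTGAA
>name_:3:_Y
GTTTGAA
""".splitlines()

print("\n".join(collapse_duplicates(example)))
```

To run it on a real file, something like `collapse_duplicates(open("input.fa"))` should work, where `input.fa` is a placeholder file name. Everything is held in memory, so this suits small to medium files.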

score 0 · Answer 1 · 2019-01-31

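The FASTA records are first linearized (one record per line: header, tab, sequence). Two processes are then joined on the sequence: the first extracts the sequences and counts each one; the second sorts the linearized FASTA on the sequence, keeping one record per distinct sequence.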

 join -t $'\t' -1 2 -2 2 \
   <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa | cut -f 2 | sort | uniq -c | awk '{printf("%s\t%s\n",$1,$2);}') \
   <(awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' jeter.fa | sort -t $'\t' -u -k2,2) |
   awk -F '\t' '{printf("%s_%s\n%s\n",$3,$2,$1);}'
>name_1_1_1
AGGGTTT
>name_:2:_X_2
GTTTGAA
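
For readers who find the awk hard to follow, the same three stages (linearize, count, join on the sequence) can be sketched in Python. This is only an illustration of the pipeline's logic under stated assumptions: the helper name `linearize` is made up here, and keeping the first header seen for each sequence is an assumption about which duplicate survives.

```python
from collections import Counter


def linearize(lines):
    """Turn FASTA lines into 'header<TAB>sequence' strings, one per record."""
    records = []
    for line in lines:
        line = line.rstrip("\n")
        if line.startswith(">"):
            records.append([line, ""])   # new record starts at each header
        elif records:
            records[-1][1] += line       # glue wrapped sequence lines together
    return ["\t".join(r) for r in records]


# The first sequence is deliberately wrapped over two lines.
fasta = [">name_1_1", "AGGG", "TTT", ">name_:2:_X", "GTTTGAA", ">name_:3:_Y", "GTTTGAA"]
linear = linearize(fasta)

# Stage 1: count each sequence (the role of cut | sort | uniq -c).
counts = Counter(rec.rsplit("\t", 1)[1] for rec in linear)

# Stage 2: keep one header per distinct sequence (the role of sort -u -k2,2).
first = {}
for rec in linear:
    header, seq = rec.rsplit("\t", 1)
    first.setdefault(seq, header)

# Stage 3: join both on the sequence and print header_count, then the sequence.
result = []
for seq in sorted(first):
    result.append(f"{first[seq]}_{counts[seq]}")
    result.append(seq)

print("\n".join(result))
```

As in the shell version, every header gets a count suffix here, including sequences that occur only once.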

score 0 · Answer 2 · 2019-01-31

5.8 years ago

gb ★ 2.2k

I don't know whether you necessarily need to program it yourself, but I use vsearch --derep_fulllength for this; it is easy and fast.
