How to discard identical hypothetical proteins in a protein file from 300 strains

21 months ago

Neel ▴ 20

Hi, I have almost 990798 hypothetical proteins across 300 strains, so my question is: how can I remove the duplicates? I tried sorting, but it gave the same number because the headers differ. I think every hypothetical-protein header is treated as unique, even though some proteins must have the same sequence in two different strains.

Thank you!

annotation fasta • 672 views

ADD COMMENT • link 21 months ago by Neel ▴ 20

You have to show us an example of the input.

ADD REPLY • link 21 months ago by Pierre Lindenbaum 161k

Are you interested in removing sequences that have the word "hypothetical" in the header, or sequences that are actual duplicates (irrespective of what the header says)?

For the first case, you can do this (Pierre's FASTA linearization code):

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' your.fa | grep -v "hypothetical" | tr "\t" "\n" | fold -w 80 > clean.fa

For the latter, you will need to use a program like CD-HIT that actually looks at the sequences.
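
If "duplicate" only means exactly identical sequences, the same linearize-then-filter idea can be sketched in awk alone, keyed on the sequence instead of the header. This is a minimal sketch, not a tested pipeline for your data: `dedup_fasta` is a hypothetical helper name, it assumes plain protein FASTA on standard input, compares sequences case-insensitively, keeps the first header seen for each sequence, and writes each sequence on a single line. It will not merge near-identical sequences or fragments; that is where a clustering tool comes in.

```shell
# Sketch: keep only the first record for each distinct sequence in a FASTA stream.
# dedup_fasta is a hypothetical helper name; it reads FASTA on stdin and
# writes the de-duplicated records (sequence on one line) to stdout.
dedup_fasta() {
  awk '
    # Print the pending record unless its sequence was already seen.
    function emit() {
      if (header != "" && !(seq in seen)) {
        seen[seq] = 1
        print header
        print seq
      }
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { emit(); header = $0; seq = ""; next }
    # Sequence lines: drop a trailing carriage return, uppercase, and join.
    { sub(/\r$/, ""); seq = seq toupper($0) }
    # Do not forget the last record in the file.
    END { emit() }
  '
}
```

Usage would look like `dedup_fasta < your.fa > unique.fa`; counting lines that start with `>` in both files shows how many records were dropped. All distinct sequences are held in memory, which should be manageable for roughly a million proteins.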

ADD REPLY • link 21 months ago by GenoMax 142k

Thank you so much for your reply. Actually, I want to remove duplicate sequences.

ADD REPLY • link 21 months ago by Neel ▴ 20
