How to discard identical hypothetical proteins in a protein file from 300 strains
21 months ago
Neel ▴ 20

Hi, I have almost 990798 hypothetical proteins across 300 strains. How can I remove the duplicates? I tried sorting, but the count stayed the same. I think this is because the headers differ, so every hypothetical protein entry is treated as unique, even though some proteins from different strains must have identical sequences.


Thank you!

annotation fasta • 672 views

You have to show us an example of the input.


Are you interested in removing sequences that have the word "hypothetical" in the header, or in removing sequences that are actually duplicates (irrespective of what the header says)?

For the first case, you can do the following (Pierre's FASTA code):

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' your.fa | grep -v "hypothetical" | tr "\t" "\n" | fold -w 80 > clean.fa

For the latter, you will need to use a program like cd-hit that actually looks at the sequences.
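If only exact duplicates matter (identical sequence, different header), a small script may already be enough. The following is a minimal sketch, not a tested pipeline: the function names `read_fasta`, `dedupe_by_sequence` and `write_fasta`, the example headers and the example sequences are all made up for illustration. It assumes plain protein FASTA, compares sequences case-insensitively, and keeps the first header seen for each distinct sequence.

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def dedupe_by_sequence(records):
    """Keep the first record for each distinct sequence, ignoring headers."""
    seen = set()
    unique = []
    for header, seq in records:
        key = seq.upper()  # assumption: case differences are not meaningful
        if key in seen:
            continue
        seen.add(key)
        unique.append((header, seq))
    return unique


def write_fasta(records, width=80):
    """Format records as FASTA text, wrapping sequence lines only."""
    out = []
    for header, seq in records:
        out.append(">" + header)
        for i in range(0, len(seq), width):
            out.append(seq[i:i + width])
    return "\n".join(out) + "\n"


# Hypothetical example input: the first two records share one sequence.
example = """>strainA_00001 hypothetical protein
MKTAYIAKQR
QISFVKSHFS
>strainB_00417 hypothetical protein
MKTAYIAKQRQISFVKSHFS
>strainB_00418 hypothetical protein
MSLNNVVRTA
"""

unique = dedupe_by_sequence(read_fasta(example.splitlines()))
print(write_fasta(unique), end="")
```

For a real file, pass an open file handle to `read_fasta` instead of `example.splitlines()`. This only collapses sequences that match exactly. Proteins that differ by even one residue are kept as separate entries, so near-identical sequences still need a clustering tool such as cd-hit.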


Thank you so much for your reply. Actually, I want to remove duplicate sequences.
