How to filter "productive" amino acid sequences
7 months ago
sil_bioinfo ▴ 40

Hello,

I have a FASTA file containing several amino acid sequences, for example:

>abc
HSTSDSAQTMFPVALLLLAAGSCVKGEQLTQPTSVTVQPGQRLTITCQVSYSLGTYFTAW
IRQPAGKGLEWIGMRSTGASYYKDSLKNKFSIDLDTSSKTVTLNGQNVQPEDTAVYYCAR
APSRGFDYWGKGTMVTITSATPKGPTVFPL

>def
TARQIQHKPCFL*LCCCWQLDHV*RVNS*HSRPL*LCSQVNV*PSPVRSLILLVPTSQLG
SDSLQEKDWSGLE*DLLELHTTKIH*RTSSVST*TLPAKL*L*MDRMCSLKTLLCITVPE
RPVGVLTTGGKAPWSPSPRPPQRDQLCFL*

>ghi
GSQHVRFSTNHVSCSSAAVGSWIMCEG*TVDTADLCDCAARSTSDHHLSGLLFSW*LLHS
LDQTACRKRTGVDWEQIYWSCILQRFIKEQVQYRLRHFQQNCDSKWTECAA*RHCCVLLC
QTTGSGSWLLGERHHGHHHLGHPKGTNCVSS

and I want to separate the "productive" sequences from the "non-productive" ones.

Additional info: I translated every DNA sequence into amino acids in all six reading frames.

By "non-productive" I mean sequences that don't translate into proteins (they lack the amino acid M and/or contain too many stop codons). I would like to write these non-productive sequences to a FASTA file.

As for the "productive" ones, I would also like to save each "productive" sequence, keeping only the complete frame, in a separate FASTA file.

Is there a software tool that can do this? If not, I am trying to do it in Python, but I'm stuck. Any ideas are welcome.

Thank you in advance


Please do not delete posts once they have at least one comment or an answer.


don't have methionine

At the beginning of the sequence?


In general, anywhere in the sequence.

7 months ago
GenoMax 141k

If you are simply looking to remove sequences that contain a stop (*), then you can do the following:

Code to linearize fasta courtesy of @Pierre.

$ awk '/^>/ {printf("%s%s\t",(N>0?"\n":""),$0);N++;next;} {printf("%s",$0);} END {printf("\n");}' < test.fasta | grep -v "*" | tr "\t" "\n" | fold -w 60
>abc
HSTSDSAQTMFPVALLLLAAGSCVKGEQLTQPTSVTVQPGQRLTITCQVSYSLGTYFTAW
IRQPAGKGLEWIGMRSTGASYYKDSLKNKFSIDLDTSSKTVTLNGQNVQPEDTAVYYCAR
APSRGFDYWGKGTMVTITSATPKGPTVFPL
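
Since the question mentions Python, here is a minimal sketch of the same "drop anything with a stop" filter in plain Python. It is not a translation of the awk command, just the same idea; the function names `parse_fasta` and `keep_stop_free` are made up for illustration, and the parser assumes that `>` only appears at the start of header lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text.

    Assumption: '>' occurs only as the first character of a header line.
    """
    records = []
    for block in text.split(">")[1:]:
        if not block.strip():
            continue  # skip empty records such as a stray '>'
        header, *seq_lines = block.splitlines()
        # Join the wrapped sequence lines back into one string ("linearize").
        sequence = "".join(line.strip() for line in seq_lines)
        records.append((header.strip(), sequence))
    return records


def keep_stop_free(records):
    """Keep only the records whose sequence contains no stop symbol (*)."""
    return [(header, seq) for header, seq in records if "*" not in seq]
```

Reading the file with something like `parse_fasta(open("test.fasta").read())` and passing the result to `keep_stop_free` should leave only the records without a `*`.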

Hi, I would like to filter out sequences that, for example, don't have a methionine (M) and/or have many stop codons (*) in the middle, not just one. These sequences would be the "non-productive" ones, and I would like to create a FASTA file with these sequences too.
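
One way to express that rule in Python is sketched below. It is only a sketch under stated assumptions: a sequence counts as "productive" if it contains at least one M and no more than `max_stops` stop symbols, where the default of 1 is an arbitrary placeholder and not a biologically justified cutoff. All function names are hypothetical, and the parser assumes `>` only appears at the start of header lines.

```python
def parse_fasta(text):
    """Return a list of (header, sequence) tuples from FASTA-formatted text."""
    records = []
    for block in text.split(">")[1:]:
        if not block.strip():
            continue  # skip empty records
        header, *seq_lines = block.splitlines()
        records.append((header.strip(), "".join(s.strip() for s in seq_lines)))
    return records


def is_productive(seq, max_stops=1):
    """Assumed rule: at least one M and at most max_stops '*' symbols."""
    return "M" in seq and seq.count("*") <= max_stops


def split_records(records, max_stops=1):
    """Partition records into (productive, non_productive) lists."""
    productive, non_productive = [], []
    for header, seq in records:
        target = productive if is_productive(seq, max_stops) else non_productive
        target.append((header, seq))
    return productive, non_productive


def to_fasta(records, width=60):
    """Format records as FASTA text, wrapping sequences at `width` columns."""
    lines = []
    for header, seq in records:
        lines.append(">" + header)
        lines.extend(seq[i:i + width] for i in range(0, len(seq), width))
    return "".join(line + "\n" for line in lines)
```

The two lists returned by `split_records` can each be passed to `to_fasta` and written to its own file, which gives one FASTA file per class. Note that this counts every `*`, including one at the very end of a sequence; if a terminal stop should not count against a sequence, strip it before counting.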
