Hi everyone,
I have data like the example below, and for each gene I have to select the splice variant with the longest protein sequence and remove the rest from the my.fasta file.
The my.fasta file has 32000 protein sequences and also contains 1023 splice variants.
>Bpen|evm.model.Contig148.21 <===(splice variant number 1 has no "." extensions)((I want this for example))
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQ
LIGFCVEGSKRALVYEFMPNGSLNKHIFLPEISALLSYDKMYDIALGILHFDIKPHNILL
DENFTPKVSDFGLAKLYPVNDNIVYLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFGML
LMEMAGRRKNLNAFAEHSSQIYFPTWVYDQLNDGNDIEMEDAIEEEKKKGKKMIIVALWC
IQMKPSDRPSMNKVVQMLEGEVECLQMPSKPSLSSLESIIAAASIFYNLSSPPLTQASLF
LITHIEAYIPLHSP
>Bpen|evm.model.Contig148.21.1 <===(splice variant number 2)
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGLLMEMAGRRKNLN
>Bpen|evm.model.Contig148.21.2 <===(splice variant number 3)
MTKSFKDELGEGGFGTVFKGSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQLIG
SKRALVYEFMPNGNFTPKVSDFGLAKLLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFG
MLLMEMAGRR
>Bpen|evm.model.Contig148.21.3 <===(splice variant number 4)
MTKSFKDELGEGGFGRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHIGFCVEGSKRALV
LNKHIFLPYDIALGILHFDIKNFTPKVLYPVNYGYMAPGVFGMLLMEMAGRRKNLN
How can I search for the splice variant pattern in all headers, read the sequences, and report the variant with the longest protein sequence? Note that which variant is longest differs from gene to gene: sometimes splice variant 1 is the longest, sometimes variant 2, and so on.
I would appreciate any help, whatever the approach or programming language.
Cheers
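One possible way to do this, sketched in plain Python with no extra libraries. It rests on assumptions taken from the example above and may need adjusting for the real file: the sequence ID is the first whitespace-separated word of the header, IDs look like <prefix>.model.<contig>.<gene number> with an optional .<variant number> at the end, and the contig name itself contains no dot. The function names and the script name are made up for illustration. Headers that do not fit the pattern are treated as genes of their own, and when two variants tie in length the one that appears first in the file is kept.

```python
import re
import sys

# Assumed ID layout: <prefix>.model.<contig>.<gene number>[.<variant number>]
# e.g. Bpen|evm.model.Contig148.21 and Bpen|evm.model.Contig148.21.3
VARIANT_RE = re.compile(r"^(.+?\.model\.[^.]+\.\d+)(?:\.\d+)?$")


def read_fasta(lines):
    """Yield (header, sequence) pairs; the header excludes the leading '>'."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Return the ID shared by all splice variants of one gene."""
    parts = header.split()
    seq_id = parts[0] if parts else ""
    match = VARIANT_RE.match(seq_id)
    # IDs that do not fit the assumed layout are kept as their own gene.
    return match.group(1) if match else seq_id


def longest_variants(records):
    """Keep, for each gene, the record with the longest sequence."""
    best = {}
    for header, seq in records:
        gene = gene_id(header)
        # Strict '>' means the first variant seen wins a tie.
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())


def write_fasta(records, out, width=60):
    """Write records as FASTA, wrapping sequences at `width` characters."""
    for header, seq in records:
        out.write(">" + header + "\n")
        for start in range(0, len(seq), width):
            out.write(seq[start:start + width] + "\n")


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        write_fasta(longest_variants(read_fasta(handle)), sys.stdout)
```

Saved as, say, longest_variant.py, it could be run as `python longest_variant.py my.fasta > my.longest.fasta`. The original file is left untouched, so the result can be checked before replacing anything.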
Thanks for the guide. I would definitely like to learn programming. I am new to this field, but I am doing my best, even though progress is slow.