Hi everyone,
I have data like the example below, and I have to select the splice variant that has the longest protein sequence and remove the rest from my.fasta file. The my.fasta file has 32000 protein sequences and also contains 1023 splice variants.
>Bpen|evm.model.Contig148.21 <===(splice variant number 1 has no "." extensions)((I want this for example))
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQ
LIGFCVEGSKRALVYEFMPNGSLNKHIFLPEISALLSYDKMYDIALGILHFDIKPHNILL
DENFTPKVSDFGLAKLYPVNDNIVYLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFGMLLMEMAGRRKNLNAFAEHSSQIYFPTWVYDQLNDGNDIEMEDAIEEEKKKGKKMIIVALWC
IQMKPSDRPSMNKVVQMLEGEVECLQMPSKPSLSSLESIIAAASIFYNLSSPPLTQASLF
LITHIEAYIPLHSP
>Bpen|evm.model.Contig148.21.1 <===(splice variant number 2)
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGLLMEMAGRRKNLN
>Bpen|evm.model.Contig148.21.2 <===(splice variant number 3)
MTKSFKDELGEGGFGTVFKGSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQLIG
SKRALVYEFMPNGNFTPKVSDFGLAKLLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFG
MLLMEMAGRR
>Bpen|evm.model.Contig148.21.3 <===(splice variant number 4)
MTKSFKDELGEGGFGRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHIGFCVEGSKRALV
LNKHIFLPYDIALGILHFDIKNFTPKVLYPVNYGYMAPGVFGMLLMEMAGRRKNLN
How can I search for splice variant patterns in all headers, read the sequences, and report the variant with the longest protein sequence? Just to mention, which variant is the longest differs from one set of splice variants to another: sometimes splice variant 1 is the longest, sometimes 2, and so on.
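For illustration, the grouping logic described above could be sketched in plain Python roughly as follows. This is only a sketch under assumptions: it assumes every ID looks like the example headers (`<prefix>.Contig<N>.<gene>` with an optional trailing `.<variant>`), that the ID is the first whitespace-separated token of the header, and that ties in length are resolved by keeping the first variant seen. The function names and the output file name `longest.fasta` are hypothetical.

```python
import re

# Assumed ID layout, taken from the example headers:
#   <prefix>.Contig<N>.<gene>            (variant 1, no extension)
#   <prefix>.Contig<N>.<gene>.<variant>  (variants 2, 3, ...)
ID_PATTERN = re.compile(r"^(.+\.Contig\d+\.\d+)(?:\.\d+)?$")


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """Return the ID without the variant suffix (the ID itself if no match)."""
    seq_id = header.split()[0]
    match = ID_PATTERN.match(seq_id)
    return match.group(1) if match else seq_id


def longest_variants(records):
    """Keep one record per gene: the one with the longest sequence."""
    best = {}  # gene ID -> (header, sequence); dicts keep insertion order
    for header, seq in records:
        key = gene_id(header)
        if key not in best or len(seq) > len(best[key][1]):
            best[key] = (header, seq)
    return list(best.values())


def write_fasta(records, handle, width=60):
    """Write records to an open file handle, wrapping sequences at `width`."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


def keep_longest(in_path, out_path):
    """Read in_path and write only the longest variant per gene to out_path."""
    with open(in_path) as src, open(out_path, "w") as dst:
        write_fasta(longest_variants(read_fasta(src)), dst)
```

Calling `keep_longest("my.fasta", "longest.fasta")` would write the filtered set to a new file rather than modifying my.fasta in place. Sequences without splice variants pass through unchanged, since each forms a group of one.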
I would appreciate any help, whatever the method or programming language.