Question

How to pick the longest splice variants sequence from a fasta file?

0

Entering edit mode

9.5 years ago

Jaan • 0

Hi every one,

I have a data example like follow, and I have to select the splice variant which has the longest prot. sequences and remove the rest from my.fasta file. my.fasta file has 32000 protein sequences and also contains 1023 splice variants.

>Bpen|evm.model.Contig148.21 <===(splice variant number 1 has no "." extensions)((I want this for example))
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQ
LIGFCVEGSKRALVYEFMPNGSLNKHIFLPEISALLSYDKMYDIALGILHFDIKPHNILL
DENFTPKVSDFGLAKLYPVNDNIVYLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFGMLLMEMAGRRKNLNAFAEHSSQIYFPTWVYDQLNDGNDIEMEDAIEEEKKKGKKMIIVALWC
IQMKPSDRPSMNKVVQMLEGEVECLQMPSKPSLSSLESIIAAASIFYNLSSPPLTQASLF
LITHIEAYIPLHSP

>Bpen|evm.model.Contig148.21.1 <===(splice variant number 2)
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGLLMEMAGRRKNLN

>Bpen|evm.model.Contig148.21.2 <===(splice variant number 3)
MTKSFKDELGEGGFGTVFKGSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQLIG
SKRALVYEFMPNGNFTPKVSDFGLAKLLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFG
MLLMEMAGRR

>Bpen|evm.model.Contig148.21.3 <===(splice variant number 4)
MTKSFKDELGEGGFGRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHIGFCVEGSKRALV
LNKHIFLPYDIALGILHFDIKNFTPKVLYPVNYGYMAPGVFGMLLMEMAGRRKNLN

How can I search for splice variant patterns in all headers, read the sequences, and report the one with the longest prot. sequences. Just to mention, the patterns of long and short sequences is different in different splice variants; some time splice variant 1 has the longest, some time 2 and so on.

I appreciate any help, no matter of what ways or programming languages.

Cheers

gene genome R next-gen-sequencing • 2.7k views

ADD COMMENT • link updated 2.2 years ago by Ram 43k • written 9.5 years ago by Jaan • 0

Ram · Answer 1 · 2014-11-04

Just use biopython or bioperl, depending on whether you know python or perl. If the file is sorted such that transcripts are grouped together, then you can perform this in a single pass. The general steps, then, would be:

Create a placeholder for the last_gene, last_gene_length and last_gene_record.
For each record, extract the gene ID
If this is the same as last_gene, then see which has a longer length
If the current one is longer replace everything from step #1
If step #3 finds a different gene, write last_gene_record to a file.
Iterate until completion, not forgetting to write out the last record.

It's not unreasonable to expect you to be able to write a program to do that. You don't even need anything other than biopython/bioperl and base functions.