Question: How to pick the longest splice variant sequence from a FASTA file?
Jaan (Finland) wrote, 6.1 years ago:

Hi everyone,

I have data like the example below, and I need to select the splice variant with the longest protein sequence and remove the rest from my.fasta. The file has 32000 protein sequences and also contains 1023 splice variants.

>Bpen|evm.model.Contig148.21 <===(splice variant number 1 has no "." extensions)((I want this for example))
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQ
LIGFCVEGSKRALVYEFMPNGSLNKHIFLPEISALLSYDKMYDIALGILHFDIKPHNILL
DENFTPKVSDFGLAKLYPVNDNIVYLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFGMLLMEMAGRRKNLNAFAEHSSQIYFPTWVYDQLNDGNDIEMEDAIEEEKKKGKKMIIVALWC
IQMKPSDRPSMNKVVQMLEGEVECLQMPSKPSLSSLESIIAAASIFYNLSSPPLTQASLF
LITHIEAYIPLHSP

>Bpen|evm.model.Contig148.21.1 <===(splice variant number 2)
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGLLMEMAGRRKNLN

>Bpen|evm.model.Contig148.21.2 <===(splice variant number 3)
MTKSFKDELGEGGFGTVFKGSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQLIG
SKRALVYEFMPNGNFTPKVSDFGLAKLLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFG
MLLMEMAGRR

>Bpen|evm.model.Contig148.21.3 <===(splice variant number 4)
MTKSFKDELGEGGFGRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHIGFCVEGSKRALV
LNKHIFLPYDIALGILHFDIKNFTPKVLYPVNYGYMAPGVFGMLLMEMAGRRKNLN

How can I search for splice variant patterns in all headers, read the sequences, and report the variant with the longest protein sequence? Note that which variant is longest differs from gene to gene: sometimes splice variant 1 has the longest sequence, sometimes variant 2, and so on.

I appreciate any help, in any approach or programming language.

Cheers

genome sequencing next-gen R gene • 2.1k views
Devon Ryan (Freiburg, Germany) wrote, 6.1 years ago:

Just use Biopython or BioPerl, depending on whether you know Python or Perl. If the file is sorted so that the transcripts of each gene are grouped together, you can do this in a single pass. The general steps would be:

  1. Create placeholders for last_gene, last_gene_length and last_gene_record.
  2. For each record, extract the gene ID.
  3. If this is the same as last_gene, compare the current sequence length with last_gene_length.
  4. If the current one is longer, replace everything from step #1.
  5. If step #3 finds a different gene, write last_gene_record to a file and reset the placeholders from step #1 to the current record.
  6. Iterate until completion, not forgetting to write out the last record.
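
The steps above can be sketched roughly as follows. This is only one possible reading of them, not a tested pipeline. To keep it runnable without extra packages, a small hand-written FASTA reader (read_fasta) stands in for the Biopython/BioPerl parser; Biopython's SeqIO.parse(handle, "fasta") would play the same role. The function names and the gene_id rule are assumptions made for illustration: gene_id assumes every ID has four dot-separated fields (as in Bpen|evm.model.Contig148.21) with an optional fifth field for the variant number, and would need adjusting if other header shapes occur in the real file.

```python
import io
import sys


def read_fasta(handle):
    """Yield (header, sequence) pairs; multi-line sequences are joined."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gene_id(header):
    """ASSUMPTION: the gene ID is the first four dot-separated fields of
    the first word in the header; a fifth field is the variant number."""
    parts = header.split()
    record_id = parts[0] if parts else ""
    return ".".join(record_id.split(".")[:4])


def longest_variants(records):
    """Single pass over records grouped by gene (steps 1 to 6).
    On equal lengths the first variant seen is kept."""
    last_gene, last_record = None, None          # step 1
    for header, seq in records:
        gene = gene_id(header)                   # step 2
        if gene != last_gene:                    # step 5
            if last_record is not None:
                yield last_record
            last_gene, last_record = gene, (header, seq)
        elif len(seq) > len(last_record[1]):     # steps 3 and 4
            last_record = (header, seq)
    if last_record is not None:                  # step 6: flush the last gene
        yield last_record


def write_fasta(records, handle, width=60):
    """Write records as FASTA, wrapping sequences at `width` characters."""
    for header, seq in records:
        handle.write(f">{header}\n")
        for i in range(0, len(seq), width):
            handle.write(seq[i:i + width] + "\n")


if __name__ == "__main__":
    # Toy in-memory input; for a real file, use open("my.fasta") instead.
    demo = io.StringIO(
        ">Bpen|evm.model.Contig1.1\nMTKSF\n"
        ">Bpen|evm.model.Contig1.1.1\nMTKSFKDELG\n"
        ">Bpen|evm.model.Contig2.7\nMAAA\n"
    )
    write_fasta(longest_variants(read_fasta(demo)), sys.stdout)
```

Because only the previous gene is remembered, this relies on the variants of a gene sitting next to each other in the file. If they are scattered, a dictionary keyed by gene ID that keeps the longest record seen so far would be a safer choice, at the cost of holding one record per gene in memory.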

It's not unreasonable to expect you to be able to write a program to do that. You don't even need anything other than biopython/bioperl and base functions.


Thanks for the guide. I would definitely like to learn programming. I am new to this field, but I am doing my best, even if progress is slow.

Reply written 6.1 years ago by Jaan