How to pick the longest splice variants sequence from a fasta file?
1
0
Entering edit mode
9.5 years ago
Jaan • 0

Hi every one,

I have a data example like follow, and I have to select the splice variant which has the longest prot. sequences and remove the rest from my.fasta file. my.fasta file has 32000 protein sequences and also contains 1023 splice variants.

>Bpen|evm.model.Contig148.21 <===(splice variant number 1 has no "." extensions)((I want this for example))
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQ
LIGFCVEGSKRALVYEFMPNGSLNKHIFLPEISALLSYDKMYDIALGILHFDIKPHNILL
DENFTPKVSDFGLAKLYPVNDNIVYLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFGMLLMEMAGRRKNLNAFAEHSSQIYFPTWVYDQLNDGNDIEMEDAIEEEKKKGKKMIIVALWC
IQMKPSDRPSMNKVVQMLEGEVECLQMPSKPSLSSLESIIAAASIFYNLSSPPLTQASLF
LITHIEAYIPLHSP

>Bpen|evm.model.Contig148.21.1 <===(splice variant number 2)
MTKSFKDELGEGGFGTVFKGTLRSGRLVAIKMLGKSKTNGLLMEMAGRRKNLN

>Bpen|evm.model.Contig148.21.2 <===(splice variant number 3)
MTKSFKDELGEGGFGTVFKGSGRLVAIKMLGKSKTNGQDFINEVATIGRIHHVNVVQLIG
SKRALVYEFMPNGNFTPKVSDFGLAKLLTAVRGTLGYMAPELFYKNIGGVSFKADVYSFG
MLLMEMAGRR

>Bpen|evm.model.Contig148.21.3 <===(splice variant number 4)
MTKSFKDELGEGGFGRSGRLVAIKMLGKSKTNGQDFINEVATIGRIHIGFCVEGSKRALV
LNKHIFLPYDIALGILHFDIKNFTPKVLYPVNYGYMAPGVFGMLLMEMAGRRKNLN

How can I search for splice variant patterns in all headers, read the sequences, and report the one with the longest prot. sequences. Just to mention, the patterns of long and short sequences is different in different splice variants; some time splice variant 1 has the longest, some time 2 and so on.

I appreciate any help, no matter of what ways or programming languages.

Cheers

gene genome R next-gen-sequencing • 2.7k views
ADD COMMENT
1
Entering edit mode
9.5 years ago

Just use biopython or bioperl, depending on whether you know python or perl. If the file is sorted such that transcripts are grouped together, then you can perform this in a single pass. The general steps, then, would be:

  1. Create a placeholder for the last_gene, last_gene_length and last_gene_record.
  2. For each record, extract the gene ID
  3. If this is the same as last_gene, then see which has a longer length
  4. If the current one is longer replace everything from step #1
  5. If step #3 finds a different gene, write last_gene_record to a file.
  6. Iterate until completion, not forgetting to write out the last record.

It's not unreasonable to expect you to be able to write a program to do that. You don't even need anything other than biopython/bioperl and base functions.

ADD COMMENT
0
Entering edit mode

Thanks for the guide. I definitely like to learn programming even though I new in this field, but I am doing my best with slow progress.

ADD REPLY

Login before adding your answer.

Traffic: 1973 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6