Parser For Sim4 Output To Create Fasta Alignment (Preferably Python) Or: How Is Sim4 Output Formatted?
11.1 years ago
Tgh ▴ 10

Hey there,

I am currently trying to write a parser for the output of the SIM4 tool.

My intention is to use SIM4 to align mRNA to its respective gene. To be more specific, I want to run sim4 on two sequences (mRNA.fas and gen.fas), save the output and use my parser to get the alignment file in plain fasta format.

However, the output of sim4 is not trivial to parse (for me). In particular, I do not understand why the coordinates in the first part of the output (the exon endpoints) differ from those in the second part (the alignment). For example, in the last line of the first part the start coordinate is 14039, while in the alignment the first coordinate of that exon is given as 14054. Where does this difference come from?

At http://pipmaker.bx.psu.edu/pipmaker/pip-instr.html there is a description of the output format, but it does not go into much detail.

There is a Bioperl parser for sim4. I haven't tried it, although I did try to understand it and "copy" it to Python. Unfortunately I do not know much Perl and find the Bioperl code difficult to follow, since the information is "hidden" in so many different kinds of objects. Of course, if it turns out to be too cumbersome to write my own parser, I will dig deeper into how to use this Bioperl module.

I have 2 questions:

  1. Does anyone already have a python script that parses sim4 output and would be so kind as to share it with me?
  2. How is the sim4 output defined?

Here is an example of what the sim4 output looks like:

###First Part (with exon endpoints)

>gen; LEN=15595
>mRNA; LEN=1036

37-53  (29-45)   100% ==
1221-1235  (101-115)   100% ==
1685-1842  (191-353)   87% ==
2073-2098  (438-463)   92% -> (GT/AG) 22
14039-14598  (464-1024)   83%

###Second Part (with alignment)

0     .    :    .
37 GCTGCAAACTGGCCTCT
    |||||||||||||||||
29 GCTGCAAACTGGCCTCT

0     .    :    .
1221 AGGAGAAGAAAAAAA
    |||||||||||||||
101 AGGAGAAGAAAAAAA

0     .    :    .    :    .    :    .    :    .    :
1685 CAAAGGTCTTAAGTAGGGGTCCATAAACTACTGGAAATTTTATGCACAAT
    ||||||||||||||||| | | |||||||| |||||||||||||||||||
191 CAAAGGTCTTAAGTAGGTGACTATAAACTATTGGAAATTTTATGCACAAT

50     .    :    .    :    .    :    .    :    .    :
1735 TATACATTTTTGTGGGAAGAAGAAGGTCCATAGTATCAG A  TTCTCAA
    || |||||||||| | |||||||||||||||| ||||||-|--|||||||
241 TACACATTTTTGTAGAAAGAAGAAGGTCCATAATATCAGCAGATTCTCAA

100     .    :    .    :    .    :    .    :    .    :
1782 AGAGGACACTGTTTCATAAAAGCTTGAGAACCACTGCTCCAAGAC T TG
    | |||||| |||| |||||||||||||||||| || |||   |||-|-||
291 ACAGGACAGTGTTCCATAAAAGCTTGAGAACCGCTTCTCTGCGACTTCTG

150     .    :
1830 ACATCTTTTTAAA
    |||||||||||||
341 ACATCTTTTTAAA

0     .    :    .    :    .    :    .    :    .    :
2073 AAAATTCTGAACTCAGGACTTGGCAAGTA...TAGGTGTCTCTATGTTGT
    |||||||||||||| |||| ||||||>>>...>>>|| ||||||||||||
438 AAAATTCTGAACTCGGGACGTGGCAA         GTATCTCTATGTTGT

50     .    :    .    :    .    :    .    :    .    :
14054 CTCCTAGAGTGGGTAGTCCTGCTTCTTTTACCCAGTTACTTTCCCGT TT
    |||||| || ||| |||||||||||||||| |||||||||||| |-|-||
479 CTCCTAAAGCGGGCAGTCCTGCTTCTTTTATCCAGTTACTTTCTC TCTT

100     .    :    .    :    .    :    .    :    .    :
14103  TTGAAATGTGGCTATCACTTCTCTACACATTACCTCCATGATTTGGAAT
    -||||||||||| | || |||||||||||||||||||||||  |||||||
528 CTTGAAATGTGGTTGTCTCTTCTCTACACATTACCTCCATGGCTTGGAAT

150     .    :    .    :    .    :    .    :    .    :
14152 GGAAAAGGCCACTTTTCTTTTTGTTCTGCCTCTCAAATTCAACACAGAGG
    ||||||||||||||||||||||||||  | ||||| |||||||||||||-
578 GGAAAAGGCCACTTTTCTTTTTGTTCCACGTCTCAGATTCAACACAGAG 

200     .    :    .    :    .    :    .    :    .    :
14202 A GCTCCTAGGATTCCAGTTATCCTGCT AACATCTCCAGGAAGAAAGAA
    |-|| |||||| || |||||||||||-|-||| |||||||||||||| ||
627 ATGCCCCTAGGGTTGCAGTTATCCTG TCAACTTCTCCAGGAAGAAAAAA

250     .    :    .    :    .    :    .    :    .    :
14250 GCAACTCA CATGGGTCTTT TGCTG T TTG CTTAATTATAAAGACAT
    ||||||-|-||| |||-|||-|||||-|-|| -|||||||||||||||||
676 GCAACT AGCATAGGT TTTCTGCTGCTATTTCCTTAATTATAAAGACAT

300     .    :    .    :    .    :    .    :    .    :
14295 CATTTTGCAAGCAGAAGGCTGA GTTTCATTTGAAACAGGTGCTTAGGTG
    ||||||||||| |||| ||||-||||||||||||||||| |||||||||
724 TATTTTGCAAGCTGAAGACTGATGTTTCATTTGAAACAGGGGCTTAGGTG

350     .    :    .    :    .    :    .    :    .    :
14344 GTGGTATTTGTGAAT     ACTTTTCATTCCAAGCAAGAAGACTAAAGA
    || ||||||||||||-----||||||  |||||| ||||||||||||| |
774 GTAGTATTTGTGAATTATTTACTTTTTGTTCCAAACAAGAAGACTAAATA

400     .    :    .    :    .    :    .    :    .    :
14389 AGTAGCAAGTATGAATGACTTCAGGGTTTAAAAAAAATGTCTTC CAGTT
    |  |||||| ||| || |||||||| |||      |||||||||-||-| 
824 AACAGCAAGCATGGATAACTTCAGGATTTTTTTTTAATGTCTTCTCA TG

450     .    :    .    :    .    :    .    :    .    :
14438 TCAGCCACTACCATGATAAGCACAGTTGAGACTGCAGCAGTAAATTCCAA
    |||||||-|----|-||---|||----|||||| ||||||||||||||||
873 TCAGCCA T    T AT   CAC    GAGACTACAGCAGTAAATTCCAA

500     .    :    .    :    .    :    .    :    .    :
14488 ATATGTGTTTCTAATTTGACGTGAAAGATACTAAAAA TT   TATATTT
    ||||||| |||||||||||  ||||||||||||||||-||---|||||||
910 ATATGTGGTTCTAATTTGAAATGAAAGATACTAAAAAGTTACATATATTT

550     .    :    .    :    .    :    .    :    .    :
14534 GTATATTTAAATCCTGGCTCATCCTGTGACATAGATTTACTGAATAGGAA
    |||| |||||||||||||||| ||| |||  ||||||||||||| |||
960 ACATATCTAAATCCTGGCTCATCTTGTAACACGGATTTACTGAATAAGAA

600     .    :    .
14584 CAAAGGCCCAATTTT
    |||||| ||||||||
1010 CAAAGGTCCAATTTT

I would be glad to read some hints/advice!

msa alignment fasta parsing python • 3.0k views
11.0 years ago
Hamish ★ 3.2k

The sim4 alignment format is briefly described in the accompanying documentation (see http://globin.bx.psu.edu/html/docs/sim4.html).

In your sample alignment the last two exons (2073-2098 and 14039-14598) are joined by a splice site. sim4 marks the intron with the ">>>...>>>" symbols in the match line and prints both exons as one continuous alignment. The 14054 coordinate is therefore part-way through the last exon: the first 15 bases of that exon appear at the end of the previous row, after the intron marker, so the exon starts at 14054 - 15 = 14039, which agrees with the coordinates stated in the summary.
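As a starting point for the summary part, here is a minimal Python sketch that reads the exon lines from your example and repeats the coordinate check above. The regular expression is inferred only from the sample in your post, not from a format specification, and the names `EXON_RE`, `parse_exon_summary`, and the dictionary keys are made up for illustration. It assumes the first coordinate pair belongs to the gene and the pair in parentheses to the mRNA, as in your example.

```python
import re

# One exon line of the summary, e.g. "2073-2098  (438-463)   92% -> (GT/AG) 22".
# Groups: start/end in the first sequence, start/end in the second sequence,
# percent identity, and an optional link token ("==" or "->") after it.
# Anything following the link token is ignored by this sketch.
EXON_RE = re.compile(r"^(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%\s*(\S+)?")


def parse_exon_summary(text):
    """Return one dict per exon line in `text`; all other lines are skipped."""
    exons = []
    for line in text.splitlines():
        m = EXON_RE.match(line.strip())
        if m is None:
            continue  # header lines such as ">gen; LEN=15595", blank lines, etc.
        gen_start, gen_end, mrna_start, mrna_end, identity = map(int, m.groups()[:5])
        exons.append({
            "gen_start": gen_start,
            "gen_end": gen_end,
            "mrna_start": mrna_start,
            "mrna_end": mrna_end,
            "identity": identity,
            "link": m.group(6),  # None when nothing follows the percentage
        })
    return exons


# Summary part copied from the question.
SUMMARY = """\
>gen; LEN=15595
>mRNA; LEN=1036

37-53  (29-45)   100% ==
1221-1235  (101-115)   100% ==
1685-1842  (191-353)   87% ==
2073-2098  (438-463)   92% -> (GT/AG) 22
14039-14598  (464-1024)   83%
"""

exons = parse_exon_summary(SUMMARY)
last = exons[-1]
# The alignment row for the last exon starts at 14054, 15 bases into the exon.
print(last["gen_start"], 14054 - 15)
```

Both printed numbers should be the same, which is the agreement between summary and alignment described above.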

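For the alignment part, a rough sketch of how one four-line block (ruler, first sequence, match line, second sequence) could be turned into a pair of gapped strings for FASTA output. It relies on an assumption taken from the text as pasted in the question, namely that each sequence line is a coordinate, exactly one space, and then the aligned residues, with blanks standing for gaps. Please check this against your real sim4 output, where the column layout may differ. The helper names `split_seq_line` and `gapped_pair` are hypothetical, and rows containing the "..." intron marker are not treated specially here.

```python
import re

# Assumed layout of a sequence line: optional indent, coordinate, ONE space,
# then the aligned residues. Blanks inside the residues are gaps.
SEQ_LINE_RE = re.compile(r"^\s*(\d+) (.*)$")


def split_seq_line(line):
    """Split one sequence line into (start coordinate, gapped sequence)."""
    m = SEQ_LINE_RE.match(line.rstrip("\n"))
    if m is None:
        raise ValueError("not a sequence line: %r" % line)
    # Do not strip the residues: leading and trailing blanks are gaps too.
    return int(m.group(1)), m.group(2).replace(" ", "-")


def gapped_pair(block_lines):
    """Take [ruler, first seq, match line, second seq]; return two gapped strings."""
    _, top = split_seq_line(block_lines[1])
    _, bottom = split_seq_line(block_lines[3])
    # Pad the shorter string in case trailing blanks were lost when copying.
    width = max(len(top), len(bottom))
    return top.ljust(width, "-"), bottom.ljust(width, "-")


# One block copied from the question (the row starting at 14103).
block = [
    "100     .    :    .    :    .    :    .    :    .    :",
    "14103  TTGAAATGTGGCTATCACTTCTCTACACATTACCTCCATGATTTGGAAT",
    "    -||||||||||| | || |||||||||||||||||||||||  |||||||",
    "528 CTTGAAATGTGGTTGTCTCTTCTCTACACATTACCTCCATGGCTTGGAAT",
]

gen_row, mrna_row = gapped_pair(block)
print(">gen\n" + gen_row)
print(">mRNA\n" + mrna_row)
```

Concatenating the rows of all blocks, in order, would give the two full gapped sequences. How to represent the intron rows in the FASTA file is a design decision left open here.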