Hey there,
I am currently trying to write a parser for the output of the SIM4 tool.
My intention is to use SIM4 to align mRNA to its respective gene. To be more specific, I want to run sim4 on two sequences (mRNA.fas and gen.fas), save the output and use my parser to get the alignment file in plain fasta format.
However, the output of sim4 is not trivial to parse (at least for me). In particular, I do not understand why the coordinates in the first part of the output (the exon endpoints) differ from those in the second part (the alignment). For example, the last line of the first part gives a start coordinate of 14039, while the first coordinate shown for it in the alignment is 14054. Where does this difference come from?
At http://pipmaker.bx.psu.edu/pipmaker/pip-instr.html there is a description of the output format, but it does not go into details very much.
There is a Bioperl parser for sim4. I haven't run it, but I tried to understand it and port it to Python. Unfortunately I do not know much Perl and find the Bioperl code difficult to follow, since the information is "hidden" in so many different kinds of objects. Of course, if writing my own parser turns out to be too cumbersome, I will dig deeper into how to use this Bioperl module.
I have 2 questions:
- Does anyone already have a Python script that parses sim4 output and would be so kind as to share it with me?
- How is the sim4 output format defined?
Here is an example of what the output of sim4 looks like:
###First Part (with exon endpoints)
>gen; LEN=15595
>mRNA; LEN=1036
37-53 (29-45) 100% ==
1221-1235 (101-115) 100% ==
1685-1842 (191-353) 87% ==
2073-2098 (438-463) 92% -> (GT/AG) 22
14039-14598 (464-1024) 83%
###Second Part (with alignment)
0 . : .
37 GCTGCAAACTGGCCTCT
|||||||||||||||||
29 GCTGCAAACTGGCCTCT
0 . : .
1221 AGGAGAAGAAAAAAA
|||||||||||||||
101 AGGAGAAGAAAAAAA
0 . : . : . : . : . :
1685 CAAAGGTCTTAAGTAGGGGTCCATAAACTACTGGAAATTTTATGCACAAT
||||||||||||||||| | | |||||||| |||||||||||||||||||
191 CAAAGGTCTTAAGTAGGTGACTATAAACTATTGGAAATTTTATGCACAAT
50 . : . : . : . : . :
1735 TATACATTTTTGTGGGAAGAAGAAGGTCCATAGTATCAG A TTCTCAA
|| |||||||||| | |||||||||||||||| ||||||-|--|||||||
241 TACACATTTTTGTAGAAAGAAGAAGGTCCATAATATCAGCAGATTCTCAA
100 . : . : . : . : . :
1782 AGAGGACACTGTTTCATAAAAGCTTGAGAACCACTGCTCCAAGAC T TG
| |||||| |||| |||||||||||||||||| || ||| |||-|-||
291 ACAGGACAGTGTTCCATAAAAGCTTGAGAACCGCTTCTCTGCGACTTCTG
150 . :
1830 ACATCTTTTTAAA
|||||||||||||
341 ACATCTTTTTAAA
0 . : . : . : . : . :
2073 AAAATTCTGAACTCAGGACTTGGCAAGTA...TAGGTGTCTCTATGTTGT
|||||||||||||| |||| ||||||>>>...>>>|| ||||||||||||
438 AAAATTCTGAACTCGGGACGTGGCAA GTATCTCTATGTTGT
50 . : . : . : . : . :
14054 CTCCTAGAGTGGGTAGTCCTGCTTCTTTTACCCAGTTACTTTCCCGT TT
|||||| || ||| |||||||||||||||| |||||||||||| |-|-||
479 CTCCTAAAGCGGGCAGTCCTGCTTCTTTTATCCAGTTACTTTCTC TCTT
100 . : . : . : . : . :
14103 TTGAAATGTGGCTATCACTTCTCTACACATTACCTCCATGATTTGGAAT
-||||||||||| | || ||||||||||||||||||||||| |||||||
528 CTTGAAATGTGGTTGTCTCTTCTCTACACATTACCTCCATGGCTTGGAAT
150 . : . : . : . : . :
14152 GGAAAAGGCCACTTTTCTTTTTGTTCTGCCTCTCAAATTCAACACAGAGG
|||||||||||||||||||||||||| | ||||| |||||||||||||-
578 GGAAAAGGCCACTTTTCTTTTTGTTCCACGTCTCAGATTCAACACAGAG
200 . : . : . : . : . :
14202 A GCTCCTAGGATTCCAGTTATCCTGCT AACATCTCCAGGAAGAAAGAA
|-|| |||||| || |||||||||||-|-||| |||||||||||||| ||
627 ATGCCCCTAGGGTTGCAGTTATCCTG TCAACTTCTCCAGGAAGAAAAAA
250 . : . : . : . : . :
14250 GCAACTCA CATGGGTCTTT TGCTG T TTG CTTAATTATAAAGACAT
||||||-|-||| |||-|||-|||||-|-|| -|||||||||||||||||
676 GCAACT AGCATAGGT TTTCTGCTGCTATTTCCTTAATTATAAAGACAT
300 . : . : . : . : . :
14295 CATTTTGCAAGCAGAAGGCTGA GTTTCATTTGAAACAGGTGCTTAGGTG
||||||||||| |||| ||||-||||||||||||||||| |||||||||
724 TATTTTGCAAGCTGAAGACTGATGTTTCATTTGAAACAGGGGCTTAGGTG
350 . : . : . : . : . :
14344 GTGGTATTTGTGAAT ACTTTTCATTCCAAGCAAGAAGACTAAAGA
|| ||||||||||||-----|||||| |||||| ||||||||||||| |
774 GTAGTATTTGTGAATTATTTACTTTTTGTTCCAAACAAGAAGACTAAATA
400 . : . : . : . : . :
14389 AGTAGCAAGTATGAATGACTTCAGGGTTTAAAAAAAATGTCTTC CAGTT
| |||||| ||| || |||||||| ||| |||||||||-||-|
824 AACAGCAAGCATGGATAACTTCAGGATTTTTTTTTAATGTCTTCTCA TG
450 . : . : . : . : . :
14438 TCAGCCACTACCATGATAAGCACAGTTGAGACTGCAGCAGTAAATTCCAA
|||||||-|----|-||---|||----|||||| ||||||||||||||||
873 TCAGCCA T T AT CAC GAGACTACAGCAGTAAATTCCAA
500 . : . : . : . : . :
14488 ATATGTGTTTCTAATTTGACGTGAAAGATACTAAAAA TT TATATTT
||||||| ||||||||||| ||||||||||||||||-||---|||||||
910 ATATGTGGTTCTAATTTGAAATGAAAGATACTAAAAAGTTACATATATTT
550 . : . : . : . : . :
14534 GTATATTTAAATCCTGGCTCATCCTGTGACATAGATTTACTGAATAGGAA
|||| |||||||||||||||| ||| ||| ||||||||||||| |||
960 ACATATCTAAATCCTGGCTCATCTTGTAACACGGATTTACTGAATAAGAA
600 . : .
14584 CAAAGGCCCAATTTT
|||||| ||||||||
1010 CAAAGGTCCAATTTT
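For the first part, this is roughly the kind of parsing I have in mind. It is only a minimal sketch: the line layout (`gstart-gend (mstart-mend) NN%`, optionally followed by a connector token such as `->` or `==`) is my assumption based on the example above, not on any format specification, and the names `Exon` and `parse_exon_lines` are my own.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed shape of an exon line, inferred only from the example output:
#   gstart-gend (mstart-mend) NN% [connector] [anything else]
EXON_RE = re.compile(
    r"^\s*(\d+)-(\d+)\s+\((\d+)-(\d+)\)\s+(\d+)%\s*(\S+)?"
)


class Exon(NamedTuple):
    gen_start: int            # start coordinate in the genomic sequence
    gen_end: int              # end coordinate in the genomic sequence
    mrna_start: int           # start coordinate in the mRNA
    mrna_end: int             # end coordinate in the mRNA
    identity: int             # percent identity as printed by sim4
    connector: Optional[str]  # token after the percentage, e.g. "->" or "=="


def parse_exon_lines(lines) -> List[Exon]:
    """Collect every line that looks like an exon line; skip all others."""
    exons = []
    for line in lines:
        match = EXON_RE.match(line)
        if match is None:
            continue  # header, ruler, sequence or match line
        g1, g2, m1, m2, pct = (int(x) for x in match.group(1, 2, 3, 4, 5))
        exons.append(Exon(g1, g2, m1, m2, pct, match.group(6)))
    return exons


# Usage with the first part of the example output
sample = """>gen; LEN=15595
>mRNA; LEN=1036
37-53 (29-45) 100% ==
1221-1235 (101-115) 100% ==
1685-1842 (191-353) 87% ==
2073-2098 (438-463) 92% -> (GT/AG) 22
14039-14598 (464-1024) 83%
"""
exons = parse_exon_lines(sample.splitlines())
```

This only covers the exon endpoints; the alignment blocks in the second part are the part I am unsure how to interpret.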
I would be glad to read some hints/advice!