Question

Why are only 2 amino acids used in gene prediction?

6.9 years ago

ssshan • 0

Dear all,

I'm a master's candidate interested in machine learning for gene prediction. I noticed that most papers pick dimers (2 amino acids) as a key feature when training on positive and negative data sets for gene prediction. However, I don't know why dimers are the only or the best option. Could anyone help?

Thanks in advance!

gene alignment sequence • 1.2k views

updated 6.9 years ago by Andrzej Zielezinski 11k • written 6.9 years ago by ssshan • 0

score 4 · Accepted Answer · 2017-05-29

Hexamers (6-nt-long words) are accepted as the most accurate k-mer frequency-based measure of coding potential. Note that a hexamer spans two adjacent codons, so it corresponds to the dimers (2 amino acids) you mention. In 1992, a systematic study of more than twenty compositional properties indicated that hexamer composition gave the best discrimination between coding and non-coding regions (Fickett & Tung, Nucleic Acids Research, 1992). Since then, reading frame-dependent hexamer frequencies have been the most commonly used content sensor in gene prediction programs.
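
To make the idea concrete, here is a minimal Python sketch of how in-frame hexamer frequencies could be counted and turned into a coding score. The function names (`hexamer_counts`, `hexamer_score`), the pseudocount, and the use of a mean log-likelihood ratio are my own illustrative assumptions; this is not the exact method of Fickett & Tung or of any particular gene prediction program.

```python
from collections import Counter
from math import log


def hexamer_counts(seq, frame=0):
    """Count hexamers in one reading frame, stepping one codon (3 nt) at a time."""
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 5, 3):
        word = seq[i:i + 6]
        # Skip words containing ambiguous bases such as N.
        if set(word) <= set("ACGT"):
            counts[word] += 1
    return counts


def hexamer_score(seq, coding_counts, noncoding_counts, pseudocount=1.0):
    """Mean log-likelihood ratio (coding vs. non-coding) over in-frame hexamers.

    The pseudocount is an assumption made here to avoid zero frequencies.
    """
    n_words = 4 ** 6  # number of possible hexamers
    coding_total = sum(coding_counts.values()) + pseudocount * n_words
    noncoding_total = sum(noncoding_counts.values()) + pseudocount * n_words

    words = hexamer_counts(seq)
    n = sum(words.values())
    if n == 0:
        return 0.0

    total = 0.0
    for word, k in words.items():
        p_coding = (coding_counts.get(word, 0) + pseudocount) / coding_total
        p_noncoding = (noncoding_counts.get(word, 0) + pseudocount) / noncoding_total
        total += k * log(p_coding / p_noncoding)
    return total / n


# Toy training sets; real ones would be built from annotated CDS and
# non-coding sequences.
coding = hexamer_counts("ATGGCCATGGCC")
noncoding = hexamer_counts("TTTTTTTTTTTT")
score = hexamer_score("ATGGCCATGGCC", coding, noncoding)
```

A positive score means the sequence's hexamers look more like the coding training set, and a negative score means they look more like the non-coding one. Stepping by 3 nt is what makes the counts reading frame-dependent; in practice you would score each candidate frame separately.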