Gene prediction is one of the most common tasks in the bioinformatic analysis of newly sequenced genomes. AUGUSTUS is an excellent gene prediction tool for eukaryotic genomes. It can predict genes ab initio (de novo) or with the help of hints (e.g. RNA-seq/EST data, protein alignments, syntenic genomic alignments). In this tutorial we explain how to use protein profiles to improve gene finding in genomic FASTA files. To this end, we discuss the AUGUSTUS protein profile extension (PPX) and walk through all the steps needed to run a prediction with a protein profile.
PPX supplements the gene prediction procedure with information about protein family conservation. This information normally comes from so-called protein block profiles. A protein profile file contains position-specific frequency matrices that model the conserved regions (blocks) of a multiple sequence alignment (MSA), that is, the regions with no indels. When PPX is used for gene prediction, genes that match the provided profiles are predicted with much higher accuracy than the remaining genes, which are predicted ab initio.
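To make the idea of a position-specific frequency matrix more concrete, here is a minimal Python sketch that computes one from a single ungapped alignment block. It is only an illustration of the concept: the function name `block_frequency_matrix`, the `pseudocount` parameter and the toy sequences are assumptions made for this example, and the output is not the profile file format that AUGUSTUS reads.

```python
from collections import Counter

# The 20 standard amino acids, used as the rows of the matrix.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def block_frequency_matrix(block, pseudocount=0.0):
    """Return one {residue: frequency} dict per column of an ungapped block.

    `block` is a list of equally long, gap-free protein sequences
    (one conserved region cut out of a multiple sequence alignment).
    `pseudocount` is a hypothetical smoothing parameter for this sketch.
    """
    if not block:
        raise ValueError("block must contain at least one sequence")
    width = len(block[0])
    if any(len(seq) != width for seq in block):
        raise ValueError("all sequences in a block must have the same length")
    for seq in block:
        for residue in seq:
            if residue not in AMINO_ACIDS:
                # This also rejects gap characters: a block has no indels.
                raise ValueError(f"unexpected character {residue!r} in block")

    matrix = []
    # zip(*block) walks through the alignment column by column.
    for column in zip(*block):
        counts = Counter(column)
        total = len(column) + pseudocount * len(AMINO_ACIDS)
        matrix.append(
            {aa: (counts[aa] + pseudocount) / total for aa in AMINO_ACIDS}
        )
    return matrix


# Toy block of four made-up sequences, five columns wide.
toy_block = ["GKSTL", "GKTTL", "GKSSL", "AKSTL"]
matrix = block_frequency_matrix(toy_block)

for position, frequencies in enumerate(matrix, start=1):
    observed = {aa: f for aa, f in frequencies.items() if f > 0}
    print(position, observed)
```

Each column of the block yields one frequency distribution over the 20 amino acids. Strongly conserved columns (such as the second column of the toy block, which is always `K`) concentrate all the weight on a single residue, and this is the kind of signal a profile contributes to the prediction.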