I need to predict pseudogenes in a eukaryotic species. I have the genome sequence, and gene prediction has already been done.
I have learned that there are two programs for this, Pseudopipe (paper) and PPFINDER. However, Pseudopipe needs some information from the Ensembl MySQL database, and I did not use the Ensembl pipeline to predict genes. I also have problems using PPFINDER, which needs synteny information with another species.
I would like to know whether there is any convenient way to predict pseudogenes. Thank you!
I have never used either of these programs, but some simple methods can help you predict some pseudogenes. Pseudogenes arise from protein-coding genes, so you can scan your genome for sequences homologous to the known protein-coding genes it contains (basically looking for unannotated paralogous sequences). This can be done with tools such as BLAST or BLAT.
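As a rough sketch of this step, assuming you searched the genome with your annotated proteins (for example with tblastn) and saved the standard 12-column tabular output (`-outfmt 6`), you could keep only the hits that fall outside annotated genes. The function names, the layout of the `genes` dictionary and the e-value cutoff are my own choices for illustration, not part of any tool.

```python
def parse_blast_tab(lines):
    """Yield (query, chrom, start, end, evalue) from BLAST tabular (-outfmt 6) lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        s_start, s_end = int(fields[8]), int(fields[9])
        # Minus-strand hits are reported with sstart > send, so normalise them.
        yield fields[0], fields[1], min(s_start, s_end), max(s_start, s_end), float(fields[10])


def unannotated_hits(hits, genes, max_evalue=1e-5):
    """Keep hits that do not overlap any annotated gene.

    genes: dict mapping chromosome -> list of (start, end), 1-based inclusive.
    max_evalue is an arbitrary cutoff chosen for this example.
    """
    kept = []
    for query, chrom, start, end, evalue in hits:
        if evalue > max_evalue:
            continue
        overlaps = any(start <= g_end and g_start <= end
                       for g_start, g_end in genes.get(chrom, []))
        if not overlaps:
            kept.append((query, chrom, start, end))
    return kept
```

The hits that survive are only candidates: they can also be unannotated functional genes, so they still need to be checked for coding potential.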
Once this is done, you have to check whether these sequences have kept their protein-coding ability. Typically, you look for frameshifts or ORF disruptions, which can be caused by indels or by mutations in start/stop codons. A sequence that is homologous to a protein-coding gene but lacks an intact ORF is likely to be a pseudogene.
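To make the ORF check concrete, here is a minimal sketch. It assumes you have already extracted the candidate region as a nucleotide string on the coding strand, starting where the parent gene's CDS aligns, and that the standard genetic code applies. The function name and the problem labels are made up for this example.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def orf_disruptions(seq):
    """Return a list of reasons why seq does not look like an intact ORF."""
    seq = seq.upper()
    problems = []
    if len(seq) % 3 != 0:
        problems.append("length_not_multiple_of_3")  # hints at a frameshifting indel
    if not seq.startswith("ATG"):
        problems.append("no_start_codon")
    usable = len(seq) - len(seq) % 3
    codons = [seq[i:i + 3] for i in range(0, usable, 3)]
    # Codons with ambiguous bases are translated as X.
    protein = "".join(CODON_TABLE.get(codon, "X") for codon in codons)
    if "*" in protein[:-1]:
        problems.append("premature_stop")
    if not protein.endswith("*"):
        problems.append("no_terminal_stop")
    return problems
```

An empty list means no obvious disruption was found, which is not proof that the sequence is functional. For candidates with introns you would need to splice the exons together first, which this sketch does not attempt.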
You can then extend your analysis by including genes from other species. A pseudogene can be a single-copy gene (without paralogs) that has been lost in a given species. The method is the same, except that you look for orthologous rather than paralogous sequences: scan your genome of interest for sequences homologous to protein-coding genes present in more or less closely related species. An addition to this method is to look at synteny. If the sequence is quite degenerate but the flanking regions (ideally containing several protein-coding genes) are conserved, this gives more confidence in your detection.
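The synteny idea can be sketched very simply. Assume you know which genes flank the functional gene in the related species, and where (if anywhere) their orthologs sit in your genome. You can then ask what fraction of those flanking orthologs lie close to the candidate. The function name, the input layout and the 500 kb window are assumptions for illustration only.

```python
def synteny_support(candidate, flank_orthologs, window=500_000):
    """Fraction of flanking genes whose ortholog lies near the candidate.

    candidate: (chrom, start, end) of the putative pseudogene.
    flank_orthologs: dict mapping flanking gene name -> (chrom, start, end)
        in the genome of interest, or None when no ortholog was found.
    window: maximum allowed gap in bp (arbitrary value for this example).
    """
    if not flank_orthologs:
        return 0.0
    chrom, start, end = candidate
    supported = 0
    for location in flank_orthologs.values():
        if location is None:
            continue
        o_chrom, o_start, o_end = location
        if o_chrom != chrom:
            continue
        # Gap between the two intervals; 0 if they overlap.
        gap = max(0, o_start - end, start - o_end)
        if gap <= window:
            supported += 1
    return supported / len(flank_orthologs)
```

A value close to 1 means most of the neighbourhood is conserved around the candidate. This ignores gene order and orientation, which a proper synteny analysis would take into account.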
A good confirmation of pseudogenes is to look at the dN/dS of these sequences. Pseudogenes generally evolve neutrally and therefore display a dN/dS close to 1. Nonetheless, this can be affected by how long the gene has been decaying: a recently inactivated gene may still carry the signature of past purifying selection.
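For illustration, here is a simplified pairwise dN/dS estimate based on Nei-Gojobori style counting with a Jukes-Cantor correction. It assumes the two sequences (candidate and parent gene, or candidate and ortholog) are already codon-aligned and in frame; codons containing gaps, ambiguous bases or stops are skipped. All names are my own, and changes to stop codons are counted as nonsynonymous, which is one of several conventions. For real analyses an established package such as PAML is the safer choice.

```python
from itertools import permutations, product
from math import log

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def syn_sites(codon):
    """Synonymous sites in one codon: per position, the fraction of the
    three possible substitutions that leave the amino acid unchanged."""
    total = 0.0
    for pos in range(3):
        for base in BASES:
            if base != codon[pos]:
                mutant = codon[:pos] + base + codon[pos + 1:]
                if CODE[mutant] == CODE[codon]:
                    total += 1 / 3
    return total


def codon_diffs(c1, c2):
    """(synonymous, nonsynonymous) differences between two sense codons,
    averaged over all mutation orders that avoid intermediate stop codons."""
    positions = [i for i in range(3) if c1[i] != c2[i]]
    syn = nonsyn = 0.0
    n_paths = 0
    for order in permutations(positions):
        cur, s, n = c1, 0, 0
        for pos in order:
            nxt = cur[:pos] + c2[pos] + cur[pos + 1:]
            if CODE[nxt] == "*":
                break
            if CODE[nxt] == CODE[cur]:
                s += 1
            else:
                n += 1
            cur = nxt
        else:  # path completed without passing through a stop codon
            syn += s
            nonsyn += n
            n_paths += 1
    if n_paths == 0:
        return 0.0, float(len(positions))  # crude fallback, an assumption of this sketch
    return syn / n_paths, nonsyn / n_paths


def jukes_cantor(p):
    """Jukes-Cantor distance; NaN when p is too large to be corrected."""
    if p >= 0.75:
        return float("nan")
    return -0.75 * log(1 - 4 * p / 3) if p > 0 else 0.0


def dn_ds(seq1, seq2):
    """Return (dN, dS, dN/dS) for two codon-aligned sequences."""
    S = N = Sd = Nd = 0.0
    for i in range(0, min(len(seq1), len(seq2)) - 2, 3):
        c1, c2 = seq1[i:i + 3].upper(), seq2[i:i + 3].upper()
        if CODE.get(c1, "*") == "*" or CODE.get(c2, "*") == "*":
            continue  # skip gaps, ambiguous bases and stop codons
        s = (syn_sites(c1) + syn_sites(c2)) / 2
        S += s
        N += 3 - s
        sd, nd = codon_diffs(c1, c2)
        Sd += sd
        Nd += nd
    if S == 0 or N == 0:
        return float("nan"), float("nan"), float("nan")
    dN, dS = jukes_cantor(Nd / N), jukes_cantor(Sd / S)
    ratio = dN / dS if dS > 0 else float("nan")
    return dN, dS, ratio
```

With short alignments or very few substitutions the ratio is noisy or undefined, so treat a single pairwise value as a hint rather than a test.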
Here is a reference to a paper that uses a partially similar method (it is a bit more complex). It focuses on only one gene, but the approach is usable at a large scale.
Also, looking at RNA-Seq data might be helpful. You can find out whether a gene is transcribed or not, at least in the tissue of your sample. It is also useful because it lets you check whether the annotation you are using is accurate.
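As a small illustration, assuming you already have per-feature read counts and feature lengths from whatever quantification tool you use, you could compute TPM and list the candidates with essentially no expression. The function names and the 1 TPM threshold are arbitrary choices for this sketch.

```python
def tpm(counts, lengths):
    """Transcripts per million from read counts and feature lengths (bp)."""
    rates = {name: counts[name] / lengths[name] for name in counts}
    total = sum(rates.values())
    if total == 0:
        return {name: 0.0 for name in counts}
    return {name: rate / total * 1e6 for name, rate in rates.items()}


def silent_candidates(counts, lengths, candidates, min_tpm=1.0):
    """Candidates whose TPM is below min_tpm (threshold is an assumption)."""
    values = tpm(counts, lengths)
    return sorted(name for name in candidates if values.get(name, 0.0) < min_tpm)
```

Keep in mind that reads from the parent gene can be mis-mapped to a very similar pseudogene, and that some pseudogenes are transcribed, so expression alone does not settle the question.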