I am new to genome annotation and I'm struggling to interpret the results produced by BRAKER.
- I have a de novo assembly of an insect genome (N50 = 350 kb, length = 1.9 Gb).
- I masked the repeats using RepeatModeler and RepeatMasker.
- I mapped the RNA-Seq data from the same species to the hard-masked genome with HISAT2.
- I used BRAKER to annotate my soft-masked genome with the BAM file produced by HISAT2.
I have 54,000 entries in the resulting augustus.hints.gff file. That means that Augustus predicted 54k genes, right? We expect between 10k and 20k genes for our species, so I would like to understand why the prediction contains so many.
Among these 54k entries, 38k contain the following line:
# % of transcript supported by hints (any source): 0
Does it mean that these predictions are of poor quality and that I should only keep predictions with a substantial percentage of hint support?
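To double-check the numbers above, here is a minimal Python sketch of how the counts could be reproduced. It tallies feature types (so that genes and transcripts are counted separately) and collects the hint-support percentages from the comment lines. It assumes the file is tab-separated with the feature type in the third column and that the support value appears in comment lines worded exactly like the one quoted above. The function name `summarize_augustus_gff` and the 0 % cutoff are my own choices for illustration, not part of BRAKER or Augustus.

```python
import re
import sys
from collections import Counter

# Matches the comment line quoted in the post, e.g.
# "# % of transcript supported by hints (any source): 0"
SUPPORT_RE = re.compile(
    r"^#\s*% of transcript supported by hints \(any source\):\s*(\d+(?:\.\d+)?)"
)


def summarize_augustus_gff(lines):
    """Count features per type and collect hint-support percentages.

    Hypothetical helper: assumes tab-separated feature lines with the
    feature type (gene, transcript, CDS, ...) in column 3.
    """
    feature_counts = Counter()
    support = []
    for raw in lines:
        line = raw.rstrip("\n")
        match = SUPPORT_RE.match(line)
        if match:
            support.append(float(match.group(1)))
            continue
        if not line or line.startswith("#"):
            continue  # skip other comments and blank lines
        fields = line.split("\t")
        if len(fields) >= 9:
            feature_counts[fields[2]] += 1
    return feature_counts, support


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1]) as handle:
        counts, support = summarize_augustus_gff(handle)
    unsupported = sum(1 for value in support if value == 0)
    print("genes:", counts["gene"])
    print("transcripts:", counts["transcript"])
    print("transcripts with 0% hint support:", unsupported, "of", len(support))
```

Saved as, say, `summarize.py`, it would be run as `python summarize.py augustus.hints.gff`. If the gene count is clearly lower than the transcript count, then part of the 54k would come from alternative transcripts rather than from separate genes.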
Any other suggestions on how to enhance the annotation of my genome are welcome!