Hello,
I am new to genome annotation and I'm struggling to interpret the results produced by BRAKER.
- I have a de novo assembly of an insect genome (N50 = 350kb, length = 1.9 Gb).
- I masked the repeats using RepeatModeler and RepeatMasker.
- I mapped the RNA-Seq data from the same species to the (hard) masked genome with HISAT2.
- I used BRAKER to annotate my (soft masked) genome, using the BAM file produced by HISAT2.
I have 54,000 entries in the resulting augustus.hints.gff file. That means that Augustus predicted 54k genes, right? We expect between 10k and 20k genes for our species, so I would like to understand why the prediction contains so many.
Among these 54k entries, 38k entries contain the following information:
# % of transcript supported by hints (any source): 0
Does this mean that these predictions are of poor quality, and that I should only keep predictions with a substantial percentage of hint support?
Any other suggestions on how to enhance the annotation of my genome are welcome!
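One way to check what the 54k entries actually are, and how hint support is distributed, is to tally the feature types and the per-transcript support values directly from the GFF. The following is only a minimal sketch: it assumes that the third column holds the feature type (`gene`, `transcript`, `CDS`, ...), that the ninth column of a `transcript` line holds the transcript ID, and that the `# % of transcript supported by hints (any source)` comment follows the transcript it describes. The function name `summarize_augustus_gff` and the two-gene `SAMPLE` text are made up for illustration and are not real output.

```python
import re
from collections import Counter

# Matches the support comment quoted in the question above.
SUPPORT_RE = re.compile(
    r"^#\s*% of transcript supported by hints \(any source\):\s*([\d.]+)"
)


def summarize_augustus_gff(lines):
    """Count feature types and collect the hint support (%) per transcript.

    Assumes (not verified against every AUGUSTUS/BRAKER version) that the
    support comment appears after the transcript line it belongs to.
    """
    features = Counter()   # feature type -> number of lines
    support = {}           # transcript ID (raw column 9) -> % supported
    current_tx = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("#"):
            match = SUPPORT_RE.match(line)
            if match and current_tx is not None:
                support[current_tx] = float(match.group(1))
            continue
        cols = line.split("\t")
        if len(cols) < 9:
            continue  # skip anything that is not a 9-column feature line
        features[cols[2]] += 1
        if cols[2] == "transcript":
            current_tx = cols[8]
    return features, support


# Tiny invented example with two genes, one supported and one unsupported.
SAMPLE = """\
# start gene g1
chr1\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tg1
chr1\tAUGUSTUS\ttranscript\t100\t900\t0.9\t+\t.\tg1.t1
chr1\tAUGUSTUS\tCDS\t100\t900\t0.9\t+\t0\ttranscript_id "g1.t1"; gene_id "g1";
# % of transcript supported by hints (any source): 100
# end gene g1
# start gene g2
chr1\tAUGUSTUS\tgene\t2000\t2600\t0.4\t-\t.\tg2
chr1\tAUGUSTUS\ttranscript\t2000\t2600\t0.4\t-\t.\tg2.t1
chr1\tAUGUSTUS\tCDS\t2000\t2600\t0.4\t-\t0\ttranscript_id "g2.t1"; gene_id "g2";
# % of transcript supported by hints (any source): 0
# end gene g2
"""

features, support = summarize_augustus_gff(SAMPLE.splitlines())
unsupported = [tx for tx, pct in support.items() if pct == 0]

print("gene features:      ", features["gene"])
print("transcript features:", features["transcript"])
print("transcripts with 0% hint support:", len(unsupported))
```

On a real file, the same function can be called with an open file handle, e.g. `summarize_augustus_gff(open("augustus.hints.gff"))`. Comparing the `gene` and `transcript` counts would show whether the 54k figure refers to genes or to transcripts (including alternative isoforms).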
That cannot be the only output file, can it? Can you check how many entries the FASTA (output) files contain?
Also: what was the exact command you executed?
Thank you for your reply!
I also have a FASTA file with amino acid sequences and another with coding sequences, both containing 54k genes.
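For reference, the FASTA counts can be obtained by counting header lines. The sketch below also collapses transcript IDs to gene IDs, under the assumption that headers look like `>g1.t1` (gene ID, then `.t`, then the isoform number); if the headers in your files follow a different pattern, the gene count will not be meaningful. The function name `count_transcripts_and_genes` and the example headers are hypothetical.

```python
def count_transcripts_and_genes(lines):
    """Return (number of distinct FASTA record IDs, number of distinct gene IDs).

    Assumes record IDs of the form '<gene>.t<N>', e.g. 'g1.t1'.
    """
    transcripts = set()
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            if fields:                      # ignore empty headers
                transcripts.add(fields[0])  # ID is the first token
    # Strip the trailing '.t<N>' to get the gene ID.
    genes = {tx.rsplit(".t", 1)[0] for tx in transcripts}
    return len(transcripts), len(genes)


# Invented example: two isoforms of g1 and one of g2.
example = [">g1.t1", "MKV", ">g1.t2", "MKL", ">g2.t1", "MAA"]
n_transcripts, n_genes = count_transcripts_and_genes(example)
print(n_transcripts, "transcripts from", n_genes, "genes")
```

If the two numbers differ on the real files, part of the 54k total comes from alternative transcripts rather than from separate genes.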
The command I executed is: