I am annotating the genome of a non-model bird using the Maker2 pipeline and I have a question about filtering the RNA input.
Since no transcriptome was available for my species, I downloaded RNA-seq reads for the closest relative from NCBI and assembled a transcriptome with Trinity. The assembly contains hundreds of thousands of contigs (275,967 "Trinity genes", 542,886 contigs, contig N50 = 687), which is clearly far more than the number of real genes (I would expect ~20,000), although I have heard that it is normal to get many more "Trinity genes" than real genes. My question is whether I should filter this transcriptome (by RPKM, for example) to reduce its size, or whether it is better to provide the entire set as RNA evidence for Maker2.
I started Maker2 with the entire Trinity output as the RNA input and it has been running for over a month. I suspect the slow tBLASTx step could be one of the bottlenecks, since it has to run through hundreds of thousands of transcripts. Is it recommended to reduce the size of the RNA input, or would that risk excluding some real transcripts from the annotation? I would like to speed things up for future annotations, but I don't want to sacrifice annotation quality.
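
To make the question concrete, the kind of filter I have in mind looks roughly like the sketch below. It is only an illustration, not something I have validated: it assumes I already have a per-transcript expression value (RPKM or similar) from mapping the reads back to the assembly, and the function names and thresholds (`min_expr`, `min_len`) are hypothetical choices of mine. It drops short or weakly expressed contigs and then keeps the longest remaining isoform per "Trinity gene", relying on the usual Trinity naming where the isoform is the trailing `_i<N>` of the contig ID.

```python
import re
from typing import Dict, Iterable, Iterator, List, Tuple


def read_fasta(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def trinity_gene_id(transcript_id: str) -> str:
    """Strip the trailing isoform suffix (_i1, _i2, ...) from a Trinity ID."""
    return re.sub(r"_i\d+$", "", transcript_id)


def filter_transcripts(
    records: Iterable[Tuple[str, str]],
    expression: Dict[str, float],
    min_expr: float = 1.0,   # assumed cutoff, not a recommended value
    min_len: int = 300,      # assumed cutoff, not a recommended value
) -> List[Tuple[str, str]]:
    """Keep the longest isoform per gene among contigs passing both cutoffs."""
    best: Dict[str, Tuple[str, str]] = {}
    for header, seq in records:
        tid = header.split()[0]
        # Contigs missing from the expression table are treated as unexpressed.
        if len(seq) < min_len or expression.get(tid, 0.0) < min_expr:
            continue
        gene = trinity_gene_id(tid)
        if gene not in best or len(seq) > len(best[gene][1]):
            best[gene] = (header, seq)
    return list(best.values())
```

In practice I would call it as `filter_transcripts(read_fasta(open("Trinity.fasta")), expression)` and write the result back out as FASTA. My worry is exactly what this makes obvious: any cutoff of this kind will also throw away some real but lowly expressed transcripts, so I would like to know whether that trade-off is acceptable for Maker2 evidence.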