Hello,
I am annotating the genome of a non-model bird using the Maker2 pipeline and I have a question about filtering the RNA input.
Since no transcriptome was available for my species, I downloaded RNA-seq reads from the closest relative on NCBI and built a transcriptome with Trinity. The output contains hundreds of thousands of contigs (275,967 "Trinity genes", 542,886 contigs, contig N50=687), which is clearly many more than the number of real genes (I would expect ~20,000), although I have heard that it is normal to get many more "Trinity genes" than there are real genes. My question is whether I should filter this transcriptome (by RPKM, for example) to reduce its size, or whether it is better to provide the entire set as RNA evidence for Maker2.
I started Maker2 with the entire Trinity output as the RNA input and it has been running for over a month, and I suspect the slow tBLASTx could be one of the bottlenecks as it runs through the thousands of transcripts. Is it recommended to reduce the size of the RNA library input or would that cause problems of excluding some real transcripts from the annotation? I would like to speed it up for future annotations but I don't want to sacrifice the quality of the annotation.
Thank you.
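For anyone who wants to try the expression-based filtering mentioned above, here is a minimal sketch. It assumes a tab-separated expression table with a header row containing `transcript_id` and `TPM` columns (the column names, the 1.0 cutoff, and both function names are illustrative assumptions, not part of Trinity or MAKER). Whether such filtering is a good idea for MAKER evidence is exactly the open question in this post; the code only shows the mechanics.

```python
import csv


def read_expression(handle, id_col="transcript_id", value_col="TPM"):
    """Return {transcript_id: expression} from a tab-separated table with a header.

    The column names are assumptions; adjust them to your quantification output.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[value_col]) for row in reader}


def filter_fasta(fasta_in, fasta_out, expression, min_value=1.0):
    """Copy FASTA records whose expression is >= min_value.

    Transcripts missing from the table are treated as unexpressed and dropped.
    Returns (records_kept, records_seen).
    """
    kept = 0
    total = 0
    keep = False
    for line in fasta_in:
        if line.startswith(">"):
            total += 1
            # The sequence ID is the first whitespace-delimited token of the header.
            parts = line[1:].split()
            seq_id = parts[0] if parts else ""
            keep = expression.get(seq_id, 0.0) >= min_value
            if keep:
                kept += 1
        if keep:
            # Writes the header and every sequence line of a kept record.
            fasta_out.write(line)
    return kept, total
```

Both functions take open file handles, so they can be called on real files opened with `open(...)` or on in-memory `io.StringIO` objects for a quick check before running on the full assembly.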
Use a genome-guided assembly then; it should produce far fewer false positives.
Whatever transcriptome you end up constructing should be provided as altest= and not est= in the maker_opts.ctl file if the transcriptome comes from a different species.
In case it is an evidence-based annotation (est2genome=1), I would suggest using both because they do not use the same e-value cutoff. MAKER doesn't create any gene models from the altest= option; it is just used to add UTRs. So if some of the transcripts map quite well, it would be a pity not to use them to create gene models.
I realise that I don't know whether altest= data is used to create hints for the ab initio predictors. That is something I would like to know.
Oh, I did not know that! I have been using altest= and assuming that it was being used to create hints. Sadly the closest RNA reads are from a fairly distant organism (~35 million years diverged), so I'm not sure est= would work well.
I checked on the MAKER mailing list, and it looks like altest= creates hints for the ab initio gene predictors (Carson said it is used to anchor gene prediction). It is also used to add UTRs. But he clearly says that it is not used to create gene models.
Thanks! I will look at guided assembly. Unfortunately, many of the species I am assembling do not have a close reference genome yet, or the one available is, I think, too fragmented for guided assembly.
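For reference, the options discussed in this thread sit in the EST evidence section of maker_opts.ctl. A minimal fragment for the cross-species case might look roughly like this; the FASTA file name is hypothetical, and the comments paraphrase what was said above rather than quote the MAKER documentation.

```
#-----EST Evidence (fragment of maker_opts.ctl; file name is hypothetical)
est= #left empty here: no transcripts from the annotated species itself
altest=relative_trinity.fasta #transcripts assembled from a different species
est2genome=0 #set to 1 to infer gene models directly from est= alignments
```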