5 weeks ago
liorglic ▴ 410

I am working on a (plant) genome annotation pipeline and would like some advice regarding repeat masking. My pipeline combines several ab initio gene prediction tools (Augustus, GlimmerHMM and SNAP) with transcript alignment (PASA), protein alignment (GenomeThreader) and gene liftover (Liftoff) as evidence, and finally generates gene models using EVidenceModeler.
I am wondering about the best way to go about repeat masking within this pipeline. Specifically, my questions are:

  1. When should I do it? Should the masking be done right at the beginning, before running any ab initio or alignment tool? Or should I generate gene models on the unmasked genome, and only at the end intersect them with repeat annotations and filter using a more sophisticated method?
  2. Should I apply hard or soft masking?
  3. What software should I use? I see for instance that EDTA can be used for TE detection, but should I also use a tool like RepeatMasker for other types of repetitive elements, or is this redundant in some way?
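
To make the second option in question 1 concrete, here is a minimal sketch of the kind of end-stage filter I have in mind: compute the fraction of each gene model's exonic bases that falls inside annotated repeats, and drop models above a cutoff. The function names, the half-open `(start, end)` interval representation and the `0.5` cutoff are all my own placeholders for illustration, not something taken from any of the tools above, and a real filter would probably need to consider more than simple overlap.

```python
def merge_intervals(intervals):
    """Merge overlapping or touching half-open (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def repeat_fraction(exons, repeats):
    """Fraction of exonic bases covered by repeats.

    Both arguments are lists of half-open (start, end) intervals
    on the same sequence.
    """
    exons = merge_intervals(exons)
    repeats = merge_intervals(repeats)
    total = sum(end - start for start, end in exons)
    if total == 0:
        return 0.0
    covered = 0
    for ex_start, ex_end in exons:
        for rp_start, rp_end in repeats:
            # Length of the overlap between this exon and this repeat (0 if disjoint).
            covered += max(0, min(ex_end, rp_end) - max(ex_start, rp_start))
    return covered / total


def filter_gene_models(models, repeats, max_fraction=0.5):
    """Keep gene IDs whose exons are at most max_fraction repeat-covered.

    models maps a gene ID to its list of exon intervals.
    max_fraction=0.5 is an arbitrary placeholder, not a recommended value.
    """
    return [
        gene_id
        for gene_id, exons in models.items()
        if repeat_fraction(exons, repeats) <= max_fraction
    ]


if __name__ == "__main__":
    # Toy coordinates, made up for illustration only.
    repeats = [(50, 120), (110, 150), (250, 260)]
    models = {"g1": [(0, 100), (200, 300)], "g2": [(60, 140)]}
    print(filter_gene_models(models, repeats))
```

In this toy example `g2` lies entirely inside a repeat and is dropped, while `g1` is only partly covered and is kept. My question is whether this kind of post hoc filtering is preferable to masking up front.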

I should mention that my main focus is protein coding genes, and I'm not so interested in TE annotation and classification at this point.
Any suggestion or advice is welcome. Thank you!

