Repeat masking for genome annotation
2.8 years ago
liorglic ★ 1.4k

I am working on a (plant) genome annotation pipeline and would like some advice regarding repeat masking. My pipeline runs several ab initio gene prediction tools (Augustus, GlimmerHMM and SNAP), adds transcript alignment (PASA) and protein alignment (GenomeThreader) evidence plus gene liftover (Liftoff), and finally generates gene models using EVidenceModeler.
I am wondering about the best way to go about repeat masking within this pipeline. Specifically, my questions are:

  1. When should I do it? Should the masking be done right at the beginning, before running any ab initio or alignment tool? Or should I generate gene models on the unmasked genome, intersect them with repeat annotations only at the end, and filter using a more sophisticated method?
  2. Should I apply hard or soft masking?
  3. What software should I use? I see for instance that EDTA can be used for TE detection, but should I also use a tool like RepeatMasker for other types of repetitive elements, or is this redundant in some way?

I should mention that my main focus is protein coding genes, and I'm not so interested in TE annotation and classification at this point.
Any suggestion or advice is welcome. Thank you!

masking annotation repeat • 1.4k views

@liorglic Curious whether you have settled on tools at this point?


Still not much of an expert, but I think masking at the beginning is the way to go. Running EDTA and RepeatMasker should do the trick, but honestly I'm not sure my advice is very reliable...


Thanks, I appreciate your feedback. For GenomeThreader (GT), I was wondering how you handled its speed. It seems very slow. Is there a way to speed it up? I couldn't find much information online about speeding up GT.


I think the best you can do is just slice the genome into windows of fixed size and let GT work on each of them separately, then combine everything at the end. This way you can parallelize the work. There may also be newer/better alternatives to GT, but I am not aware of them.
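For anyone who wants to try the windowing idea, here is a rough Python sketch of the bookkeeping, not a tested pipeline. The function names, the `name:start-end` chunk ID format, and the optional overlap between windows are my own assumptions for illustration; none of this is part of GT itself. The overlap is there so that a gene spanning a window boundary can still fall entirely inside at least one chunk.

```python
from typing import Iterator, Tuple


def iter_windows(seq_len: int, window: int, overlap: int = 0) -> Iterator[Tuple[int, int]]:
    """Yield 0-based, half-open (start, end) windows covering [0, seq_len)."""
    if window <= overlap:
        raise ValueError("window must be larger than overlap")
    step = window - overlap
    start = 0
    while start < seq_len:
        end = min(start + window, seq_len)
        yield start, end
        if end == seq_len:
            break  # last window reached the end of the sequence
        start += step


def slice_sequence(name: str, seq: str, window: int, overlap: int = 0) -> Iterator[Tuple[str, str]]:
    """Yield (chunk_id, chunk_seq); chunk_id records the 0-based window start (hypothetical format)."""
    for start, end in iter_windows(len(seq), window, overlap):
        yield f"{name}:{start}-{end}", seq[start:end]


def lift_to_genome(chunk_id: str, local_start: int, local_end: int) -> Tuple[str, int, int]:
    """Map 1-based inclusive coordinates within a chunk (GFF style) back to genome coordinates."""
    name, span = chunk_id.rsplit(":", 1)
    offset = int(span.split("-")[0])  # 0-based start of the chunk in the genome
    return name, offset + local_start, offset + local_end


if __name__ == "__main__":
    # Toy example: a 10 bp "chromosome", 4 bp windows, 1 bp overlap.
    for chunk_id, chunk_seq in slice_sequence("chr1", "ACGTACGTAC", window=4, overlap=1):
        print(chunk_id, chunk_seq)
    # A feature at positions 1-2 of chunk chr1:3-7 maps back to the genome:
    print(lift_to_genome("chr1:3-7", 1, 2))
```

Each chunk would be written to its own FASTA file and aligned as a separate job. When merging, shift every alignment back with something like `lift_to_genome` and drop the duplicates that show up in the overlap regions. For real data the overlap should presumably be longer than the longest gene you expect; otherwise genes at window boundaries will still be truncated.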
