Question

Should I use Dfam or a custom repetitive-element library (from PlantRep) as the repeat database for RepeatMasker on a Linux machine?

4 weeks ago

Vijith ▴ 30

Recently, I completed the assembly of a plant genome; the species is a monocot angiosperm. After reading the informative responses by @SES and @Andrzej Zielezinski in a post, I was convinced of the importance of masking repetitive elements before predicting genes. I have decided to use RepeatMasker for this purpose, as mentioned in that post. Now, to the question:

I read that a repeat database needs to be installed and that Dfam is an open database of TEs: "a minimal version of Dfam 3.8 ( root partition ) can be downloaded automatically by the configure script. Additional taxa partitions may be downloaded and configured at any time." Incidentally, I came across PlantRep, a database of plant repetitive elements with annotated repeats from 459 plant genomes, 70 of which are monocots. So, which is better: downloading Dfam 3.8, or specifying a custom library that contains the PlantRep repetitive elements in FASTA format?

Any input is highly appreciated.

genome sequence repeatmasker blast • 564 views

updated 27 days ago by b.contreras.moreira ▴ 170 • written 4 weeks ago by Vijith ▴ 30

score 1 · Answer 1 · 2024-03-26

In https://doi.org/10.1002/tpg2.20143 we found that RepeatMasker underestimated repeat content in plants when using REdat as the repeat database. The database was the limiting factor, since results improved with our own custom database, nrTEplants. More generally, I would expect RepeatMasker to work well if your genome of interest contains repeats similar to those in your reference database, be it PlantRep or another one. Note that we did not test RepBase, which was the recommended database, as it required a subscription at the time. It seems the current academic use agreement might be an option for you.
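
One practical way to see which database suits your genome is to run RepeatMasker once per library and compare how much sequence ends up masked. The Python sketch below only builds the command lines and measures the soft-masked fraction of a FASTA file; it is an illustration, not a tested pipeline. The file names (genome.fa, plantrep.fa), the output directories and the clade name "liliopsida" are assumptions, and the -species run only works if the corresponding Dfam partition is installed.

```python
from typing import List, Optional


def repeatmasker_cmd(genome: str, out_dir: str, lib: Optional[str] = None,
                     species: Optional[str] = None, threads: int = 4) -> List[str]:
    """Build a RepeatMasker command line for a custom library OR a Dfam clade."""
    # Exactly one repeat source must be given: -lib (custom FASTA) or -species.
    if (lib is None) == (species is None):
        raise ValueError("give either lib or species, not both and not neither")
    # -xsmall soft-masks (lower case), -gff also writes a GFF file,
    # -pa sets parallel jobs, -dir sets the output directory.
    cmd = ["RepeatMasker", "-pa", str(threads), "-xsmall", "-gff", "-dir", out_dir]
    cmd += ["-lib", lib] if lib is not None else ["-species", species]
    return cmd + [genome]


def masked_fraction(fasta_path: str) -> float:
    """Fraction of bases that are soft-masked (lower case) in a FASTA file."""
    masked = total = 0
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):  # skip header lines
                continue
            seq = line.strip()
            total += len(seq)
            masked += sum(1 for base in seq if base.islower())
    return masked / total if total else 0.0


# Hypothetical usage (needs RepeatMasker on PATH; paths are assumptions):
# import subprocess
# subprocess.run(repeatmasker_cmd("genome.fa", "rm_plantrep", lib="plantrep.fa"), check=True)
# subprocess.run(repeatmasker_cmd("genome.fa", "rm_dfam", species="liliopsida"), check=True)
# print(masked_fraction("rm_plantrep/genome.fa.masked"))
# print(masked_fraction("rm_dfam/genome.fa.masked"))
```

A clearly lower masked fraction with one library suggests that it is missing repeat families present in your genome, which is the kind of underestimation described above.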

Anyway, in that study we concluded that repeat masking by k-mer analysis works well in plants: it is very fast and does not require a database. You can save a lot of time if you don't really need to annotate the repeats and masking them to guide gene annotation is sufficient. Note that you can still annotate them by sequence similarity afterwards, but this is optional. See our protocol for this at https://github.com/Ensembl/plant-scripts/tree/master/repeats
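
To give an intuition for database-free masking, here is a toy Python sketch that soft-masks every base covered by a frequently occurring k-mer. It is not the algorithm used in the protocol linked above, only an illustration of the idea; the function name, k and min_count are arbitrary assumptions, and it ignores reverse complements.

```python
from collections import Counter


def kmer_soft_mask(seq: str, k: int = 4, min_count: int = 3) -> str:
    """Lower-case every base covered by a k-mer seen at least min_count times."""
    upper = seq.upper()
    n_kmers = len(upper) - k + 1
    # Count every overlapping k-mer in the sequence (forward strand only).
    counts = Counter(upper[i:i + k] for i in range(n_kmers))
    mask = [False] * len(upper)
    for i in range(n_kmers):
        if counts[upper[i:i + k]] >= min_count:
            # Flag all k positions spanned by this frequent k-mer.
            for j in range(i, i + k):
                mask[j] = True
    return "".join(b.lower() if m else b for b, m in zip(upper, mask))


print(kmer_soft_mask("AAAAAAGCTG", k=3, min_count=3))  # prints aaaaaaGCTG
```

Real tools work on whole genomes with much longer k-mers and a statistical model for picking the threshold, but the output has the same form: a soft-masked sequence that a gene predictor can use directly.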

A highly cited protocol for the annotation of plant repeats is https://github.com/oushujun/EDTA, which is described in https://doi.org/10.1186/s13059-019-1905-y

Hope this helps