Question

Detecting transposable elements in both assembled genomes and raw reads

8.6 years ago

Pryce Michener ▴ 10

I'm trying to detect and identify transposable elements in my plant genomes, and I'm having trouble finding the best programs and pipelines to use. The major reviews of these programs all seem to have come out 5-8 years ago, and I haven't been able to find anything covering programs released since then. Does anyone have experience finding TEs and could point me in the right direction?

The current plan was to use Tedna and RepeatModeler to detect TEs from our raw read files, but we have de novo assembled genomes that I would like to investigate as well. I would like to run a few different programs and get a good consensus so that I can eliminate false positives that individual programs might put out.

transposable-elements TEs transposons

updated 19 months ago by Ram 43k • written 8.6 years ago by Pryce Michener ▴ 10

Ram · Answer 1 · 2015-09-15

I can definitely help point you in the right direction, but it would also help to know some background on what you are trying to accomplish. In general, TE-finding programs are based on some combination of 1) mathematical repeat patterns (k-mer frequency), 2) similarity to a reference database, 3) clustering based on a self-comparison of the data set, or 4) structural features (LTRs, TIRs, etc.). I would say those approaches are listed roughly in increasing order of how complex they are to perform, and also of how biologically relevant they are.
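As a rough illustration of approach 1, here is a minimal Python sketch of k-mer frequency counting over a handful of toy reads. The function names (`count_kmers`, `frequent_kmers`), the reads, and the thresholds are all made up for illustration; this is not how any particular TE program is implemented, and it ignores reverse complements and sequencing errors.

```python
from collections import Counter


def count_kmers(sequences, k):
    """Count every k-mer across the sequences, skipping k-mers that contain N."""
    counts = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - k + 1):
            kmer = seq[i:i + k]
            if "N" not in kmer:
                counts[kmer] += 1
    return counts


def frequent_kmers(counts, min_count):
    """Keep only k-mers seen at least min_count times (candidate repeat seeds)."""
    return {kmer: n for kmer, n in counts.items() if n >= min_count}


# Toy reads (hypothetical data): the first two share a repeated motif.
reads = ["ACGTACGTAC", "TTACGTACGG", "GGGCCCAAAT"]
counts = count_kmers(reads, k=5)
repeats = frequent_kmers(counts, min_count=3)
print(repeats)
```

On real data you would use a much larger k and a threshold scaled to sequencing depth, and the over-represented k-mers would only be a starting point for further analysis.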

Transposome was designed for characterizing TE abundance/diversity from raw reads, and it performs very well in terms of accuracy on plant genomes (an example with maize is presented in the paper). I'm the author, so I can answer any questions about its usage. Transposome is based on a clustering approach, with annotations assigned from a repeat database.

For identifying TEs from an assembled genome, you need to think about what type of TE you are interested in. I use many different programs for this task, each designed for one specific type of TE (based on its structural features). Programs like Recon and RepeatModeler are based on k-mer frequencies, and the goal of RepeatModeler is to try to construct a TE from k-mers. The result is a contig representing the most frequently occurring parts of the element in the genome. Usually this will be the internal coding region, because it is more conserved than the flanking repeats. What you get is not a real transposon with a single locus; rather, it is just a representative of the repeats found in the genome. This approach can still be useful if you know exactly what you are trying to find out (e.g., a quick survey or a quick comparison of species). If you want a high-quality reference set of TEs for your genome, then I would strongly warn against this approach, because the output is not composed of real transposons (and is therefore not particularly useful for evolutionary analyses).
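To make the "representative, not a real transposon" point concrete, here is a small Python sketch. It is a deliberate simplification: it takes a per-column majority vote over pre-aligned toy copies, which is an assumption for illustration and not how RepeatModeler builds its output. The function name `majority_consensus` and the sequences are hypothetical.

```python
from collections import Counter


def majority_consensus(aligned_copies):
    """Return the per-column majority base of equal-length, pre-aligned sequences."""
    if not aligned_copies:
        raise ValueError("need at least one sequence")
    length = len(aligned_copies[0])
    if any(len(s) != length for s in aligned_copies):
        raise ValueError("all sequences must have the same length")
    consensus = []
    for column in zip(*aligned_copies):
        # Pick the most common base in this alignment column.
        base, _count = Counter(column).most_common(1)[0]
        consensus.append(base)
    return "".join(consensus)


# Four toy copies of one element (hypothetical data), each carrying
# a single private mutation at a different position.
copies = [
    "ACGTACGTAA",
    "ACGAACGTAC",
    "ACGTACCTAC",
    "TCGTACGTAC",
]
consensus = majority_consensus(copies)
print(consensus)
print(consensus in copies)
```

In this toy case the majority sequence does not match any of the four input copies, which is the sense in which such a sequence summarizes a family rather than describing an element at a specific locus.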