Question: Best Way To Identify Families From Individual Transposable Element Copies?
David M wrote, 7.6 years ago:

I'm currently involved in an annotation project focused on identifying as many repeats as possible in a recently assembled (first-draft) genome. Without a library of repeats/TEs to start from, I'm using a number of de novo detection pipelines (RepeatModeler, REPET, etc.) to create a library of family consensus sequences.

I'd also like to take advantage of tools that search for individual TE copies based on structure rather than by all-by-all alignment, for example LTRharvest or LTR_STRUC (there are many more). Many of these tools report the individual copies of repeats in the genome rather than families or consensus sequences.

So: What is the best way to get families/consensus sequences from these individual copies? What tools could I use to cluster the sequences and extract common groups?

Tags: genome repeats
Larry_Parnell (Boston, MA, USA) wrote, 7.6 years ago:

Once you have the sequences of the individual repeat elements, you could align them with any multiple alignment tool (e.g., CLUSTAL) that will also generate a consensus sequence. Alternatively, you could align those sequences with the tools used to align sequencing reads, treating each instance of the repeat as an individual read. This second option will certainly give a consensus sequence, since that is the objective of aligning reads.
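To make the multiple-alignment route concrete, here is a minimal Python sketch of a majority-rule consensus computed from sequences that have already been aligned (for instance, by exporting the alignment from your tool of choice). The function name `majority_consensus`, the `min_fraction` cutoff and the toy copies are assumptions made for illustration; real alignment tools apply their own consensus rules.

```python
from collections import Counter


def majority_consensus(aligned, min_fraction=0.5, gap="-", ambiguous="N"):
    """Column-wise majority-rule consensus of equal-length aligned sequences.

    Columns where a gap is the most common character are dropped; columns
    where no base exceeds min_fraction are reported as `ambiguous`.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must all have the same length")

    consensus = []
    for column in zip(*aligned):
        counts = Counter(base.upper() for base in column)
        base, count = counts.most_common(1)[0]
        if base == gap:
            # Most copies lack this position, so leave it out of the consensus.
            continue
        consensus.append(base if count / len(column) > min_fraction else ambiguous)
    return "".join(consensus)


# Hypothetical aligned copies of one element (made-up data).
copies = [
    "ACGT-ACGTTA",
    "ACGTTACGTTA",
    "ACGT-ACCTTA",
    "ACGT-ACGTAA",
]
print(majority_consensus(copies))  # ACGTACGTTA
```

A strict majority is used here so that evenly split columns come out as N rather than being decided by input order; a looser or stricter cutoff may suit divergent families better.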

An important question in this work, though, is where to draw boundaries and split or group repeats into families/sub-families. Clearly, one way to do this is by the presence or absence of functional features, such as intact LTRs or whether an element is autonomous or non-autonomous. The best advice I can give is to consider how this was done for a near relative of your organism. Thus, if you're working on sorghum or teosinte, I'd study how the repeats were classified in maize (plus other papers by these authors). Alternatively, you could take representatives of a type of repeat, align them, build a dendrogram or phylogram, and decide based on sequence differences where and how to divide the tree and the aligned groups into X sub-classes.
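As a rough illustration of dividing aligned copies by sequence difference, the sketch below computes pairwise p-distances (the fraction of mismatching columns) and groups copies whose distance falls under a cutoff. Grouping by connected components at a distance cutoff gives the same groups as cutting a single-linkage dendrogram at that height. The names `p_distance` and `single_linkage_groups`, the 0.2 cutoff and the toy alignment are all assumptions; a proper phylogenetic method and a biologically justified cutoff would be needed in practice.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of mismatching columns, ignoring columns gapped in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def single_linkage_groups(aligned, max_distance):
    """Group aligned copies (dict: name -> sequence) by single linkage.

    Two copies end up in the same group if a chain of pairwise distances
    <= max_distance connects them.
    """
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for a, b in combinations(names, 2):
        if p_distance(aligned[a], aligned[b]) <= max_distance:
            parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Hypothetical aligned copies (made-up data).
aligned = {
    "copy1": "ACGTACGTAC",
    "copy2": "ACGTACGTAT",
    "copy3": "ACGAACGTAC",
    "copy4": "TTGTCCGAAG",
    "copy5": "TTGTCCGAAC",
}
print(single_linkage_groups(aligned, max_distance=0.2))
# [['copy1', 'copy2', 'copy3'], ['copy4', 'copy5']]
```

Single linkage tends to chain loosely related copies together, so trying a few cutoffs and checking the groups against known structural features is probably worthwhile.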


David M replied, 7.6 years ago:

As it currently stands, I have thousands of TE copies, and the structural features (LTRs, target site duplications, etc.) haven't yet been characterized. I was hoping for suggestions on a clustering program (such as those in Vmatch, BLASTClust or UCLUST) that could tentatively divide the sequences into families, at which point I could determine a consensus for each.
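For a feel of what a tentative, alignment-free division into families might look like, here is a small Python sketch of greedy centroid clustering. It is not the algorithm of Vmatch, BLASTClust or UCLUST; it only borrows the general idea of assigning each sequence to the first sufficiently similar representative, and it swaps in k-mer Jaccard similarity as a crude stand-in for alignment identity. The function names, `k=4`, the 0.5 threshold and the toy sequences are all assumptions.

```python
def kmers(seq, k=4):
    """Set of all k-mers in a sequence (forward strand only)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=4):
    """Greedy centroid clustering of seqs (dict: name -> sequence).

    Sequences are visited longest first; each joins the first centroid whose
    k-mer Jaccard similarity is >= threshold, otherwise it founds a new cluster.
    Returns a list of (centroid_name, [member_names]).
    """
    order = sorted(seqs, key=lambda name: (-len(seqs[name]), name))
    centroids = []  # entries: (centroid name, centroid k-mer set, member list)
    for name in order:
        profile = kmers(seqs[name], k)
        for _, centroid_profile, members in centroids:
            if jaccard(profile, centroid_profile) >= threshold:
                members.append(name)
                break
        else:
            centroids.append((name, profile, [name]))
    return [(name, members) for name, _, members in centroids]


# Hypothetical TE copies (made-up data); te2 and te4 are truncated copies.
example = {
    "te1": "ACGTACGTTAGCTAGCTAGGATCC",
    "te2": "ACGTACGTTAGCTAGCTAGGATC",
    "te3": "TTTTGGGGCCCCAAAATTTTGG",
    "te4": "TTTTGGGGCCCCAAAATTTT",
}
for centroid, members in greedy_cluster(example):
    print(centroid, members)
```

Note that this sketch ignores reverse-complement matches, which matter for TE copies because they insert on both strands; the dedicated clustering tools are the better choice for thousands of real copies.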

Larry_Parnell replied, 7.6 years ago:

When I was doing similar work in Arabidopsis, I spoke with biologists who told me what the characteristics of the LTRs and other hallmarks of TEs are, so that I could begin to think of a process for identifying the different types. Without those features characterized, you may have to fall back on the characteristics known from a related species. I think all three clustering programs would work well, and it would be worth comparing their results before assigning families.
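One simple way to compare the output of two clustering programs is the Rand index: the fraction of sequence pairs on which both clusterings agree (both place the pair together, or both place it apart). The sketch below assumes each program's output has already been parsed into a dict mapping copy name to cluster label; the function name `rand_index` and the example labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Rand index between two clusterings given as dicts: name -> cluster label."""
    if set(labels_a) != set(labels_b):
        raise ValueError("both clusterings must cover the same sequences")
    pairs = list(combinations(sorted(labels_a), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Hypothetical cluster assignments from two different programs.
tool_a = {"copy1": "A", "copy2": "A", "copy3": "A", "copy4": "B", "copy5": "B"}
tool_b = {"copy1": "x", "copy2": "x", "copy3": "y", "copy4": "z", "copy5": "z"}
print(rand_index(tool_a, tool_b))  # 0.8
```

A value near 1 suggests the programs largely agree on family boundaries; pairs on which they disagree (here, copy3 versus copy1 and copy2) are natural candidates for manual inspection.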