Question

Best Way To Identify Families From Individual Transposable Element Copies?

3

12.5 years ago

David M ▴ 580

I'm currently involved in an annotation project focused on identifying as many repeats as possible in a recently assembled (first-draft) genome. Without a library of repeats/TEs to start from, I'm using a number of de novo detection pipelines (RepeatModeler, REPET, etc.) to create a library of family consensus sequences.

I'd also like to take advantage of tools that search for individual TE copies based on structure rather than all-by-all alignment, such as LTRharvest or LTR_STRUC (there are many more). Many of these tools report the individual copies of repeats in the genome rather than families or consensus sequences.

So: What is the best way to get families/consensus sequences from these individual copies? What tools could I use to cluster the sequences and extract common groups?

repeats genome • 3.4k views

ADD COMMENT • link updated 10.8 years ago by Biostar 20 • written 12.5 years ago by David M ▴ 580

score 2 · Answer 1 · 2011-10-27

2

12.5 years ago

Larry_Parnell 16k

Once you have the sequences of the individual repeat elements, you could align them with any multiple alignment tool (e.g., CLUSTAL) that also generates a consensus sequence. Alternatively, you could align those sequences with tools used to align sequencing reads, treating each instance of the repeat as an individual read. The second option will certainly give a consensus sequence, since that is the objective of aligning reads.
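To make the consensus step concrete, here is a minimal Python sketch of a majority-rule consensus computed from sequences that have already been aligned (gapped to equal length) by a multiple alignment tool. The function name `majority_consensus`, the `min_fraction` threshold and the use of `N` for ambiguous columns are illustrative assumptions, not the behaviour of CLUSTAL or any other specific program.

```python
from collections import Counter


def majority_consensus(aligned, gap="-", min_fraction=0.5, ambiguous="N"):
    """Majority-rule consensus of pre-aligned, equal-length sequences.

    Hypothetical helper: thresholds and symbols are assumptions for illustration.
    """
    if not aligned:
        raise ValueError("need at least one aligned sequence")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all aligned sequences must have the same length")

    consensus = []
    for column in zip(*aligned):
        column = [base.upper() for base in column]
        bases = [base for base in column if base != gap]
        # Drop columns where more than half of the copies have a gap.
        if len(column) - len(bases) > len(column) / 2:
            continue
        base, count = Counter(bases).most_common(1)[0]
        # Emit the ambiguity symbol when no base reaches the threshold.
        consensus.append(base if count / len(bases) >= min_fraction else ambiguous)
    return "".join(consensus)


copies = ["ACGT-A", "ACGTTA", "ACCT-A", "A-GT-A"]
print(majority_consensus(copies))
```

In practice the aligned input would come from the alignment tool's output file; the list literal above only stands in for that.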

An important question in this work, though, is where to draw the boundaries when splitting or grouping repeats into families/sub-families. One clear way to do this is by the presence or absence of functional features, such as intact LTRs, or by whether an element is autonomous or non-autonomous. The best advice I can give is to consider how this was done for a near relative of your organism. Thus, if you're working on sorghum or teosinte, I'd study how the repeats were classified in maize (plus other papers by these authors). Alternatively, you could take representatives of a type of repeat, align them, build a dendrogram or phylogram, and decide based on sequence differences where and how to divide the tree and the aligned groups into X sub-classes.
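As a rough illustration of dividing aligned copies by sequence difference, the sketch below computes pairwise p-distances (fraction of mismatching, non-gap positions) and groups copies whose distance is at or below a cutoff. Linking pairs this way gives the same groups as cutting a single-linkage dendrogram at that height. The names `p_distance` and `single_linkage_groups`, the toy sequences and the cutoff value are all assumptions chosen for illustration; a real analysis would more likely use a proper tree-building package.

```python
def p_distance(a, b, gap="-"):
    """Fraction of mismatches over positions where neither sequence has a gap."""
    compared = mismatches = 0
    for x, y in zip(a, b):
        if x == gap or y == gap:
            continue
        compared += 1
        if x != y:
            mismatches += 1
    return mismatches / compared if compared else 1.0


def single_linkage_groups(aligned, cutoff):
    """Group aligned sequences (dict of name -> sequence) by a distance cutoff."""
    names = list(aligned)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if p_distance(aligned[a], aligned[b]) <= cutoff:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in groups.values())


toy = {
    "te1": "ACGTACGTAC",
    "te2": "ACGTACGTAT",
    "te3": "TTGGCCAATT",
    "te4": "TTGGCCAATA",
}
print(single_linkage_groups(toy, cutoff=0.2))
```

Raising or lowering `cutoff` is the code equivalent of deciding where to cut the tree, so it is worth trying several values and checking the groups against what is known from a related species.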

ADD COMMENT • link 12.5 years ago by Larry_Parnell 16k

0

As it currently stands, I have thousands of TE copies, and the structural features (LTRs, target site duplications, etc.) haven't yet been characterized. I was hoping for suggestions on a clustering program (such as those in Vmatch, BLASTClust or UCLUST) that could tentatively divide the sequences into families, at which point I could determine a consensus for each.
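For a sense of what such a tentative division looks like, here is a small Python sketch of greedy centroid clustering, the general strategy behind tools in the UCLUST style. It is not a reimplementation of any of the programs named above: similarity is approximated by the Jaccard index of shared k-mers rather than by alignment identity, and the function names, `k` and `threshold` are assumptions made for illustration.

```python
def kmers(seq, k=4):
    """Set of all k-mers in a sequence (upper-cased)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Shared k-mers divided by total distinct k-mers; 0.0 if both are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def greedy_cluster(seqs, threshold=0.5, k=4):
    """Assign each sequence to its most similar centroid, or start a new cluster.

    seqs is a dict of name -> sequence. Sequences are processed longest first,
    so the longest copy of each tentative family becomes its centroid.
    """
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    centroids = []  # list of (centroid name, k-mer set)
    clusters = {}   # centroid name -> member names
    for name in order:
        profile = kmers(seqs[name], k)
        best_name, best_sim = None, 0.0
        for centroid_name, centroid_profile in centroids:
            sim = jaccard(profile, centroid_profile)
            if sim > best_sim:
                best_name, best_sim = centroid_name, sim
        if best_name is not None and best_sim >= threshold:
            clusters[best_name].append(name)
        else:
            centroids.append((name, profile))
            clusters[name] = [name]
    return clusters


copies = {
    "a1": "ACGTACGTACGTACGTACGT",
    "a2": "ACGTACGTACGTACGTACG",
    "b1": "TTTTGGGGCCCCAAAATTTT",
    "b2": "TTTTGGGGCCCCAAAATT",
}
print(greedy_cluster(copies))
```

Each resulting cluster could then be aligned and collapsed into a consensus. Because greedy clustering depends on input order and threshold, the families it produces should be treated as provisional.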

ADD REPLY • link 12.5 years ago by David M ▴ 580

0

When I was doing similar work in Arabidopsis, I spoke with biologists who told me what the characteristics of the LTRs and other hallmarks of TEs are, so that I could begin to think of a process for identifying the different types. Without those things characterized, you may have to fall back on the characteristics of a related species. I think all three clustering programs would be good choices, and it would be worth comparing their results before assigning families.
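One simple way to compare the output of two clustering programs is the Rand index: the fraction of sequence pairs on which both programs agree (either both place the pair in the same cluster, or both separate it). The sketch below assumes each program's result has already been parsed into a dict mapping sequence name to cluster label; the parsing step, the function name `rand_index` and the toy labels are hypothetical.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sequence pairs grouped consistently by two clusterings.

    Only sequences present in both label dicts are compared.
    """
    names = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(names, 2))
    if not pairs:
        raise ValueError("need at least two sequences shared by both clusterings")
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


program_1 = {"te1": "A", "te2": "A", "te3": "B", "te4": "B"}
program_2 = {"te1": 1, "te2": 1, "te3": 2, "te4": 3}
print(round(rand_index(program_1, program_2), 3))
```

A value near 1 suggests the programs largely agree on family boundaries; pairs on which they disagree are good candidates for manual inspection.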

ADD REPLY • link 12.5 years ago by Larry_Parnell 16k