I'm currently involved in an annotation project focused on identifying as many repeats as possible in a recently assembled (first-draft) genome. Without a library of repeats/TEs to start from, I'm using a number of de novo detection pipelines (RepeatModeler, REPET, etc.) to create a library of family consensus sequences.
I'd also like to take advantage of tools that search for individual TE copies based on structure rather than by all-by-all alignment, for example LTRharvest or LTR_STRUC (there are many more). Many of these tools report the individual copies of repeats in the genome, rather than families or consensus sequences.
So: What is the best way to get families/consensus sequences from these individual copies? What tools could I use to cluster the sequences and extract common groups?
As it currently stands, I have thousands of TE copies, and the structural features (LTRs, target site duplications, etc.) haven't yet been characterized. I was hoping for suggestions on a clustering program (such as those in Vmatch, BLASTClust or UCLUST) that could tentatively divide the sequences into families, at which point I could determine a consensus for each family.
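To make the "tentatively divide sequences into families" step concrete, here is a toy sketch of greedy centroid-based clustering, loosely in the spirit of what a tool like UCLUST does. It is not how any of those programs is implemented: it uses k-mer Jaccard similarity in place of alignment identity, ignores reverse-complement copies, and the function names, `k` and `threshold` are all hypothetical choices for illustration.

```python
def kmers(seq, k=8):
    """Return the set of k-mers in a sequence (case-insensitive)."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two k-mer sets (0.0 if either is empty)."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def greedy_cluster(seqs, k=8, threshold=0.5):
    """Greedy centroid clustering of a {name: sequence} dict.

    Sequences are visited longest first. Each one joins the first
    centroid it is similar enough to, otherwise it becomes a new
    centroid. Returns {centroid_name: [member_names]}.
    """
    order = sorted(seqs, key=lambda name: len(seqs[name]), reverse=True)
    centroids = []  # list of (centroid name, centroid k-mer set)
    clusters = {}
    for name in order:
        ks = kmers(seqs[name], k)
        for cname, cks in centroids:
            if jaccard(ks, cks) >= threshold:
                clusters[cname].append(name)
                break
        else:
            # No centroid was close enough: start a new tentative family.
            centroids.append((name, ks))
            clusters[name] = [name]
    return clusters
```

Each resulting cluster would then be a candidate family whose members could be passed to a multiple aligner to derive a consensus. On thousands of full-length copies a dedicated tool will be far faster and more sensitive than this sketch.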
When I was doing similar work in Arabidopsis, I spoke with biologists who described the characteristics of the LTRs and the other hallmarks of TEs, so that I could begin to work out a process for identifying the different types. Without those features characterized, you may have to fall back on the characteristics known from a related species. I think all three clustering programs would be reasonable choices, and it would be worth comparing their results before assigning families.
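One simple way to compare the results of two clustering programs is to check, for every pair of TE copies, whether both programs agree on putting them in the same cluster or in different clusters (the Rand index). A minimal sketch, assuming each tool's output has already been parsed into a `{copy_name: cluster_label}` dict; the parsing step and the name `rand_index` are hypothetical and not part of any of the tools mentioned:

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of sequence pairs on which two clusterings agree.

    labels_a and labels_b map sequence names to cluster labels.
    Only sequences present in both clusterings are compared.
    Returns a value from 0.0 to 1.0 (1.0 = identical partitions).
    """
    names = sorted(set(labels_a) & set(labels_b))
    agree = 0
    total = 0
    for x, y in combinations(names, 2):
        same_a = labels_a[x] == labels_a[y]
        same_b = labels_b[x] == labels_b[y]
        if same_a == same_b:
            agree += 1
        total += 1
    # With fewer than two shared sequences there is nothing to disagree on.
    return agree / total if total else 1.0
```

The cluster labels themselves do not need to match between tools, since only co-membership is compared. With thousands of copies the number of pairs grows quadratically, so this is better run on a sample or replaced with a contingency-table formulation.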