Best Way To Identify Families From Individual Transposable Element Copies?
12.5 years ago
David M ▴ 580

I'm working on an annotation project that aims to identify as many repeats as possible in a recently assembled first-draft genome. With no existing library of repeats/TEs to start from, I'm using several de novo detection pipelines (RepeatModeler, REPET, etc.) to build a library of family consensus sequences.

I'd also like to take advantage of tools that find individual TE copies by their structure rather than by all-by-all alignment, for example LTRharvest or LTR_STRUC (there are many more). Many of these tools report the individual copies of repeats in the genome rather than families or consensus sequences.

So: What is the best way to get families/consensus sequences from these individual copies? What tools could I use to cluster the sequences and extract common groups?

repeats genome • 3.4k views
12.5 years ago

Once you have the sequences of the individual repeat copies, you can align them with any multiple alignment tool (e.g. CLUSTAL) and derive a consensus sequence from the alignment. Alternatively, you could try aligning them with tools built for sequencing reads, treating each copy of the repeat as if it were an individual read. This second option should also give you a consensus sequence, since producing one is the purpose of aligning reads.
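
To make the consensus step concrete, here is a minimal sketch of a majority-rule consensus taken from sequences that have already been aligned (by CLUSTAL or anything else). The function name `consensus` and the `min_fraction` parameter are my own choices for illustration, not part of any of the tools mentioned, and real consensus builders handle ambiguity codes and gaps more carefully than this.

```python
from collections import Counter


def consensus(aligned, min_fraction=0.5, gap="-", ambiguous="N"):
    """Majority-rule consensus of equal-length, already-aligned sequences.

    Hypothetical helper for illustration only.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must all have the same length")

    out = []
    for column in zip(*aligned):
        counts = Counter(c.upper() for c in column)
        symbol, n = counts.most_common(1)[0]
        if symbol == gap:
            # The most common state is a gap, so the column is dropped.
            continue
        # Keep the base only if it is supported by more than min_fraction
        # of the copies; otherwise emit an ambiguity symbol.
        out.append(symbol if n / len(column) > min_fraction else ambiguous)
    return "".join(out)


copies = ["ACGT-A", "ACGTTA", "ACCT-A"]
print(consensus(copies))  # ACGTA
```

Raising `min_fraction` makes the consensus more conservative, at the cost of more `N` positions in divergent families.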

An important question in this work, though, is where to draw the boundaries when splitting or grouping repeats into families and sub-families. One obvious criterion is the presence or absence of functional features, such as intact LTRs or whether the element is autonomous or non-autonomous. The best advice I can give is to look at how this was done for a close relative of your organism. If you're working on sorghum or teosinte, for example, I'd study how the repeats were classified in maize (plus other papers by the same authors). Alternatively, you could align representatives of one type of repeat, build a dendrogram or phylogram, and decide from the sequence differences where to cut the tree, and with it the aligned groups, into X sub-classes.
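
The "build a tree and decide where to cut it" idea can be sketched roughly as follows. This is plain average-linkage clustering on p-distances, stopped once the two closest groups are further apart than a cutoff, which amounts to cutting the dendrogram at that distance. The names `p_distance` and `cut_upgma` and the example cutoff are assumptions made for illustration; choosing a sensible cutoff is exactly the biological judgment call described above.

```python
from itertools import combinations


def p_distance(a, b, gap="-"):
    """Fraction of mismatching columns, skipping columns with a gap in either sequence."""
    pairs = [(x, y) for x, y in zip(a, b) if x != gap and y != gap]
    if not pairs:
        return 1.0
    return sum(x != y for x, y in pairs) / len(pairs)


def cut_upgma(aligned, cutoff):
    """Average-linkage clustering of aligned sequences, stopped at `cutoff`.

    `aligned` maps copy names to aligned sequences (hypothetical input format).
    Returns a sorted list of sorted name lists, one list per group.
    """
    names = list(aligned)
    dist = {
        frozenset(p): p_distance(aligned[p[0]], aligned[p[1]])
        for p in combinations(names, 2)
    }
    clusters = [[n] for n in names]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two groups.
        total = sum(dist[frozenset((x, y))] for x in c1 for y in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        (i, j), d = min(
            (
                ((i, j), linkage(clusters[i], clusters[j]))
                for i, j in combinations(range(len(clusters)), 2)
            ),
            key=lambda item: item[1],
        )
        if d > cutoff:
            break  # the closest groups are too different to merge
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c) for c in clusters)


aligned = {
    "a": "ACGTACGT",
    "b": "ACGTACGA",
    "c": "TTTTACGT",
    "d": "TTTTACGA",
}
print(cut_upgma(aligned, cutoff=0.2))  # [['a', 'b'], ['c', 'd']]
```

This is quadratic in the number of copies at best, so for thousands of elements a dedicated clustering or phylogeny package would be the practical choice; the sketch only shows what the cut means.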


As it currently stands, I have thousands of TE copies, and their structural features (LTRs, target site duplications, etc.) haven't been characterized yet. I was hoping for suggestions on a clustering program (such as Vmatch, BLASTClust or UCLUST) that could tentatively divide the sequences into families, at which point I could determine a consensus for each.


When I was doing similar work in Arabidopsis, I spoke with biologists who explained the characteristics of LTRs and the other hallmarks of TEs, so that I could start working out a process for identifying the different types. Without those features characterized, you may have to fall back on what is known for a related species. I think all three clustering programs would be reasonable choices, and it would be worth comparing their results before assigning families.
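
One simple way to compare the output of two clustering programs is the Rand index: the fraction of copy pairs that both programs treat the same way (together in both, or apart in both). The sketch below assumes each program's result has been reduced to a dict from copy name to family label; that format, the function name `rand_index` and the example labels are made up for illustration.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Fraction of pairs on which two clusterings agree (1.0 = identical grouping).

    Only copies present in both clusterings are compared.
    """
    names = sorted(set(labels_a) & set(labels_b))
    pairs = list(combinations(names, 2))
    if not pairs:
        raise ValueError("need at least two copies shared by both clusterings")
    agree = sum(
        (labels_a[x] == labels_a[y]) == (labels_b[x] == labels_b[y])
        for x, y in pairs
    )
    return agree / len(pairs)


# Hypothetical family assignments from two different programs.
program_1 = {"c1": "F1", "c2": "F1", "c3": "F2", "c4": "F2"}
program_2 = {"c1": "X", "c2": "X", "c3": "X", "c4": "Y"}
print(rand_index(program_1, program_2))  # 0.5
```

Copies that the programs disagree on are good candidates for manual inspection before families are fixed.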
