Question: How Do You Identify And Classify Novel Repetitive Elements In A De Novo Genome?
9.1 years ago by
Rob Syme

We have a de novo genome assembly and are looking for repetitive elements (ideally transposons) to submit to NCBI and RepBase. So far, the plan is:

  1. Mask known repeats in the genome with RepeatMasker and the RepBase libraries.
  2. De novo repeat finding on the masked genome with RepeatScout, including filtering out low-complexity regions that RepeatMasker didn't pick up.
  3. Filter out repeats that match gene regions (these sequences likely belong to a gene family or are part of a conserved domain).
  4. BLAST each of the repeat sequences identified by RepeatScout against NR, discarding sequences that match genes or previously identified transposons.
  5. Submit the remaining sequences to RepBase and NCBI as unclassified repeats.
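
As a rough illustration of the filtering in steps 3 and 4, here is a minimal Python sketch. It assumes the RepeatScout library is a FASTA file and that the BLAST results were saved in the default 12-column tabular format (`-outfmt 6`). The function names and the e-value cutoff of 1e-5 are hypothetical choices for illustration, not part of any of the tools above, and the sketch drops every sequence with a significant hit without distinguishing genes from known transposons.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs from a FASTA file handle."""
    name, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            # Keep only the first word of the header, as BLAST does for query IDs.
            name, chunks = line[1:].split()[0], []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def queries_with_hits(handle, max_evalue=1e-5):
    """Return the query IDs that have at least one hit at or below max_evalue.

    Assumes default tabular columns: qseqid, sseqid, pident, length, mismatch,
    gapopen, qstart, qend, sstart, send, evalue, bitscore.
    """
    hits = set()
    for line in handle:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if float(fields[10]) <= max_evalue:
            hits.add(fields[0])
    return hits


def filter_repeats(fasta_handle, blast_handle, max_evalue=1e-5):
    """Keep only the repeat sequences that have no significant BLAST hit."""
    discard = queries_with_hits(blast_handle, max_evalue)
    return [(n, s) for n, s in read_fasta(fasta_handle) if n not in discard]


# Tiny in-memory example: R=1 has a strong hit and is dropped, R=2 is kept.
fasta = io.StringIO(">R=1 (RR=2)\nACGT\nACGT\n>R=2\nTTTT\n")
blast = io.StringIO(
    "R=1\tgene1\t99.0\t8\t0\t0\t1\t8\t1\t8\t1e-30\t100\n"
    "R=2\tx\t80.0\t4\t0\t0\t1\t4\t1\t4\t0.5\t20\n"
)
kept = filter_repeats(fasta, blast)
```

In practice the hit descriptions would still need to be inspected by hand, since a hit to a known transposon and a hit to a host gene call for different decisions.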

This process feels incomplete to me, and it doesn't include any classification. Is there a formal process for identifying and classifying repetitive elements in de novo genome assemblies?

8.1 years ago by
Casey Bergman

Try running REPCLASS or TEclass on the output of RepeatScout (or RECON) for classification of putative TEs.

REPCLASS uses both homology (HOM) and structural (STR) information in the input sequences, and also scans the de novo genome assembly with the input library to find target site duplications (TSDs) that are characteristic of TE classes.
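
To make the TSD idea concrete, here is a minimal Python sketch. It is not REPCLASS's actual algorithm: it only looks for an exact direct repeat where the end of the sequence upstream of an insertion matches the start of the sequence downstream. The function name and the length bounds are assumptions chosen for illustration.

```python
def longest_tsd(left_flank, right_flank, min_len=2, max_len=20):
    """Return the longest exact direct repeat flanking an insertion.

    left_flank is the genomic sequence immediately upstream of the element and
    right_flank the sequence immediately downstream. A TSD shows up as the end
    of left_flank matching the start of right_flank. Returns "" if none found.
    """
    left, right = left_flank.upper(), right_flank.upper()
    limit = min(max_len, len(left), len(right))
    # Try the longest candidate first so the first match is the longest one.
    for k in range(limit, min_len - 1, -1):
        if left[-k:] == right[:k]:
            return left[-k:]
    return ""


tsd = longest_tsd("GGCATTACG", "TTACGAAC")  # the flanks share "TTACG"
```

Real TSDs can carry mismatches and their expected length depends on the TE class, so a practical scan would need to tolerate imperfect repeats.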

TEclass uses oligomer frequencies of known TEs to train classifiers for different sequence lengths, which are then applied in series.
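
For a sense of what an oligomer-frequency feature looks like, here is a minimal Python sketch that turns a sequence into a vector of k-mer frequencies. The function name and the default of k=4 are assumptions for illustration; this shows only the feature extraction, not TEclass's classifiers or its training procedure.

```python
from collections import Counter
from itertools import product


def oligomer_frequencies(seq, k=4):
    """Return a dict mapping every A/C/G/T k-mer to its relative frequency.

    Windows containing other characters (such as N) are ignored, so the
    frequencies sum to 1 whenever at least one valid k-mer is present.
    """
    seq = seq.upper()
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    return {m: (counts[m] / total if total else 0.0) for m in kmers}


freqs = oligomer_frequencies("ACGTACGT", k=2)
```

Vectors like this, computed for known TEs of each class, are the kind of input a classifier can be trained on.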

9.1 years ago by
It does not cover classification, but you may find this page useful.

The descriptions there don't go much further than what I had already outlined, but it did link me to the very comprehensive list of tools at the Bergman Lab, which led me to their review article. I might sketch out an answer based on the article later.

