We have a denovo genome assembly, and are looking for repetitive elements (transposons, ideally) for submission to NCBI and RepBase. So far, the plan is:
- Mask known repeats in the genome with RepeatMasker and the RepBase libraries
- Denovo repeat finding on the masked genome with RepeatScout, including filtering out low complexity regions that RepeatMasker didn't pick up.
- Filter out repeats that have matches in gene regions (the sequences are likely to belong to a gene family, or be part of a conserved domain)
- Blast each of the repeat sequences identified by RepeatScout against NR, discarding sequences that match genes or previously identified transposons.
- Submit remaining sequences to RepBase and NCBI as unclassified repeats.
This process feels incomplete to me, and doesn't include any classification. Is there a formal process for identification and classification of repetitive elements in denovo genome assemblies?