I am working on a genomic analysis involving the evolution of repeat sequences in multiple species. To do that, I'm looking at repeat annotations from Ensembl. Unfortunately, there are lots of repeat types. For example, in the maize genome there are a total of 2,528 repeat types. Here are the 20 most common ones (along with the number of features):
dust    1354712
trf     1011673
RLC_opie_AC198173-5898  99893
RLX_ruda_AC202870-7495  74644
RLC_opie_AC201793-7083  72631
RLX_osed_AC191084-2931  71885
RLC_giepum_AC211251-11074       45512
RLC_opie_AC187207-1792  45084
HUCK1-I_ZM      38139
Gypsy-127_ZM-I  37880
RLG_xilon-diguus_AC203313-7774  33678
RLC_opie_AC197201-5474  30489
PREM2_ZM-int    28503
RLC_ji_AC213834-12382   27229
PREM2_ZM-LTR    26621
RLC_ji_AC211489-11215   26263
PREM1_ZM        24026
PREM1A_ZM_LTR   22852
RLX_iwik_AC203371-7824  21615
HUCK1-LTR_ZM    20066
Can somebody help with suggestions on how to classify these repeat types into several broad categories? More specifically, my questions are:
- If you had to classify all repeats into 4-5 (or so) categories, what would they be? e.g. satellite, LTR, etc.
- How would you go about transforming 2,500 feature types into these 4-5 categories?
- Are you aware of any previous work that did something similar? Does my suggested approach even make sense?
 
BTW, I am aware of this documentation page, but did not find it very informative or useful since the definitions are rather loose.
Thanks!
Hello, liorglic, and thank you for the interesting solution. The result is a bed file, correct? How do I generate a true .gtf file instead? Perhaps you could help me out?
Hi there, and sorry for the late reply. It is not trivial to convert a bed file to gtf/gff, as these formats contain additional information. I guess one could create some degenerate form, though. If you provide an example of the expected output, I might be able to help.
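To make the "degenerate form" idea concrete, here is a minimal Python sketch of what such a conversion could look like. It is not the solution discussed above, only an illustration under stated assumptions. The function name `bed_to_gtf_line`, the `source` label `bed2gtf`, the `exon` feature type, and reusing the BED name as both `gene_id` and `transcript_id` are all placeholders chosen for this example. Whether a downstream tool accepts these attributes would need checking. The one non-negotiable part is the coordinate shift: BED starts are 0-based, while GTF starts are 1-based.

```python
def bed_to_gtf_line(bed_line, source="bed2gtf", feature="exon"):
    """Convert one tab-separated BED line into a minimal ("degenerate") GTF line.

    Assumptions (hypothetical choices, not taken from the thread):
    - the BED name column (4th) is reused as gene_id and transcript_id;
    - missing score/strand columns are written as "." (GTF's "no value").
    """
    fields = bed_line.rstrip("\n").split("\t")
    if len(fields) < 3:
        raise ValueError("a BED line needs at least chrom, start and end")

    chrom = fields[0]
    start = int(fields[1]) + 1  # BED start is 0-based; GTF start is 1-based
    end = int(fields[2])        # BED end is exclusive, GTF end is inclusive: same number

    # Optional BED columns: name, score, strand
    name = fields[3] if len(fields) > 3 else f"{chrom}:{start}-{end}"
    score = fields[4] if len(fields) > 4 else "."
    strand = fields[5] if len(fields) > 5 and fields[5] in ("+", "-") else "."

    attributes = f'gene_id "{name}"; transcript_id "{name}";'
    # GTF columns: seqname, source, feature, start, end, score, strand, frame, attributes
    return "\t".join(
        [chrom, source, feature, str(start), str(end), score, strand, ".", attributes]
    )


def bed_to_gtf(bed_lines):
    """Convert an iterable of BED lines, skipping blank, comment and header lines."""
    out = []
    for line in bed_lines:
        if not line.strip() or line.startswith(("#", "track", "browser")):
            continue
        out.append(bed_to_gtf_line(line))
    return out


example = ["# repeats", "chr1\t999\t2000\tRLC_opie\t0\t+"]
print("\n".join(bed_to_gtf(example)))
```

The frame column is always written as ".", since repeat features have no reading frame. Everything beyond the coordinates is filler that a GTF parser expects to see but that a BED file simply does not carry.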
Thank you for reaching out, liorglic!
The structure of the gtf files (for the purposes of velocyto, at least) is as follows:
Not sure what this score is though...
Here is an example from the UCSC browser rmsk output: