Classifying repeat features from Ensembl
Entering edit mode
16 months ago
liorglic ★ 1.4k

I am working on a genomic analysis involving the evolution of repeat sequences in multiple species. To do that, I'm looking at repeat annotations from Ensembl. Unfortunately, there are lots of repeat types. For example, in the maize genome there are a total of 2,528 repeat types. Here are the 20 most common ones (along with the number of features):

dust    1354712
trf     1011673
RLC_opie_AC198173-5898  99893
RLX_ruda_AC202870-7495  74644
RLC_opie_AC201793-7083  72631
RLX_osed_AC191084-2931  71885
RLC_giepum_AC211251-11074       45512
RLC_opie_AC187207-1792  45084
HUCK1-I_ZM      38139
Gypsy-127_ZM-I  37880
RLG_xilon-diguus_AC203313-7774  33678
RLC_opie_AC197201-5474  30489
PREM2_ZM-int    28503
RLC_ji_AC213834-12382   27229
PREM2_ZM-LTR    26621
RLC_ji_AC211489-11215   26263
PREM1_ZM        24026
PREM1A_ZM_LTR   22852
RLX_iwik_AC203371-7824  21615
HUCK1-LTR_ZM    20066

Can somebody help with suggestions on how to classify these repeat types into several broad categories? More specifically, my questions are:

  1. If you had to classify all repeats into 4-5 (or so) categories, what would they be? e.g. satellite, LTR, etc.
  2. How would you go about transforming 2500 feature types into these 4-5 categories?
  3. Are you aware of any previous work that did something similar? Does my suggested approach even makes sense?

BTW, I am aware of this documentation page, but did not find it very informative or useful since the definitions are rather loose.


ensembl repeat • 1.3k views
Entering edit mode
16 months ago
liorglic ★ 1.4k

Update: just in case somebody gets here in the future.
Turns out the repeat class is available from Ensembl, but AFAIK it is only exposed through the Perl, but not the REST API. Each repeat feature has a related "repeat consensus" entry, which in turn has the attributes "repeat_class" and "repeat_type", which contain the class information. For instance, here are the counts of repeats by type in the maize annotation:

1368363 Dust
    392 Low complexity regions
 915503 LTRs
   1883 RNA repeats
  10986 Satellite repeats
  16069 Simple repeats
1021919 Tandem repeats
 202445 Type II Transposons
  24523 Type I Transposons/LINE
   2745 Type I Transposons/SINE
2080702 Unknown

To fetch this information, I used the following Perl script (pardon my Perl coding...):

use strict;
use Bio::EnsEMBL::Registry;
use List::Util qw(min);

my ($species, $group, $out_bed) = @ARGV;
open(FH, '>', $out_bed) or die $!;

my $registry = 'Bio::EnsEMBL::Registry';

if ($group eq "vertebrates"){
    -host => '', # alternatively ''
    -user => 'anonymous');
    -host => '', # alternatively ''
    -port => 4157,
    -user => 'anonymous');

my $W = 10000000;

my $slice_adaptor = $registry->get_adaptor( $species, 'Core', 'Slice',  );
my @slices = @{ $slice_adaptor->fetch_all('toplevel') };
foreach my $slice (@slices) {
    my $chrom = $slice->seq_region_name();
    my $chrom_length = $slice->length();
    my $s = 1;
    my $e = min($W, $chrom_length);
    while ($e <= $chrom_length){
        my $subslice = $slice->sub_Slice($s,$e);
        my @repeats = @{ $subslice->get_all_RepeatFeatures() };
        foreach my $repeat (@repeats) {
            my $rc = $repeat->repeat_consensus();
            printf FH ( "%s\t%d\t%d\t%s\t%s\t%s\t%s\n",
                    $chrom, $repeat->start(), $repeat->end(), $repeat->display_id(), $rc->repeat_class , $rc->repeat_type, $rc->name);
        last if ($e == $chrom_length);
        $s = $e + 1;
        $e = min($s + $W, $chrom_length);

I hope somebody finds this useful some day.

Entering edit mode

Hello, liorglic and thank you for the interesting solution. The result is a bed file, correct? How do I generate a true .gtf file instead, perhaps you could help me out?

Entering edit mode

Hi there, and sorry for the late reply. It is not trivial to convert a bed file to gtf/gff as these formats contain additional information. I guess one could create some degenerate form though. If you provide an example of the expected output, I might be able to help.

Entering edit mode

Thank you for reaching out, liorglic!

The structure of the gtf files (for the purposes of velocyto, at least) is as follows:

 1. seqname - The name of the sequence. Must be a chromosome or scaffold.
 2. source - The program that generated this feature.
 3. feature - The name of this type of feature. Some examples of standard feature types are "CDS" "start_codon" "stop_codon", "exon", "5UTR", "3UTR", "inter", "inter_CNS", "intron_CNS"
 4. start - The starting position of the feature in the sequence. The first base is numbered 1.
 5. end - The ending position of the feature (inclusive).
 6. score - A score between 0 and 1000. If the track line useScore attribute is set to 1 for this annotation data set, the score value will determine the level of gray in which this feature is displayed (higher numbers = darker gray). If there is no score value, enter ".".
 7. strand - Valid entries include "+", "-", or "." (for don't know/don't care).
 8. frame - If the feature is a coding exon, frame should be a number between 0-2 that represents the reading frame of the first base. If the feature is not a coding exon, the value should be ".".
 9. attributes - Each attribute consists of a type/value pair. Attributes must end in a semi-colon, and be separated from any following attribute by exactly one space

Not sure what this score is though...

Here is an example from the UCSC browser rmsk output:

chr1    UCSC_mm10_rmsk       exon    67108753        67108881        239.000000      +       .       gene_id "RLTR17B_Mm"; transcript_id "RLTR17B_Mm";
chr1    UCSC_mm10_rmsk       exon    134217652       134217732       230.000000      -       .       gene_id "BC1_Mm"; transcript_id "BC1_Mm";
chr1    UCSC_mm10_rmsk       exon    8386826 8389555 8310.000000     -       .       gene_id "Lx2"; transcript_id "Lx2";
Entering edit mode
16 months ago
Felix ▴ 90

It sounds like a good approach to me, but I would think you will need more than 4-5 categories.

trf and dust are obviously just tandem repeat and low-complexity regions and the classification is based on the software used to detect them.

For the specific repeat types: In case your institut has access to RepBase you could query the names against it in an semi-automated way.

Otherwise there might be software to help with the classification, e.g. I can see COSEG a "program which automatically identifies repeat subfamilies".


Login before adding your answer.

Traffic: 1803 users visited in the last hour
Help About
Access RSS

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6