RepeatModeler sampling: how does it work?
7.1 years ago
Lesley Sitter ▴ 560

Hi, I am currently using Maker to annotate a de novo genome, and I'm using RepeatModeler to find repeats in this genome. There is, however, very little information about the inner workings of RepeatModeler, and I was wondering if someone could help me clear something up.

Specifically, I was wondering how the sampling works in RepeatModeler. On their webpage they give the following example data:

            Genome DB    Sample***   Run Time*   Models   Models       % Sample
Genome      Size (bp)    Size (bp)   (hh:mm)     Built    Classified   Masked**
----------  -----------  ----------  ----------  -------  -----------  --------
Human HG18     3.1 Bbp      238 Mbp   46:36        614      611         35.66
Zebrafinch     1.3 Bbp      220 Mbp   63:57        233      104          9.41
Sea Urchin     867 Mbp      220 Mbp   40:03       1830      360         33.85
diatom       32,930,227  32,930,227    4:41        128       35          2.86
Rabbit       11,770,949  11,770,949    3:14         83       72         31.30
  *** Sample size does not include 40 Mbp used in the RepeatScout analysis.
      This 40 Mbp is randomly chosen and may overlap 0-100% of the
      sample used in the RECON analysis.

So it's clear that they use 40 Mbp for RepeatScout, but does it sample from the biggest contig, or does it take multiple contigs until it reaches the 40 Mbp?

They then reach roughly 220 Mbp for most tests, but after looking at RepeatModeler's code I found some static limits that seem to contradict these sample sizes. Below are the lines I mean (lines 255 through 263 of the RepeatModeler script):

my $rsSampleSize               = 40000000;     # The size of the sequence given to RepeatScout. ( round #1 )
my $fragmentSize               = 40000;        # The size of the batches for all-vs-other search.
my $genomeSampleSizeStart      = 3000000;      # The initial sample size for RECON analysis
my $genomeSampleSizeMax        = 100000000;    # The max sample size for RECON analysis
my $genomeSampleSizeMultiplier = 3;            # The multiplier for sample size between rounds

These limits suggest that 40 Mbp is sent to RepeatScout, and that 3 Mbp (round 2), 9 Mbp (round 3), 27 Mbp (round 4) and 81 Mbp (round 5) are sent to RECON (because the max sample size is 100 Mbp, I assumed a 6th round can't start). In my train of thought this gives a total of 160 Mbp sampled. So what am I missing in this process?

Thanks in advance for any information regarding this subject :D

RECON RepeatScout subset sampling RepeatModeler • 3.6k views
6.6 years ago

I was wondering this as well, and found your question.

I ran RepeatModeler on my genome (~1 GB), and after checking the log I can see how they got ~220 MB sampled.

There is a round 6 that takes place, but since 81 Mbp × 3 = 243 Mbp is over the static limit you pointed out, round 6 is capped at a sample of 100 Mbp, which is exactly the $genomeSampleSizeMax you mentioned.
Therefore, five rounds are sent to RECON: 3 Mbp + 9 Mbp + 27 Mbp + 81 Mbp + 100 Mbp = 220 Mbp.
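The arithmetic above can be sketched in a few lines. This is a hypothetical Python helper (not RepeatModeler's actual Perl code); the constants are copied from the configuration variables quoted in the question:

```python
# Values taken from the RepeatModeler script quoted in the question.
RS_SAMPLE_SIZE = 40_000_000    # round 1: sequence given to RepeatScout
SAMPLE_START   = 3_000_000     # round 2: initial RECON sample size
SAMPLE_MAX     = 100_000_000   # per-round cap on RECON sample size
MULTIPLIER     = 3             # growth factor between RECON rounds

def recon_schedule(num_rounds):
    """Per-round RECON sample sizes: triple each round, capped at SAMPLE_MAX."""
    sizes = []
    size = SAMPLE_START
    for _ in range(num_rounds):
        sizes.append(min(size, SAMPLE_MAX))
        size *= MULTIPLIER
    return sizes

sizes = recon_schedule(5)                # RECON rounds 2 through 6
print([s // 1_000_000 for s in sizes])   # [3, 9, 27, 81, 100]
print(sum(sizes) // 1_000_000)           # 220
```

The key point is that the cap does not stop round 6 from running; it just clamps that round's sample to 100 Mbp, which is how the totals in the table come out near 220 Mbp.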

RepeatModeler Round # 6
Searching for Repeats
 -- Sampling from the database...
   - Gathering up to 100000000 bp
-- Sample Stats:
       Sample Size 109817094 bp
       Num Contigs Represented = 1196
-- Input Database Coverage: 241713052 bp out of 998723456 bp ( 24.20 % )

I suppose one could always raise that hard limit, but in my case I think sampling 24% of the genome in total is fine.


Thanks, that seems to explain why they end up with ~220 Mbp on two of their runs.

But then how do they reach the 238 Mbp in their human test set? It seems strange that there is no explanation of this crucial part of the program on the website.

