Question: Trying to modify new Repbase .embls to work as RepeatMasker .embl
5.4 years ago, muppetleague (United States) wrote:

Repbase has multiple releases a year, so I'm trying to build a script that reformats the concatenated .embl files into a format similar to the last database released for RepeatMasker. Every line starts with a two-letter code (a couplet) that denotes the type of information the line contains. The nucleotide sequence lines are not shown but are formatted identically.

Here is the format of the latest RepeatMasker .embl

ID   GYPSY68-LTR_AG repeatmasker;    DNA;    ANG; 108 BP.
KW   DNA/.
CC   consensus - See RepBase for additional annotations.
CC   RepeatMasker Annotations:
CC        Type: DNA
CC        SubType:
CC        Species: root
CC        SearchStages:
CC        BufferStages:

And here is the formatting I have achieved so far. The ??? is not always present; where it appears, it denotes the absence of a subfamily. Several entries in the RepeatMasker library have missing fields throughout the annotation section.

ID   IS1     repeatmasker; DNA;  ???;  768 BP.
CC   consensus - See RepBase for additional annotations.
CC   RepeatMasker Annotations:
CC        Type: ARTEFACT
CC        SubType:
CC        Species: root
CC        SearchStages: 10
CC        BufferStages:

And here is my attempt at coding.

input = open('Repbase.embl', 'r')

# Repbase.embl holds the concatenated files of the new release.

output = open('RepeatMaskerLib.embl','w')

statement="""CC ****************************************************************
CC                                                                *
CC   RepeatMasker Database                                        *
CC   (C) 1997-2011  Genetic Information Research Institute        *
CC   All rights reserved                                          *
CC                                                                *
CC   Prepared by: Smit, A., Hubley, R.                            *
CC   See accompanying README.html/README.txt for details.         *
CC                                                                *
CC   RELEASE YEARHERE;                                            *
CC                                                                *
CC   RepeatMasker software and database development and           *
CC   maintenance are currently funded by an NIH/NHGRI             *
CC   R01 grant HG02939-01 to Arian Smit.  RepBase Update          *
CC   development and maintenance are funded by NIH/NLM grant      *
CC   No.2P41LM006252-07A1 to Jerzy Jurka.                         *
CC                                                                *
CC ****************************************************************"""
output.write(statement + "\n")
badlines=('DT','DE','AC','RN','RP','RA', 'KW','RT','RL','DR', 'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')

# Lines that start with these two-letter codes are not used in the RepeatMasker .embl, so they are skipped.

def skip_badman(file):
    for line in file:
        if not line.startswith(badlines):
            yield line
for line in skip_badman(input):

    # Here I'm hijacking the ID line as the place to jump in and reinsert only the comment lines used in the latest released RepeatMasker database file.

    if line.startswith('ID'):        
        new_line = line.split()

        output.write(line.replace('repbase', 'repeatmasker'))
        output.write("CC" + "   " + new_line[1] + " " + new_line[3].replace(';', '') + "\n") 
        output.write("XX" + "\n")
        output.write("DE" + "   RepbaseID: " + new_line[1] + "\n")
        output.write("XX" + "\n")
        output.write("KW" + "   " + new_line[3].replace(';', '') + "/." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + "   consensus - See RepBase for additional annotations." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + "   RepeatMasker Annotations:" + "\n")
        output.write("CC" + "        Type: " + new_line[3].replace(';', '') + "\n")
        output.write("CC" + "        SubType:" + "\n")
        output.write("CC" + "        Species: root" + "\n")
        output.write("CC" + "        SearchStages:" + "\n")
        output.write("CC" + "        BufferStages:" + "\n")
        output.write("XX" + "\n")
        output.write("RC" + "\n")
        ###one-off correction for a 3-letter motif in satellite sequence
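
For comparison, the same idea can be sketched in a self-contained form: drop lines by their two-letter prefix and rebuild the annotation block from the ID line. The names `DROPPED`, `annotation_block`, and `convert` are hypothetical, the sample record is made up, and the field positions assume ID lines laid out like the examples above (with "repbase" in the third field). This is only a sketch under those assumptions, not RepeatMasker's own conversion code.

```python
import io

# Two-letter line codes dropped from the Repbase input (same set as above).
DROPPED = ('DT', 'DE', 'AC', 'RN', 'RP', 'RA', 'KW', 'RT', 'RL', 'DR',
           'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')


def annotation_block(id_line):
    """Build replacement header lines from an ID line.

    Assumed layout: 'ID   NAME repbase; TYPE; SUBFAMILY; LENGTH BP.'
    """
    fields = id_line.split()
    repeat_type = fields[3].rstrip(';')
    return [
        id_line.rstrip('\n').replace('repbase', 'repeatmasker'),
        'KW   ' + repeat_type + '/.',
        'CC   consensus - See RepBase for additional annotations.',
        'CC   RepeatMasker Annotations:',
        'CC        Type: ' + repeat_type,
        'CC        SubType:',
        'CC        Species: root',
        'CC        SearchStages:',
        'CC        BufferStages:',
    ]


def convert(src, dst):
    """Copy src to dst, dropping unwanted lines and expanding ID lines."""
    for line in src:
        if line.startswith(DROPPED):
            continue
        if line.startswith('ID'):
            dst.write('\n'.join(annotation_block(line)) + '\n')
        else:
            dst.write(line)


# Made-up record, for illustration only.
sample = io.StringIO(
    'ID   IS1     repbase; DNA;  ???;  768 BP.\n'
    'DE   some description\n'
    'SQ   Sequence 10 BP;\n'
    '     acgtacgtac        10\n'
    '//\n'
)
result = io.StringIO()
convert(sample, result)
print(result.getvalue())
```

Working on file-like objects keeps the logic testable without touching disk; with real files, `open(...)` inside a `with` block would also make sure the output is flushed and closed.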

I get this error when running RepeatMasker without -lib specified:

Checking for E. coli insertion elements
NCBIBlastSearchEngine::search: Error...compressed subject database (/home/hdd/4/RepeatMasker/Libraries//general/is.lib) does not exist!
 at ./RepeatMasker line 2018.
WARNING: Retrying batch ( 1 ) [ 2,, 12131]...

RepeatMasker creates these .lib files in a temp directory at runtime, but I cannot figure out which field determines which .lib file an entry is written to, or whether other locations are being referenced.

5.4 years ago, SES (Vancouver, BC) wrote:

The error says that the library of insertion sequences (IS) could not be found. You can likely avoid it by running RepeatMasker with the "-no_is" option. It is not clear that this error is related to your library format, so you may want to simply try again without the IS check. Also, I would suggest using a custom FASTA database instead of going to the trouble of recreating this EMBL-like format (unless there is a specific reason to do so). The "util" directory of the distribution contains a script that converts their EMBL-like format to FASTA, which may be a helpful reference if you are tied to the format.
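
To illustrate the FASTA route, here is a minimal sketch of pulling names and sequences out of EMBL-like records. It is not the script from the "util" directory. The function name `embl_to_fasta` and the sample record are hypothetical, the parser assumes each record has an ID line, an SQ line followed by sequence lines, and a closing `//`, and the `name#type` header is an assumption about how a custom library labels its entries.

```python
import io


def embl_to_fasta(src):
    """Yield (header, sequence) pairs from EMBL-like text.

    Assumes ID line, then SQ line, then sequence lines, then '//'.
    """
    name = repeat_type = None
    seq_parts = []
    in_sequence = False
    for line in src:
        if line.startswith('ID'):
            fields = line.split()
            name = fields[1]
            repeat_type = fields[3].rstrip(';')
        elif line.startswith('SQ'):
            in_sequence = True
        elif line.startswith('//'):
            if name is not None:
                yield name + '#' + repeat_type, ''.join(seq_parts)
            name = repeat_type = None
            seq_parts = []
            in_sequence = False
        elif in_sequence:
            # Keep letters only; this drops the spaces and the trailing
            # position counter on each sequence line.
            seq_parts.append(''.join(c for c in line if c.isalpha()))


# Made-up record, for illustration only.
sample = io.StringIO(
    'ID   IS1     repeatmasker; DNA;  ???;  20 BP.\n'
    'CC   RepeatMasker Annotations:\n'
    'SQ   Sequence 20 BP;\n'
    '     acgtacgtac ggttccaagg        20\n'
    '//\n'
)
for header, seq in embl_to_fasta(sample):
    print('>' + header)
    print(seq)
```

The resulting file could then be passed to RepeatMasker with -lib, which sidesteps the EMBL-like format entirely.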


Thanks for answering! I really appreciate your work on the Transposome program. I tried -no_is and -no_low (I forget the exact characters of that flag), and then it stalled out while creating the SINE lib. I was unaware of the formatting script!

Reply written 5.4 years ago by muppetleague

Thanks! So, did you get it working in the end? There should be an easy-to-use script to go from RepBase to RepeatMasker format. It must exist because they generate the files from RepBase (just not with every release). The RepeatMasker folks might have something like this if you ask.

Reply written 5.4 years ago by SES

Oh yeah, it worked great! I had no reason to stay with the embl format, I just had a case of tunnel vision.

Reply written 5.4 years ago by muppetleague

Hi, @muppetleague - wondering if you remember how you managed to get "Repbase .embls to work as RepeatMasker .embl"...

Did you get it to support -species?

Did you get it to work without -no_is?


Reply written 3.4 years ago by Malcolm.Cook