Question

Trying to modify new Repbase .embls to work as RepeatMasker .embl

0

8.8 years ago

muppetleague ▴ 10

Repbase has multiple releases a year, so I'm trying to write a script that reformats the concatenated Repbase .embl files into a format similar to the last library released for RepeatMasker. Every line starts with a two-letter code (couplet) that denotes the type of information it contains. The nucleotide sequence lines are not shown below but are formatted identically.

Here is the format of the latest RepeatMasker .embl

ID   GYPSY68-LTR_AG repeatmasker;    DNA;    ANG; 108 BP.
CC   GYPSY68-LTR_AG DNA
XX
XX
KW   DNA/.
XX
CC   consensus - See RepBase for additional annotations.
XX
CC   RepeatMasker Annotations:
CC        Type: DNA
CC        SubType:
CC        Species: root
CC        SearchStages:
CC        BufferStages:
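
For readers who want to inspect records like the one above programmatically, here is a minimal Python sketch that splits an ID line into its fields. The function names and the dictionary keys (including `third_field`, whose meaning is not stated in this thread) are hypothetical and chosen only for illustration. The field layout is inferred solely from the two ID lines shown in this post, so other releases may differ.

```python
def line_code(line):
    """Return the two-letter code that opens an EMBL-style line."""
    return line[:2]


def parse_id_line(line):
    """Split an ID line into its fields.

    Assumed layout (taken only from the examples in this post):
    ID   <name> <database>; <type>; <third field>; <length> BP.
    """
    if line_code(line) != "ID":
        raise ValueError("not an ID line: %r" % line)
    # Everything after the code, split on the semicolons that separate fields.
    parts = [part.strip() for part in line[2:].split(";")]
    if len(parts) < 4:
        raise ValueError("unexpected ID line layout: %r" % line)
    name, database = parts[0].split()
    return {
        "name": name,
        "database": database,
        "type": parts[1],
        "third_field": parts[2],  # 'ANG' or '???' in the examples shown
        "length": int(parts[3].split()[0]),
    }


example = "ID   GYPSY68-LTR_AG repeatmasker;    DNA;    ANG; 108 BP."
print(parse_id_line(example))
```

Splitting on semicolons rather than on whitespace keeps the parse independent of the variable spacing seen between the two example ID lines.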

And here is the formatting I have achieved so far. The ??? is not always present; where it appears, it denotes the absence of a subfamily. Several entries in the RepeatMasker library have empty fields throughout the annotation section.

ID   IS1     repeatmasker; DNA;  ???;  768 BP.
CC   IS1 DNA
XX
KW   ARTEFACT/.
XX
CC   consensus - See RepBase for additional annotations.
XX
CC   RepeatMasker Annotations:
CC        Type: ARTEFACT
CC        SubType:
CC        Species: root
CC        SearchStages: 10
CC        BufferStages:

And here is my attempt at coding.

#!/usr/bin/python
# Input: the concatenated .embl files of the new Repbase release
input = open('Repbase.embl', 'r')

output = open('RepeatMaskerLib.embl','w')

statement="""CC ****************************************************************
CC                                                                *
CC   RepeatMasker Database                                        *
CC   (C) 1997-2011  Genetic Information Research Institute        *
CC   All rights reserved                                          *
CC                                                                *
CC   Prepared by: Smit, A., Hubley, R.                            *
CC   See accompanying README.html/README.txt for details.         *
CC                                                                *
CC   RELEASE YEARHERE;                                            *
CC                                                                *
CC   RepeatMasker software and database development and           *
CC   maintenance are currently funded by an NIH/NHGRI             *
CC   R01 grant HG02939-01 to Arian Smit.  RepBase Update          *
CC   development and maintenance are funded by NIH/NLM grant      *
CC   No.2P41LM006252-07A1 to Jerzy Jurka.                         *
CC                                                                *
CC ****************************************************************
XX"""
output.write(statement + "\n")
# Comment lines start with two-character codes (couplets); the ones listed
# here are not used in the RepeatMasker .embl and are dropped from the input
badlines=('DT','DE','AC','RN','RP','RA', 'KW','RT','RL','DR', 'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')

def skip_badman(file):
    # Yield only the lines that do not start with one of the codes in badlines
    for line in file:
        if not line.startswith(badlines):
            yield line

for line in skip_badman(input):
    # The ID line is hijacked as the place to jump in and reinsert only the
    # comment lines used in the latest released RepeatMasker database file
    if line.startswith('ID'):
        new_line = line.split()

        output.write(line.replace('repbase', 'repeatmasker'))
        output.write("CC" + "   " + new_line[1] + " " + new_line[3].replace(';', '') + "\n")
        output.write("XX" + "\n")
        output.write("DE" + "   RepbaseID: " + new_line[1] + "\n")
        output.write("XX" + "\n")
        output.write("KW" + "   " + new_line[3].replace(';', '') + "/." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + "   consensus - See RepBase for additional annotations." + "\n")
        output.write("XX" + "\n")
        output.write("CC" + "   RepeatMasker Annotations:" + "\n")
        output.write("CC" + "        Type: " + new_line[3].replace(';', '') + "\n")
        output.write("CC" + "        SubType:" + "\n")
        output.write("CC" + "        Species: root" + "\n")
        output.write("CC" + "        SearchStages:" + "\n")
        output.write("CC" + "        BufferStages:" + "\n")
        output.write("XX" + "\n")
        output.write("RC" + "\n")
    else:
        # One-off correction for a 3-letter motif in satellite sequence
        output.write(line.replace('con','cnn'))

input.close()
output.close()

I get this error when running RepeatMasker without -lib specified:

Checking for E. coli insertion elements
NCBIBlastSearchEngine::search: Error...compressed subject database (/home/hdd/4/RepeatMasker/Libraries//general/is.lib) does not exist!
 at ./RepeatMasker line 2018.
WARNING: Retrying batch ( 1 ) [ 2,, 12131]...

RepeatMasker creates these .lib files in a temp directory at runtime, but I cannot figure out which field determines which .lib an entry is written to, or whether other locations are being referenced.

Repbase RepeatMasker Transposable-elements Python • 4.7k views

ADD COMMENT • link updated 16 months ago by Ram 43k • written 8.8 years ago by muppetleague ▴ 10

Ram · Accepted Answer · 2015-07-07

2

8.8 years ago

SES 8.6k

The error says that the library of insertion sequences (IS) could not be found. You can likely avoid it by running RepeatMasker with the "-no_is" option. It is not clear that this error is related to your library format, so you may want to simply try again without the IS check. I would also suggest using a custom FASTA database (passed with -lib) instead of putting yourself through the trouble of recreating this EMBL-like format, unless there is a specific reason to do so. The "util" directory of the distribution contains a script that converts their EMBL-like format to FASTA, which may be a helpful reference if you are tied to the format.
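
To illustrate the FASTA route suggested above, here is a minimal Python sketch that converts EMBL-like records into FASTA. It is not the script shipped in the "util" directory. The function name `embl_to_fasta` is hypothetical, and the sketch assumes that the class comes from the "Type:" and "SubType:" comment lines shown in the question, that sequence lines follow an SQ line, and that each record ends with "//". Headers are written as `>name#class/subclass`, the naming convention RepeatMasker uses for custom libraries. Check all of these against your own files before relying on it.

```python
import io


def embl_to_fasta(embl_handle, fasta_handle, width=60):
    """Write one FASTA record per EMBL-like record; return the record count."""
    name, rep_type, subtype = None, "", ""
    in_sequence = False
    seq_parts = []
    count = 0
    for line in embl_handle:
        if line.startswith("ID"):
            name = line.split()[1]
        elif line.startswith("CC"):
            body = line[2:].strip()
            if body.startswith("Type:"):
                rep_type = body[len("Type:"):].strip()
            elif body.startswith("SubType:"):
                subtype = body[len("SubType:"):].strip()
        elif line.startswith("SQ"):
            in_sequence = True
        elif line.startswith("//"):
            # End of record: emit it if a name and sequence were collected.
            if name is not None and seq_parts:
                rep_class = rep_type or "Unknown"
                if subtype:
                    rep_class += "/" + subtype
                seq = "".join(seq_parts)
                fasta_handle.write(">%s#%s\n" % (name, rep_class))
                for i in range(0, len(seq), width):
                    fasta_handle.write(seq[i:i + width] + "\n")
                count += 1
            name, rep_type, subtype = None, "", ""
            in_sequence = False
            seq_parts = []
        elif in_sequence:
            # Keep only letters; this drops spaces and trailing position numbers.
            seq_parts.append("".join(ch for ch in line if ch.isalpha()))
    return count


# Small in-memory demonstration with a made-up 12 bp sequence.
demo = io.StringIO(
    "ID   IS1     repeatmasker; DNA;  ???;  12 BP.\n"
    "CC        Type: ARTEFACT\n"
    "CC        SubType:\n"
    "XX\n"
    "SQ   Sequence 12 BP;\n"
    "     ggtgatgctg cc        12\n"
    "//\n"
)
result = io.StringIO()
embl_to_fasta(demo, result)
print(result.getvalue())
```

The resulting file can then be given to RepeatMasker through -lib, which sidesteps the EMBL library entirely.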

ADD COMMENT • link 8.8 years ago by SES 8.6k

0

Thanks for answering! Really appreciate your work on the Transposome program. I tried the -nois and -no_low (forget the exact characters of that flag) and then it stalled out on creating the SINE lib. I was unaware of the formatting script!

ADD REPLY • link updated 16 months ago by Ram 43k • written 8.8 years ago by muppetleague ▴ 10

0

Thanks! So, did you get it working in the end? There should be an easy-to-use script to go from RepBase to RepeatMasker format. It must exist because they generate the files from RepBase (just not with every release). The RepeatMasker folks might have something like this if you ask.

ADD REPLY • link 8.8 years ago by SES 8.6k

0

Oh yeah, it worked great! I had no reason to stay with the embl format, I just had a case of tunnel vision.

ADD REPLY • link 8.8 years ago by muppetleague ▴ 10

0

Hi, @muppetleague - wondering if you remember how you managed to get "Repbase .embls to work as RepeatMasker .embl"...

Did you get it to support -species?

Did you get it to work without -no_is?

Thanks!

ADD REPLY • link 6.8 years ago by Malcolm.Cook ★ 1.5k