Trying to modify new Repbase .embls to work as RepeatMasker .embl
6.1 years ago
muppetleague ▴ 10

Repbase has multiple releases a year, so I'm trying to build a script that reformats the concatenated .embl files into a format similar to the last database released for RepeatMasker. Every line starts with a two-letter code that denotes the type of information it contains. The nucleotide sequence lines are not shown, but they are formatted identically in both files.

Here is the format of the latest RepeatMasker .embl

ID   GYPSY68-LTR_AG repeatmasker;    DNA;    ANG; 108 BP.
CC   GYPSY68-LTR_AG DNA
XX
XX
KW   DNA/.
XX
CC   consensus - See RepBase for additional annotations.
XX
CC        Type: DNA
CC        SubType:
CC        Species: root
CC        SearchStages:
CC        BufferStages:

And here is the formatting I have achieved so far. The ??? is not always present; where it appears, it denotes the absence of a subfamily. There are several entries in the RepeatMasker library with missing fields for the entire annotation section.

ID   IS1     repeatmasker; DNA;  ???;  768 BP.
CC   IS1 DNA
XX
KW   ARTEFACT/.
XX
CC   consensus - See RepBase for additional annotations.
XX
CC        Type: ARTEFACT
CC        SubType:
CC        Species: root
CC        SearchStages: 10
CC        BufferStages:
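
For reference, the leading two-letter code and the ID fields in both examples above can be pulled apart with a few lines of Python. This is only a sketch: the function names are made up for illustration, and it assumes the ID line always has the shape "ID name source; type; subfamily; N BP." as in the two entries shown, with the third field being what the post calls the subfamily.

```python
def split_embl_line(line):
    """Return (code, content) for one EMBL-style line."""
    line = line.rstrip("\n")
    return line[:2], line[2:].strip()


def parse_id_line(line):
    """Parse an ID line of the shape shown in the examples above.

    Assumed layout: ID <name> <source>; <type>; <subfamily>; <length> BP.
    """
    code, content = split_embl_line(line)
    if code != "ID":
        raise ValueError("not an ID line: %r" % line)
    fields = [f.strip() for f in content.split(";")]
    name, source = fields[0].split()
    return {
        "name": name,
        "source": source,
        "type": fields[1],
        # '???' marks a missing subfamily in the reformatted output
        "subfamily": None if fields[2] == "???" else fields[2],
        "length": int(fields[3].split()[0]),
    }
```

Entries that are missing fields entirely would need extra handling; this sketch would raise an error on them rather than guess.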

And here is my attempt at coding.

#!/usr/bin/python
# Concatenated files of the new Repbase release
infile = open('Repbase.embl', 'r')
# The output file was never opened in my first paste; this name is just a placeholder
output = open('RepeatMasker.embl', 'w')

statement = """CC ****************************************************************
CC                                                                *
CC   (C) 1997-2011  Genetic Information Research Institute        *
CC                                                                *
CC   Prepared by: Smit, A., Hubley, R.                            *
CC                                                                *
CC   RELEASE YEARHERE;                                            *
CC                                                                *
CC   RepeatMasker software and database development and           *
CC   maintenance are currently funded by an NIH/NHGRI             *
CC   R01 grant HG02939-01 to Arian Smit.  RepBase Update          *
CC   development and maintenance are funded by NIH/NLM grant      *
CC   No.2P41LM006252-07A1 to Jerzy Jurka.                         *
CC                                                                *
CC ****************************************************************
XX"""
output.write(statement + "\n")

# Line types to drop from the Repbase entries (str.startswith accepts a tuple)
badlines = ('DT', 'DE', 'AC', 'RN', 'RP', 'RA', 'KW', 'RT', 'RL', 'DR',
            'FH', 'FT', 'OS', 'OC', 'NM', 'CC', 'RX', 'RC')

for line in infile:
    # Here I'm hijacking the ID line as the place to jump in and reinsert only
    # the comment lines used in the latest released RepeatMasker database file
    if line.startswith('ID'):
        new_line = line.split()
        name = new_line[1]
        rep_type = new_line[3].replace(';', '')
        output.write("CC   " + name + " " + rep_type + "\n")
        output.write("XX\n")
        output.write("DE   RepbaseID: " + name + "\n")
        output.write("XX\n")
        output.write("KW   " + rep_type + "/.\n")
        output.write("XX\n")
        output.write("CC   consensus - See RepBase for additional annotations.\n")
        output.write("XX\n")
        output.write("CC   RepeatMasker Annotations:\n")
        output.write("CC        Type: " + rep_type + "\n")
        output.write("CC        SubType:\n")
        output.write("CC        Species: root\n")
        output.write("CC        SearchStages:\n")
        output.write("CC        BufferStages:\n")
        output.write("XX\n")
        output.write("RC\n")
    # badlines was defined but never used before; skipping those prefixes is the intent
    elif not line.startswith(badlines):
        # One-off correction for a 3-letter motif in satellite sequence
        output.write(line.replace('con', 'cnn'))

infile.close()
output.close()

I get this error when running RepeatMasker without -lib specified:

Checking for E. coli insertion elements
NCBIBlastSearchEngine::search: Error...compressed subject database (/home/hdd/4/RepeatMasker/Libraries//general/is.lib) does not exist!
WARNING: Retrying batch ( 1 ) [ 2,, 12131]...

RepeatMasker creates these .lib files in a temp directory at runtime, but I cannot figure out which field determines which .lib file an entry is written to, or whether other locations are being referenced.

Transposable elements RepeatMasker Repbase Python • 3.4k views
6.0 years ago
SES 8.5k

The error is saying that the library of insertion sequences could not be found. You could likely avoid this by running RepeatMasker with the "-no_is" option. It is not entirely clear that this error is related to your library format, so you may want to simply try again without the IS check. Also, I would suggest using a custom FASTA database instead of putting yourself through the trouble of recreating this EMBL-like format (unless there is a specific reason to do so). There is a script in the "util" directory of the distribution that converts their EMBL-like format to FASTA, and that may be a helpful reference if you are tied to the format.
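
To give an idea of what the FASTA route involves, here is a rough Python sketch. It is not the script shipped in "util". It assumes the usual EMBL layout, where sequence lines follow the SQ line and the entry ends at "//", and the function name embl_to_fasta is hypothetical. RepeatMasker's documentation describes a ">name#class/subclass" header convention for custom libraries; this sketch only emits the name, so the class would still have to be added.

```python
def embl_to_fasta(lines, width=60):
    """Yield FASTA-formatted records from an iterable of EMBL-style lines."""
    name = None
    in_seq = False
    chunks = []
    for line in lines:
        if line.startswith("ID"):
            # Second whitespace-separated token of the ID line is the entry name
            name = line.split()[1]
        elif line.startswith("SQ"):
            # Sequence lines start after the SQ header
            in_seq = True
            chunks = []
        elif line.startswith("//"):
            # End of entry: emit the record if a sequence was collected
            if name is not None and chunks:
                seq = "".join(chunks)
                body = "\n".join(seq[i:i + width]
                                 for i in range(0, len(seq), width))
                yield ">%s\n%s\n" % (name, body)
            name, in_seq, chunks = None, False, []
        elif in_seq:
            # Keep letters only: drops the spacing and trailing position count
            chunks.append("".join(c for c in line if c.isalpha()))
```

With the input and output files opened as src and out, usage would be along the lines of out.writelines(embl_to_fasta(src)), and the resulting file can then be passed to RepeatMasker with -lib.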


Thanks for answering! Really appreciate your work on the Transposome program. I tried the -nois and -no_low (forget the exact characters of that flag) and then it stalled out on creating the SINE lib. I was unaware of the formatting script!


Thanks! So, did you get it working in the end? There should be an easy-to-use script to go from RepBase to RepeatMasker format. It must exist because they generate the files from RepBase (just not with every release). The RepeatMasker folks might have something like this if you ask.


Oh yeah, it worked great! I had no reason to stay with the embl format, I just had a case of tunnel vision.


Hi, @muppetleague - wondering if you remember how you managed to get "Repbase .embls to work as RepeatMasker .embl"...

Did you get it to support -species?

Did you get it to work without -no_is?

Thanks!