No SINEs, LINEs, LTR and DNA elements detected by RepeatMasker
7 weeks ago
Shri hari ▴ 30

Hi, I am trying to mask repeats in a plant genome with RepeatMasker, using the following command:

./RepeatMasker -species cocos -s -a -poly -dir /home/dic/Desktop/software /home/dic/Desktop/softwares/coconut/GCA_008124465.1_ASM812446v1_genomic.fna

I am using the combined Dfam and RepBase library. Still, the output below shows no LINEs, SINEs, etc. detected. Does anybody have any ideas about this? Many thanks. I'm using "RepeatMasker version open-4.1."

output:

sequences:         59328
total length: 1839172334 bp  (1567667895 bp excl N/X-runs)
GC level:         37.31 %
bases masked:   19191682 bp ( 1.04 %)

Retroelements        47004     19090771 bp    1.04 %
SINEs:                0            0 bp    0.00 %
Penelope              0            0 bp    0.00 %
LINEs:                0            0 bp    0.00 %
CRE/SLACS            0            0 bp    0.00 %
L2/CR1/Rex          0            0 bp    0.00 %
R1/LOA/Jockey       0            0 bp    0.00 %
R2/R4/NeSL          0            0 bp    0.00 %
RTE/Bov-B           0            0 bp    0.00 %
L1/CIN4             0            0 bp    0.00 %
LTR elements:     47004     19090771 bp    1.04 %
BEL/Pao             0            0 bp    0.00 %
Ty1/Copia       47004     19090771 bp    1.04 %
Gypsy/DIRS1         0            0 bp    0.00 %
Retroviral        0            0 bp    0.00 %

DNA transposons          0            0 bp    0.00 %
hobo-Activator        0            0 bp    0.00 %
Tc1-IS630-Pogo        0            0 bp    0.00 %
En-Spm                0            0 bp    0.00 %
MuDR-IS905            0            0 bp    0.00 %
PiggyBac              0            0 bp    0.00 %
Tourist/Harbinger     0            0 bp    0.00 %
Other (Mirage,        0            0 bp    0.00 %
P-element, Transib)

Rolling-circles          0            0 bp    0.00 %

Unclassified:            5          264 bp    0.00 %

Total interspersed repeats:    19091035 bp    1.04 %

Small RNA:            1033       100647 bp    0.01 %

Satellites:              0            0 bp    0.00 %
Simple repeats:          0            0 bp    0.00 %
Low complexity:          0            0 bp    0.00 %


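Hi, I got some results for coconut. By the way, it looks like your input file was truncated: your run reports 59328 sequences and a total length of 1839172334 bp, whereas the assembly file I got has 111366 sequences and 2202455121 bp.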
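One way to check for a truncated download is to count the sequences and bases in the FASTA file yourself and compare them with the "sequences" and "total length" lines of the two summaries in this thread. This is only a minimal sketch: the function name `fasta_stats` is made up for illustration, and it assumes a plain or gzip-compressed FASTA file.

```python
import gzip
from pathlib import Path


def fasta_stats(path):
    """Return (number of sequences, total number of bases) for a FASTA file."""
    path = Path(path)
    # Assumption: a ".gz" suffix means the file is gzip-compressed.
    opener = gzip.open if path.suffix == ".gz" else open
    n_seqs = 0
    total_bases = 0
    with opener(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            if line.startswith(">"):
                n_seqs += 1  # every header line starts a new sequence
            else:
                total_bases += len(line)  # N and X are counted as bases too
    return n_seqs, total_bases


# Example call (the file name is the one used in this thread):
# print(fasta_stats("GCA_008124465.1_ASM812446v1_genomic.fna"))
```

If the numbers you get are lower than the 111366 sequences and 2202455121 bp reported for the full assembly, the file was most likely cut off during download or decompression.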


It finished repeat masking successfully. I think the masking percentage was far too low because the libraries lacked the TE elements. Could you please share the results?


I have uploaded the result data to figshare, hope it is useful.


Thank you for your kind help.

7 weeks ago

I have made the results available on figshare:

Dondrup, Michael (2021): Coconut Repeat Analysis. figshare. Dataset. https://doi.org/10.6084/m9.figshare.14540553.v1

I ran the de novo pipeline; here are my results. Almost 80% of the genome gets masked. The largest group is LTR elements. I can give you the results if you need them.

==================================================
file name: GCA_008124465.1_ASM812446v1_genomic.fna
sequences:        111366
total length: 2202455121 bp  (2147245538 bp excl N/X-runs)
GC level:         37.30 %
bases masked: 1745486145 bp ( 79.25 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
Retroelements       642929   1048817449 bp   47.62 %
SINEs:                0            0 bp    0.00 %
Penelope              0            0 bp    0.00 %
LINEs:            19088     10652105 bp    0.48 %
CRE/SLACS            0            0 bp    0.00 %
L2/CR1/Rex          0            0 bp    0.00 %
R1/LOA/Jockey       0            0 bp    0.00 %
R2/R4/NeSL          0            0 bp    0.00 %
RTE/Bov-B        4949      1398561 bp    0.06 %
L1/CIN4         14139      9253544 bp    0.42 %
LTR elements:    623841   1038165344 bp   47.14 %
BEL/Pao             0            0 bp    0.00 %
Ty1/Copia      396847    758466549 bp   34.44 %
Gypsy/DIRS1    221729    275864286 bp   12.53 %
Retroviral        0            0 bp    0.00 %

DNA transposons      56974     47203403 bp    2.14 %
hobo-Activator     5504      3990626 bp    0.18 %
Tc1-IS630-Pogo        0            0 bp    0.00 %
En-Spm                0            0 bp    0.00 %
MuDR-IS905            0            0 bp    0.00 %
PiggyBac              0            0 bp    0.00 %
Tourist/Harbinger   280       181687 bp    0.01 %
Other (Mirage,        0            0 bp    0.00 %
P-element, Transib)

Rolling-circles       4250      2781242 bp    0.13 %

Unclassified:       1315850    632569142 bp   28.72 %

Total interspersed repeats:  1728589994 bp   78.48 %

Small RNA:               0            0 bp    0.00 %

Satellites:              0            0 bp    0.00 %
Simple repeats:     243326     12057689 bp    0.55 %
Low complexity:      39003      2057220 bp    0.09 %
==================================================

* most repeats fragmented by insertions or deletions
have been counted as one element

RepeatMasker version 4.1.1 , sensitive mode

run with rmblastn version 2.10.0+
The query was compared to classified sequences in "GCA_008124465.1_ASM812446v1_genomic.fna-families.fa"


Edit: Looking at your output, it seems like something else is wrong. 0% satellites, simple repeats, and low complexity? This cannot be correct. Please check your installation and input sequence; something is very odd.

Are you sure there are any annotated families in RepBase and Dfam for your species, and if so, how good are they? The problem with the -species switch is that RepeatMasker does not tell you how many repeats are annotated, or whether there are any at all. In my experience, for all but a handful of model species the -species switch rarely does any good; instead, one has to run RepeatModeler first to predict repeats de novo and then run RepeatMasker with the resulting library. I would therefore recommend running repeat detection de novo every time. It also makes multi-species comparisons possible, because otherwise the result depends mostly on the quality of the database entries and not on the genome sequence.
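The de novo route described above can be sketched as three commands. The snippet below only assembles and prints them; it does not run anything. Treat it as an outline under assumptions: `denovo_commands`, the database name `coconut`, the thread count and the output directory are made up for illustration, and the parallel option of RepeatModeler differs between versions (older 2.0.x releases take `-pa`, newer ones `-threads`), so check the help text of your installation.

```python
import shlex


def denovo_commands(genome, db_name, threads=8, out_dir="repeatmasker_out"):
    """Assemble the command lines for a RepeatModeler -> RepeatMasker run.

    Nothing is executed here; the commands are only built as argument lists.
    """
    # RepeatModeler writes its consensus library as "<database name>-families.fa".
    library = f"{db_name}-families.fa"
    return [
        # 1. Build the sequence database that RepeatModeler works on.
        ["BuildDatabase", "-name", db_name, genome],
        # 2. Predict repeat families de novo (assumption: "-pa" is the
        #    parallel option of the installed RepeatModeler version).
        ["RepeatModeler", "-database", db_name, "-pa", str(threads)],
        # 3. Mask the genome with the custom library instead of -species.
        ["RepeatMasker", "-lib", library, "-s", "-pa", str(threads),
         "-dir", out_dir, genome],
    ]


for cmd in denovo_commands("GCA_008124465.1_ASM812446v1_genomic.fna", "coconut"):
    print(shlex.join(cmd))
```

The point of the last step is that `-lib` replaces `-species`, so the masking no longer depends on what happens to be annotated in the database for the species.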


Yes, I do feel there is some problem with the result. After seeing it, I tried to run RepeatModeler so that I could use a custom repeat library, but RepeatModeler fails during the classification step with an error about a missing repeatmasker.lib.nsq. I'm totally confused.


It is not that easy to install these toolsets from scratch. Could it be that you installed them via Conda? That is generally not advisable; I think the RepeatModeler package in Conda is still broken. I had to install everything from scratch, meticulously following the installation instructions for each tool and its dependencies (including getting the exact versions required). A lot of people also seem to forget that they need to run the configuration scripts in the correct order: first RepeatMasker, then RepeatModeler (RepeatModeler needs the RepeatMasker libraries).
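A quick way to see whether the RepeatMasker configuration step actually produced the library index is to look for the files directly. This is a rough sketch with several assumptions: that the file named in the error is a nucleotide BLAST index (the `.nhr`, `.nin` and `.nsq` files that belong together), that it is expected next to `RepeatMasker.lib` in the `Libraries` directory of the RepeatMasker installation, and that the path in the example is replaced by your own. The function name `missing_blast_index` is made up for illustration.

```python
from pathlib import Path

# The three files that make up a classic nucleotide BLAST database index.
BLAST_NUCL_EXTS = (".nhr", ".nin", ".nsq")


def missing_blast_index(libraries_dir, stem="RepeatMasker.lib"):
    """Return the index file names for `stem` that are absent from `libraries_dir`."""
    libraries_dir = Path(libraries_dir)
    return [
        stem + ext
        for ext in BLAST_NUCL_EXTS
        if not (libraries_dir / (stem + ext)).is_file()
    ]


# Example call (hypothetical installation path, adjust to your system):
# print(missing_blast_index("/path/to/RepeatMasker/Libraries"))
```

An empty list means all three files are present. If any are missing, rerunning the RepeatMasker configuration and then the RepeatModeler configuration, in that order, is the first thing I would try.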

If you can provide the complete debugging output, I may be able to help better. Alternatively, you could try running the tools in GenSAS. These are not the newest versions, but they should be OK.


I'm trying to run RepeatModeler to prepare a custom TE library. Could you please share the results and the library, if possible?