3.8 years ago
SeaStar ▴ 50

Hello! I'm analyzing the genome of a cephalopod. I have my genome.fa and a custom repeat library. I ran RepeatMasker with this command:

$:~/RepeatMasker -lib repeatlib.fa -dir output_file mygenome.fa


Is this correct, or do I have to add something like the species? I ask because the generated output appears to contain no classified elements:

==================================================
file name: mygenome.fa
sequences:          1000
total length:    1052553 bp  (1041046 bp excl N/X-runs)
GC level:         34.60 %
bases masked:     697079 bp ( 66.23 %)
==================================================
               number of      length   percentage
               elements*    occupied  of sequence
--------------------------------------------------
SINEs:                0            0 bp    0.00 %
      ALUs            0            0 bp    0.00 %
      MIRs            0            0 bp    0.00 %

LINEs:                0            0 bp    0.00 %
      LINE1           0            0 bp    0.00 %
      LINE2           0            0 bp    0.00 %
      L3/CR1          0            0 bp    0.00 %

LTR elements:         0            0 bp    0.00 %
      ERVL            0            0 bp    0.00 %
      ERVL-MaLRs      0            0 bp    0.00 %
      ERV_classI      0            0 bp    0.00 %
      ERV_classII     0            0 bp    0.00 %

DNA elements:         0            0 bp    0.00 %
     hAT-Charlie      0            0 bp    0.00 %
     TcMar-Tigger     0            0 bp    0.00 %

Unclassified:      5436       722760 bp   68.67 %

Total interspersed repeats:   722760 bp   68.67 %

Small RNA:            0            0 bp    0.00 %

Satellites:           0            0 bp    0.00 %
Simple repeats:    1511        93735 bp    8.91 %
Low complexity:       0            0 bp    0.00 %
==================================================

* most repeats fragmented by insertions or deletions
have been counted as one element

The query species was assumed to be homo

run with rmblastn version 2.6.0+
The query was compared to unclassified sequences in ".../repeatlib.fa"


thank you!!

I think that for elements to show up in the table, the FASTA headers in the repeat library need to follow a specific format, e.g.:

>seq1#LTR/ERV1
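
If the library headers lack that `#Class/Family` suffix, one way to add it is a small script. The following is only a minimal sketch: the `PREFIX_TO_CLASS` table and the helper names `classify` and `tag_header` are hypothetical, and the class labels in the table are my assumption, so check them against the classification you actually want before using it.

```python
# Hypothetical prefix-to-class table (an assumption, not part of RepeatMasker).
# Extend or correct it for your own library.
PREFIX_TO_CLASS = {
    "Gypsy": "LTR/Gypsy",
    "Penelope": "LINE/Penelope",
}


def classify(name):
    """Guess a class/family label from the start of a repeat name."""
    for prefix, label in PREFIX_TO_CLASS.items():
        if name.startswith(prefix):
            return label
    return "Unknown"


def tag_header(line):
    """Rewrite '>name rest' as '>name#Class/Family rest'.

    Sequence lines and headers that already contain '#' are returned unchanged.
    """
    line = line.rstrip("\n")
    if not line.startswith(">"):
        return line
    name, sep, rest = line[1:].partition(" ")
    if "#" in name:
        return line
    return ">" + name + "#" + classify(name) + sep + rest


def tag_library(in_path, out_path):
    """Copy a FASTA library, tagging every header line on the way."""
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            dst.write(tag_header(line) + "\n")
```

For example, `tag_library("repeatlib.fa", "repeatlib.tagged.fa")` would write a tagged copy that could then be passed to `-lib`. Names that match no prefix get the label `Unknown`.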

The masking did happen; see this line:

bases masked:     697079 bp ( 66.23 %)


But as microfuge pointed out, the summary table may be incomplete simply because RepeatMasker cannot classify the repeats it found. That is not a big issue: the most important thing is that it masked what needed to be masked.

This is correct. The default summary table mostly checks for human repeat classes. There is a script called buildSummary.pl in the util folder of RepeatMasker that builds a better summary from the .out files.

For an example of its output, see: RepeatMasker: understanding buildSummary.pl output

OK. So the elements are not reported in this table, but I'll probably find them in mygenome.out.fa, right? The .out.tbl file is not essential for me, so I don't need to build the new summary.

I don't know by heart, but there is certainly an output file (possibly the out.tbl?) that records which elements were used to mask each region, using the FASTA IDs from the library you provided.
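
As far as I know, the per-hit annotation is the file ending in `.out`, while the `.tbl` file is only the summary table. The sketch below shows how its rows could be tallied per class/family, assuming the usual whitespace-separated layout (score, three percentages, query name, begin, end, left, strand, repeat name, class/family, ...). The function name `tally_out` and the rows in `SAMPLE` are made up for illustration and are not real output.

```python
from collections import Counter

# Made-up rows in the layout of a RepeatMasker .out file (illustration only).
SAMPLE = """\
   SW   perc perc perc  query      position in query     matching        repeat        position in repeat
score   div. del. ins.  sequence   begin  end   (left)   repeat          class/family  begin  end  (left)  ID

  463   1.3  0.6  1.7   scaffold1      1   86   (914) +  Gypsy-5-I_BF1   LTR/Gypsy         1   86    (0)   1
  820   2.0  0.0  0.0   scaffold1    200  466   (534) C  Penelope-9_HM   LINE/Penelope   (0)  267      1   2
  300   5.0  1.0  0.0   scaffold2     10   59   (941) +  Gypsy-5-I_BF1   LTR/Gypsy        20   69   (17)   3
"""


def tally_out(lines):
    """Count hits and query bases per class/family from .out-style lines."""
    hits = Counter()
    bases = Counter()
    for line in lines:
        fields = line.split()
        # Skip blank lines and the header lines, which do not start with a score.
        if len(fields) < 15 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        family = fields[10]  # the class/family column
        hits[family] += 1
        bases[family] += end - begin + 1
    return hits, bases


if __name__ == "__main__":
    hits, bases = tally_out(SAMPLE.splitlines())
    for family in sorted(hits):
        print(family, hits[family], bases[family])
```

Note that this counts every row separately and does not merge fragments or subtract overlaps, so the numbers will not match the summary table exactly. For a real run, pass the lines of the `.out` file instead of `SAMPLE`.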

Here are some elements from my library as an example:

>Gypsy-5-I_BF1 RB:3e-08 89% 86
GGTCAATAGGAGGTTGGATCTTAGTTGGCAGGGTGGTTTTATATTTCCTGCCATTCAGCATTTCTGCTGGGGATTTCATGTCAGCT
>Penelope-9_HM_Penelope_Hydra1 RB:2e-08 88% 267
AAGTTTCGTAAATCGCCATACAAGAACCAACATTTGAAATATCTTAATACTGTTACCAAACAAGTGAAAAGTGATAAAGGAATTTTCGTTAAATCTGACAAGACTAGAAATATTTATAAACTGAATAAGGAGCATTACATGAATTTACTTAGGAAGGAGATTGAAAAAAATTATAAAATTACAAATGGATGGACGCTCAGAAAGACCAATTTGGATGTTAAGAAACTAATGGAGAAATATAATATTGCGGACAGAACTGAACCTATA


Is the program not able to recognize elements like these?