Tutorial for de-novo repeat library construction
The RepeatMasker software includes many repeat libraries. You can query them using:
queryTaxonomyDatabase.pl -h
queryRepeatDatabase.pl -h
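For example (assuming the standard -species option of these RepeatMasker utilities; the species name is only an illustration), you could check and extract what is available for your taxon with:
queryTaxonomyDatabase.pl -species "drosophila melanogaster"
queryRepeatDatabase.pl -species "drosophila melanogaster" > drosophila.lib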
If there is no repeat library available for your species, you may want to create your own.
Prerequisites (except for ProtExcluder, they are all available via Bioconda)
- A lot of time (the RepeatModeler and TransposonPSI steps can run for days depending on the provided resources)
- RepeatModeler
(installation might be difficult; **/!\ do not use the build1 recipe in Bioconda, it doesn't work properly**. With Conda, use repeatmodeler-1.0.11 build pl526_2 or later, as the previous build is bugged)
- TransposonPSI
- ProtExcluder
- blastp, blastx
- gaas_fasta_removeSeqFromIDlist.pl from GAAS.
- The fasta genome for which you want to define the repeats
1) De-novo - RepeatModeler:
/!\ RepeatModeler uses RepeatMasker for the classification steps at the end. Without a complete installation of RepeatMasker you will end up with the file consensi.fa instead of consensi.fa.classified. So, if you installed RepeatModeler via conda you will get this error: Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq! Indeed, this nucleotide library is not included by default. People tend to use RepBase as the DB, but it has required a license since last year. So, if you wish to perform this classification step successfully, please add a DB. See here for other details.
BuildDatabase -name genome -engine ncbi genome.fa
RepeatModeler -database genome -engine ncbi
You can use the option -pa to parallelise and speed it up a bit, as in the example below. This is the longest step. At the end you should have a file called consensi.fa.classified.
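For instance, a complete run on 8 cores might look like this (the core count and log file name are only illustrative):
BuildDatabase -name genome -engine ncbi genome.fa
RepeatModeler -database genome -engine ncbi -pa 8 >& repeatmodeler.log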
2) Filtering repeats:
The de-novo identification has a major drawback: repeats are not always derived from ‘junk’ in the genome, but can also be part of actual protein-coding genes. It is therefore recommended to check the repeats against a comprehensive set of ‘real’ proteins from related organisms. If you are unsure which protein data set to run this against, simply use the one you were going to use for annotation. We call it <proteins.fa> here.
2.1) Mine (retro-)transposon protein homologies
transposonPSI.pl <proteins.fa> prot
You should get a <proteins.fa>.TPSI.topHits file as output. From the resulting list, a collection of accession numbers with similarities to transposons can be generated.
awk '{if($0 ~ /^[^\/\/.*]/) print $5}' <proteins.fa>.TPSI.topHits | sort -u > accessions.list
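As a quick sanity check (purely illustrative), you can count and inspect the flagged accessions:
wc -l accessions.list
head accessions.list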
2.2) Remove TEs from the proteome. gaas_fasta_removeSeqFromIDlist.pl is from the GAAS repo.
gaas_fasta_removeSeqFromIDlist.pl -f <proteins.fa> -l accessions.list -o proteins.filtered.fa
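To verify that the filtering worked, you can compare the number of sequences before and after; the difference should match the number of accessions in accessions.list:
grep -c ">" <proteins.fa>
grep -c ">" proteins.filtered.fa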
2.3) Blast proteome against RepeatModeler library
makeblastdb -in proteins.filtered.fa -dbtype prot
blastx -db proteins.filtered.fa -query consensi.fa.classified -out blastx.out
You can use the -num_threads parameter to speed up the blast step, as shown below.
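For example, with 8 threads (an arbitrary choice):
blastx -db proteins.filtered.fa -query consensi.fa.classified -out blastx.out -num_threads 8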
2.4) Remove hits from RepeatModeler library
Remark: the ProtExcluder manual says the package was developed using blastx output from ncbi-blast-2.2.28+ but is compatible with ncbi-blast-2.4.0+. I tried with blast 2.9.0+ and got an Illegal division by zero error. I tried blast 2.7.1+ and it works.
ProtExcluder.pl blastx.out consensi.fa.classified
The result should be a filtered repeat library called consensi.fa.classifiednoProtFinal. You can rename it or symlink it to the name of your choice, e.g. myrepeatlib.fa, as in the example below.
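For instance (the library name is only an example), you could symlink the result and later point RepeatMasker at it with its -lib option:
ln -s consensi.fa.classifiednoProtFinal myrepeatlib.fa
RepeatMasker -pa 8 -lib myrepeatlib.fa genome.fa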
Great, thanks! Is the result of this as powerful as MAKER's advanced repeat library preparation? Many of the steps are similar, but this is smaller.
You mean Repeat Library Construction-Advanced? For sure the approach I present here is less advanced, mainly because I don't look for LTR and MITE elements. Their filtering steps are also more advanced. We can say it is a medium standard approach, somewhere between the MAKER basic and the MAKER advanced one. If you use RepeatModeler version 2, it can also look at LTRs.
I ran
RepeatModeler -engine ncbi -database Repeats -pa 8
but I cannot find consensi.fa.classified. Furthermore, the output folder contains the following files. What did I miss?
See step 1, I will make it clearer.
Thank you. I did:
Where can I find a free RepBase alternative for academics? Or can I somehow create it from Dfam?
Thank you in advance,
Actually, Dfam.hmm is needed only if you use nhmmer as the search engine in RepeatMasker. Otherwise it is the Dfam.embl file that is used. You then need buildRMLibFromEMBL.pl to create the Fasta file from it, and then you run
makeblastdb -dbtype nucl ...
Are the below steps correct?
How to let RepeatModeler know to use Dfam?
Thank you in advance
The lib must be called
RepeatMasker.lib
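A minimal sketch of the whole sequence, assuming a conda install of RepeatMasker (the paths below are examples; adjust them to where your Libraries directory and the util scripts actually live):
cd ${CONDA_PREFIX}/share/RepeatMasker/Libraries/
${CONDA_PREFIX}/share/RepeatMasker/util/buildRMLibFromEMBL.pl Dfam.embl > RepeatMasker.lib
makeblastdb -dbtype nucl -in RepeatMasker.lib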
Thanks @Juke-34 for the detailed steps. I ran RepeatModeler on my genome but, because I am using a local computer, I stopped the run at round 5 and then used RepeatClassifier to generate the consensi.fa.classified. Now I have tried to use transposonPSI.pl on the protein.fasta file obtained from UniProt, but I kept getting this error message: Error, formatdb -i transposonPSI.9003.../contig_1_pilon_pilon_pilon/contig_1_pilon_pilon_pilon.seq -p F (ret -1) at .../TransposonPSI_08222010/transposonPSI.pl line 115, <$filehandle> line 1. I don't know what this error message means or how to solve it. Thanks
Never seen this error before, sorry
@eennadi, I am having this error message while trying to run TransposonPSI. How did you manage to resolve it? I know it has been a long time since you did this, but your input would be much appreciated.
Thanks for this guide @Juke34,
1) I am a bit confused about what to exclude from the proteome: I made an annotation with MAKER; should I exclude all genes that are more or less close to TEs (like transposase, gag, pol, endonuclease, reverse transcriptase)?
2) I would also like to combine the RepeatModeler sequences with the database provided by RepeatMasker (Dfam 3.1). Do you think the Dfam database also needs to be filtered (with ProtExcluder), or only the RepeatModeler sequences?
1) To clean your repeat library (before using it to mask your genome) you must be sure the protein set you will use is free of TEs.
I am a bit confused and don't get what you are trying to do. You should make your repeat library before masking your genome. MAKER first masks the genome with the libraries you provide before going into the gene annotation step. So everything related to preparing the repeat library has to be done before running anything with MAKER.
2) I don't know how good Dfam is (I have always used RepBase). You could apply the same filtering (step 2) to it. I would use the reviewed proteins from UniProt for it. You should report the result to the community; that could help others know whether it is needed or not.
Ok I see.
I also have the Repbase database. Do you think it's useful to perform the step
2.3) Blast proteome against RepeatModeler library
on it? Since those are curated sequences, it shouldn't be necessary, right?
The problem with Repbase is that the header is not formatted properly (e.g.
>LINE1-11_SBi#
) and then I lose the classification in the .tbl file (they are flagged as 'Unspecified'). Do you have the same issue?

It is really useful for new repeat libraries; you do not need to perform anything for Repbase.
Maybe it is the term proteome that is misleading. It is not the proteome of the species you are trying to annotate; it is more of a protein DB (a fasta file is needed here) that you use. You remove TEs from it (steps 2.1 and 2.2), then you use it (steps 2.3 and 2.4) to remove false positives from your de-novo repeat library, i.e. sequences that have been classified as repeats while they are actually protein-coding genes that are e.g. low complexity and/or highly duplicated in your genome (e.g. MHC genes in human).
Interesting, I never paid attention to it. Could you report it to the RepeatMasker team?
Thank you for your instructions, I have two questions:
However, I have processed this section up to the last part
ProtExcluder.pl
using a transcriptome-derived proteome. Is it acceptable to use such a dataset? Thank you again!
Sounds acceptable but not optimal; indeed the transcriptome does not represent all the genes (depending on life stage, tissues, etc.). Using Swissprot on top of it (transcriptome-derived proteome of your species + Swissprot DB) should give you a slightly better result, I guess.
Yes, you can combine the use of both libraries. I do not remember exactly, but I guess it is just a matter of concatenating both files.
Thank you for your advice!
Thanks so much for the guide!
In the last step you indicate that the output should be repeats.fanoProtFinal. Is this just a general term like *noProtFinal? My output was consensi.fa.classifiednoProtFinal. Did I go wrong somewhere? Or is this the correct final output?
You are right I will update the tutorial.
transposonPSI requires legacy blast, and itself has not really been maintained in years.
That is why I have packaged it in Bioconda, like the other tools (except ProtExcluder). So with Conda you do not need to travel back in time :)
Ah, indeed. That's useful, thank you.
Even using the bioconda environment, on some systems you may need to add a data path for the legacy blastp. In my case (Conda 22 on Ubuntu 20.04) I had to place the file .ncbirc in my home directory, with the following content:
[NCBI]
data=/usr/share/ncbi/data/
Are there any alternatives to ProtExcluder? I suddenly started to have problems with v1.2: The files blastx.out.fnolowm50seqm and blastx.out.fnolowm50seq are both the same size, and both contain only the following error message (although I have hmmer 3.3.2 installed correctly):
Usage: esl-sfetch [options] <sqfile> <name>         (one seq named <name>)
Usage: esl-sfetch [options] -f <sqfile> <namefile>  (all seqs in <namefile>)
Usage: esl-sfetch [options] --index <sqfile>        (index <sqfile>)
To see more help on available options, do esl-sfetch -h
This is repeated 2429 times, i.e. however many sequences I have in the my.fasta file.
At the end of the procedure, I have only 800 out of 2500 interspersed elements left; it seems to me that too much is eliminated.
I encountered a similar problem; have you resolved it?
Update 1: I saw your post here: Alternatives to ProtExcluder in repeat annotation/MAKER? I will give it a try.
Update 2: Following your suggested alternative method to filter out transcript-like repeat sequences, I successfully refined the original 2,167 consensi.fa.classified sequences down to 1,664. This result appears to be more effective than using ProtExcluder.