Tutorial: Create de novo repeat library
13 months ago by
Juke344.9k wrote:

Tutorial for de-novo repeat library construction

The RepeatMasker software ships with many repeat libraries. You can query them using:

queryTaxonomyDatabase.pl -h   
queryRepeatDatabase.pl -h

If there is no repeat library available for your species, you may want to create your own. You will need:

  • A lot of time (the RepeatModeler and transposonPSI steps can run for days, depending on the resources provided)
  • RepeatModeler (installation can be difficult). /!\ Do not use the build1 recipe in bioconda, it does not work properly. With conda, use repeatmodeler-1.0.11 build pl526_2 or later (the previous build is buggy).
  • transposonPSI
  • ProtExcluder
  • blastp, blastx
  • gaas_fasta_removeSeqFromIDlist.pl from GAAS.
  • The fasta genome for which you want to define the repeats

1) De-novo - RepeatModeler:

/!\ RepeatModeler uses RepeatMasker for the classification step at the end. Without a complete RepeatMasker installation you will end up with the file consensi.fa instead of consensi.fa.classified. If you installed RepeatModeler with conda, you will get the error Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!, because this nucleotide library is not included by default. People tend to use RepBase as the database, but it has required a license since last year. So, if you want the classification step to succeed, please add a database; see here for other details.
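Before launching a run that may take days, it can be worth checking that the library named in that error message is actually in place. The following is only a minimal sketch: check_rm_library is a hypothetical helper name, and the default path is simply the conda location quoted in the error message.

```shell
# Sketch: check that the BLAST-formatted RepeatMasker library exists.
# check_rm_library is a hypothetical helper, not part of RepeatMasker.
check_rm_library() {
    # Directory holding the RepeatMasker libraries; defaults to the conda
    # location quoted in the error message (assumes CONDA_PREFIX is set).
    local libdir="${1:-${CONDA_PREFIX:-}/share/RepeatMasker/Libraries}"
    if [ -s "${libdir}/RepeatMasker.lib.nsq" ]; then
        echo "OK: ${libdir}/RepeatMasker.lib.nsq found"
    else
        echo "MISSING: ${libdir}/RepeatMasker.lib.nsq" >&2
        return 1
    fi
}
```

Call it as check_rm_library with no argument inside the activated conda environment, or pass the Libraries directory explicitly if RepeatMasker is installed elsewhere.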

BuildDatabase -name genome -engine ncbi genome.fa
RepeatModeler -database genome -engine ncbi

You can use the -pa option to parallelise the run and speed it up a bit. This is the longest step. When it finishes you should have a file called consensi.fa.classified.
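To see quickly whether the classification actually ran, you can look inside the RM_* working directory that RepeatModeler creates. This is a rough sketch rather than an official procedure: find_classified is a hypothetical helper, and it assumes the run was launched from the directory you pass in (the current directory by default).

```shell
# Sketch: report whether the newest RM_* directory holds consensi.fa.classified.
# find_classified is a hypothetical helper name.
find_classified() {
    local workdir="${1:-.}"   # directory in which RepeatModeler was launched
    local latest
    # Newest RM_* directory by modification time (empty if none exists).
    latest=$(ls -1dt "${workdir}"/RM_*/ 2>/dev/null | head -n 1) || true
    if [ -z "${latest}" ]; then
        echo "no RM_* directory found in ${workdir}" >&2
        return 2
    fi
    if [ -s "${latest}consensi.fa.classified" ]; then
        # Classification succeeded: print the path of the classified library.
        echo "${latest}consensi.fa.classified"
    elif [ -s "${latest}consensi.fa" ]; then
        echo "only ${latest}consensi.fa found: classification did not run" >&2
        return 1
    else
        echo "no consensus file yet in ${latest}" >&2
        return 1
    fi
}
```

If it reports that only consensi.fa is present, the RepeatMasker library was most likely missing during the run.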

2) Filtering repeats:

The de-novo identification has a major drawback. Repeats are not always derived from ‘junk’ in the genome, but can also be part of actual protein-coding genes. It is therefore recommended to check the repeats against a comprehensive set of ‘real’ proteins from related organisms. If you are unsure which protein data set to use, simply use the one you were going to use for annotation. We call it <proteins.fa> here.

2.1) Mine (Retro-)Transposon protein Homologies.

transposonPSI.pl <proteins.fa> prot

You should get a <proteins.fa>.TPSI.topHits file as output. From this file you can generate a list of accession numbers of proteins with similarity to transposons.

awk '{if($0 ~ /^[^\/\/.*]/) print $5}' <proteins.fa>.TPSI.topHits | sort -u > accessions.list

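2.2) Remove TEs from the proteome. fasta_removeSeqFromIDlist.pl is from the GAAS repo (it is listed above as gaas_fasta_removeSeqFromIDlist.pl; use whichever name your GAAS installation provides).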

fasta_removeSeqFromIDlist.pl -f <proteins.fa> -l accessions.list -o proteins.filtered.fa
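A quick sanity check after this step is to compare the number of sequences before and after filtering with the number of accessions in the list. The sketch below uses only grep and sort; count_seqs and check_removal are hypothetical helper names, and the file names in the usage note are the ones used in this tutorial.

```shell
# Sketch: compare FASTA record counts before and after removing the listed IDs.
# count_seqs and check_removal are hypothetical helper names.
count_seqs() {
    # Number of FASTA records = number of header lines starting with '>'.
    grep -c '^>' "$1" || true
}

check_removal() {
    local before after listed
    before=$(count_seqs "$1")                    # original protein set
    after=$(count_seqs "$2")                     # filtered protein set
    listed=$(sort -u "$3" | grep -c . || true)   # distinct accessions to remove
    echo "before=${before} after=${after} removed=$((before - after)) listed=${listed}"
}
```

For example, check_removal proteins.fa proteins.filtered.fa accessions.list. If removed is smaller than listed, some accessions probably did not match any FASTA header exactly.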

2.3) Blast proteome against RepeatModeler library

makeblastdb -in proteins.filtered.fa -dbtype prot
blastx -db proteins.filtered.fa -query consensi.fa.classified -out blastx.out

You can use the -num_threads parameter to speed up the blastx step.

2.4) Remove hits from RepeatModeler library

ProtExcluder.pl blastx.out consensi.fa.classified

The result should be a filtered repeat library called consensi.fa.classifiednoProtFinal. You can rename it or symlink it to the name of your choice, e.g. myrepeatlib.fa.
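To get an overview of what survived the filtering, you can tally the consensus sequences by their classification label. This sketch assumes the headers follow the usual RepeatMasker library convention of >name#class/family; summarise_library and the no_class label are hypothetical names chosen for illustration.

```shell
# Sketch: count library sequences per classification label.
# summarise_library is a hypothetical helper; it assumes headers of the
# form ">name#class/family optional description".
summarise_library() {
    grep '^>' "$1" \
        | awk '{
              # $1 is the header up to the first whitespace; split it on "#".
              n = split($1, parts, "#")
              label = (n > 1 && parts[2] != "") ? parts[2] : "no_class"
              count[label]++
          }
          END { for (l in count) print count[l] "\t" l }' \
        | sort -k2,2
}
```

For example, summarise_library consensi.fa.classifiednoProtFinal prints one line per label, with the count in the first column. Headers without a label after the # are reported as no_class.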

ADD COMMENTlink modified 3 months ago • written 13 months ago by Juke344.9k

Great, thanks! Is the result of this as powerful as MAKER's advanced repeat library preparation? Many of the steps are similar, but this one is shorter.

ADD REPLYlink written 13 months ago by ricardoguerreiro212160

You mean Repeat Library Construction-Advanced? For sure the approach I present here is less advanced, mainly because I don't look for LTR and MITE elements. Their filtering steps are also more advanced. We can say it is a medium-standard approach, between the MAKER basic and the MAKER advanced protocols. If you use RepeatModeler version 2, it can also look for LTRs.

ADD REPLYlink modified 13 months ago • written 13 months ago by Juke344.9k

I ran RepeatModeler -engine ncbi -database Repeats -pa 8 but I cannot find consensi.fa.classified. The output folder contains the following files:

ls RM_56831.ThuMar191408352020/
consensi.fa  families.stk  round-1  round-2  round-3  round-4  round-5  round-6  tmpBlastXResults.out  tmpBlastXResults.out.bxsummary  tmpConsensi.fa  tmpConsensi.fa.masked

What did I miss?

ADD REPLYlink written 10 months ago by Ric330

See step1, I will make it clearer

ADD REPLYlink written 10 months ago by Juke344.9k

Thank you. I did:

   > wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz
   > cp Dfam.hmm ${CONDA_PREFIX}/share/RepeatMasker/Libraries

Where can I find a free RepBase alternative for academics? Or can I somehow create it from Dfam?

Thank you in advance,

ADD REPLYlink modified 9 months ago • written 9 months ago by Ric330

Actually, Dfam.hmm is needed only if you use nhmmer as the search engine in RepeatMasker. Otherwise the file to use is Dfam.embl. You then need buildRMLibFromEMBL.pl to create the FASTA file from it, and then you run makeblastdb -dbtype nucl ...

ADD REPLYlink written 9 months ago by Juke344.9k

Are the below steps correct?

> wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
> gunzip Dfam.embl.gz 
> mv Dfam.embl ${CONDA_PREFIX}/share/RepeatMasker/Libraries/
> cd ${CONDA_PREFIX}/share/RepeatMasker/Libraries/
> ../util/buildRMLibFromEMBL.pl Dfam.embl > Dfam.lib
> makeblastdb -dbtype nucl -in Dfam.lib

How do I let RepeatModeler know to use Dfam?

Thank you in advance

ADD REPLYlink written 8 months ago by Ric330

The lib must be called RepeatMasker.lib

../util/buildRMLibFromEMBL.pl Dfam.embl > RepeatMasker.lib
makeblastdb -dbtype nucl -in RepeatMasker.lib
ADD REPLYlink modified 8 months ago • written 8 months ago by Juke344.9k

Thanks @Juke-34 for the detailed steps. I ran RepeatModeler on my genome, but because I am using a local computer I stopped the run at round 5 and then used RepeatClassifier to generate consensi.fa.classified. Now I have tried to run transposonPSI.pl on the protein.fasta file obtained from UniProt, but I keep getting this error message: Error, formatdb -i transposonPSI.9003.../contig_1_pilon_pilon_pilon/contig_1_pilon_pilon_pilon.seq -p F (ret -1) at .../TransposonPSI_08222010/transposonPSI.pl line 115, <$filehandle> line 1. I don't know what this error message means or how to solve it. Thanks

ADD REPLYlink written 9 months ago by eennadi0

Never seen this error before, sorry

ADD REPLYlink written 9 months ago by Juke344.9k

Thanks for this guide @Juke34,

1) I am a bit confused about what to exclude from the proteome: I made an annotation with MAKER. Should I exclude all genes that are more or less related to TEs (like transposase, gag, pol, endonuclease, reverse transcriptase)?

2) I would also like to combine the RepeatModeler sequences with the database provided by RepeatMasker (Dfam 3.1). Do you think the Dfam database also needs to be filtered (with ProtExcluder), or only the RepeatModeler sequences?

ADD REPLYlink modified 6 months ago • written 7 months ago by Picasa590

1) To clean your repeat library (before using it to mask your genome) you must be sure the protein set you will use is free of TEs.

I made an annotation with maker ...

I am a bit confused and don't get what you are trying to do. You should build your repeat library before masking your genome. MAKER first masks the genome with the libraries you provide before going into the gene annotation step. So everything related to preparing the repeat library has to be done before running anything with MAKER.

2) I don't know how good Dfam is (I have always used RepBase). You could apply the same filtering (step 2) to it. I would use the reviewed proteins from UniProt for it. You should report the result to the community; that could help others to know whether it is needed or not.

ADD REPLYlink modified 6 months ago • written 6 months ago by Juke344.9k

Ok I see.

I have also the Repbase database. Do you think it's useful to perform the step

2.3) Blast proteome against RepeatModeler library

on it? Because it is a curated database, it should not be necessary, right?

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

ADD REPLYlink modified 6 months ago • written 6 months ago by Picasa590

It is really useful for new repeat libraries. You do not need to perform anything for Repbase.

2.3) Blast proteome against RepeatModeler library

Maybe it is the term proteome that is misleading. It is not the proteome of the species you are trying to annotate. It is rather a protein database (a FASTA file is needed here). You remove the TEs from it (steps 2.1 and 2.2), then you use it (steps 2.3 and 2.4) to remove the false positives from your de-novo repeat library, i.e. sequences that have been classified as repeats while they are actually protein-coding genes that are, for example, low complexity and/or highly duplicated in your genome (e.g. MHC genes in human).

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

Interesting, I never paid attention to it. Could you report it to the RepeatMasker team?

ADD REPLYlink written 6 months ago by Juke344.9k

Thanks so much for the guide!

In the last step you indicate that the output should be repeats.fanoProtFinal. Is this just a general term like *noProtFinal? My output was consensi.fa.classifiednoProtFinal. Did I go wrong somewhere? Or is this the correct final output?

ADD REPLYlink written 3 months ago by landas0

You are right I will update the tutorial.

ADD REPLYlink written 3 months ago by Juke344.9k