Tutorial:Create de novo repeat library
0
18
Entering edit mode
4.9 years ago
Juke34 8.9k

Tutorial for de-novo repeat library construction

The RepeatMasker software includes a lot repeat library. You can query them using:

queryTaxonomyDatabase.pl -h   
queryRepeatDatabase.pl -h

If there is no repeat library available for your species, you may want to create your own.
Prerequisite (Excepted ProtExcluder they are all available via bioconda)

  • Lot of time (the repeatmodeler and transposonPSI steps can run for days depending provided resources)
  • RepeatModeler (installation might be difficult **/!\ do not use the current recipe (build1) in bioconda, it doesn't work properly** ) Using Conda use repeatmodeler-1.0.11 build pl526_2 or superior (Previous build is bugged).
  • transposonPSI
  • ProtExcluder
  • blastp, blastx
  • gaas_fasta_removeSeqFromIDlist.pl from GAAS.
  • The fasta genome for which you want to define the repeats

1) De-novo - RepeatModeler:

/!\ RepeatModeler uses RepeatMasker for classification steps at the end. Without a complete installation of RepeatMasker you will end up with the file consensi.fa instead of consensi.fa.classified. So, if you installed RepeatModeler by conda you will get this error Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!. Indeed this nucleotide library is not included by default. People tend to use RepBase as DB but it requires a license since last year. SO, if you wish to perform this classification step successfully please add a DB. see here for other details.

BuildDatabase –name genome -engine ncbi genome.fa
RepeatModeler –database genome -engine ncbi

You can use the option –pa to parallelise and speed it up a bit. This step is the longest step. At the end of this step you should have a file called consensi.fa.classified.

2) Filtering repeats:

The de-novo identification has a major drawback. Repeats are not always derived from ‘junk’ in the genome, but can also be part of actual protein-coding genes. It is therefore recommended to check the repeats against a comprehensive set of ‘real’ proteins from related organisms. If you are unsure what protein data set to run this against, simple use the one you were going to use for annotation. We call it <proteins.fa> here.

2.1)Mine (Retro-)Transposon protein Homologies.

transposonPSI.pl <proteins.fa> prot

You should get <proteins.fa>.TPSI.topHits file as output. From the resulting list, a collection of accession numbers with similarities to transposons can be generated.

awk '{if($0 ~ /^[^\/\/.*]/) print $5}' <proteins.fa>.TPSI.topHits | sort -u > accessions.list

2.2) Remove TEs from proteome. fasta_removeSeqFromIDlist.pl is from the GAAS repo.

fasta_removeSeqFromIDlist.pl -f <proteins.fa> -l accessions.list -o proteins.filtered.fa

2.3) Blast proteome against RepeatModeler library

makeblastdb –in proteins.filtered.fa –dbtype prot
blastx –db proteins.filtered.fa –query consensi.fa.classified –out blastx.out

you can use the –num_threads parameter to speed up the blasts step.

2.4) Remove hits from RepeatModeler library
Remark: The ProtExclider Manual says The package was developed using blastx output from ncbi-blast-2.2.28+ but is compatible with ncbi-blast-2.4.0+. I tried with blast 2.9.0+ and I got Illegal division by zero error. I tried blast 2.7.1+ and it works.

ProtExcluder.pl blastx.out consensi.fa.classified

The result should be a filtered repeat library called consensi.fa.classifiednoProtFinal. You can rename it or symlink it to the name of your choice e.g myrepeatlib.fa.

de-novo repeat annotation • 11k views
ADD COMMENT
0
Entering edit mode

Great, thanks! Is the result of this as powerful as makers advanced repeat library preparation? Many of the steps are similar, but this is smaller.

ADD REPLY
1
Entering edit mode

You mean Repeat Library Construction-Advanced? For sure the approach I present here is less advanced, mainly because I don't look for LTR and MITEs elements. Their filtering steps are also more advanced. We can say it is a medium standard approach between the MAKER basic and the MAKER advanced. If you use RepeatModeler version 2 it can also look at LTR.

ADD REPLY
0
Entering edit mode

I ran RepeatModeler -engine ncbi -database Repeats -pa 8 but I can not find consensi.fa.classified. Furthermore, the output folder contain the following files:

ls RM_56831.ThuMar191408352020/
consensi.fa  families.stk  round-1  round-2  round-3  round-4  round-5  round-6  tmpBlastXResults.out  tmpBlastXResults.out.bxsummary  tmpConsensi.fa  tmpConsensi.fa.masked

What did I miss?

ADD REPLY
0
Entering edit mode

See step1, I will make it clearer

ADD REPLY
0
Entering edit mode

Thank you. I did:

   > wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz
   > cp Dfam.hmm ${CONDA_PREFIX}/share/RepeatMasker/Libraries

Where, can I find a free RepBase alternative for academics? Or can I anyhow create it from Dfam?

Thank you in advance,

ADD REPLY
0
Entering edit mode

actually Dfam.hmm is needed only if you use nhmmer in repeatmasker as search engine. Otherwise it will be the Dfam.embl file to use. Then you need buildRMLibFromEMBL.pl to create the Fasta file from it and then you run makeblastdb -dbtyp nucl ...

ADD REPLY
0
Entering edit mode

Are the below steps correct?

> wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
> gunzip Dfam.embl.gz 
> mv Dfam.embl ${CONDA_PREFIX}/share/RepeatMasker/Libraries/
> cd ${CONDA_PREFIX}/share/RepeatMasker/Libraries/
> ../util/buildRMLibFromEMBL.pl Dfam.embl > Dfam.lib
> makeblastdb -dbtype nucl -in Dfam.lib

How to let RepeatModeler know to use Dfam?

Thank you in advance

ADD REPLY
0
Entering edit mode

The lib must be called RepeatMasker.lib

../util/buildRMLibFromEMBL.pl Dfam.embl > RepeatMasker.lib
makeblastdb -dbtype nucl -in RepeatMasker.lib
ADD REPLY
0
Entering edit mode

Thanks @Juke-34 for the detailed steps, I ran repeatmodeler on my genome but because am using a local computer I stopped the run at round 5 then used repeat classifier to generate the consensi.fa.classified. Now I have tried to use transposonPSI.pl on the protein.fasta file obtained from uniprot but I ket getting this error message Error, formatdb -i transposonPSI.9003.../contig_1_pilon_pilon_pilon/contig_1_pilon_pilon_pilon.seq -p F (ret -1) at .../TransposonPSI_08222010/transposonPSI.pl line 115, <$filehandle> line 1. I don't seem to know what this error message means and how to solve it Thanks

ADD REPLY
0
Entering edit mode

Never seen this error before, sorry

ADD REPLY
0
Entering edit mode

@eennadi, I am having this error message while trying to run TransposonPSI. How do you manage to resolve it? I know it has been a long time you do this, however, your input would be much appreciated.

ADD REPLY
0
Entering edit mode

Thanks for this guide @Juke34,

1) I am a bit confused about what to exclude from the proteome: I made an annotation with maker and should I exclude all genes that are more or less close to TEs ? (like transposase, gag, pol, endonuclease, reverse transcriptase)

2) I would like also to combine the RepeatModeler sequences with the database provided by RepeatMasker (Dfam 3.1). Do you think the Dfam databse need also to be filtered (with ProtExcluder) or only the RepeatModeler sequences ?

ADD REPLY
0
Entering edit mode

1) To clean you repeat library (before using it to mask your genome) you must be sure the protein set you will use is free of TE.

I made an annotation with maker ...

I a bit confuse and don't get what you try to do. You should make your repeat library before masking your genome. MAKER first mask the genome with the libraries you provide before going into the gene annotation step. So everything related to preparing the repeat library has be done before running any thing with MAKER.

2) I don't know how good is DFAM (I always used RepBase). You could apply the same filtering (step 2) to it. I would use the reviewed proteins from Uniprot for it. You should report the result it to the community, that could help others to know if it is needed or not.

ADD REPLY
0
Entering edit mode

Ok I see.

I have also the Repbase database. Do you think it's useful to perform the step

2.3) Blast proteome against RepeatModeler library

on it ? Because they are curated one so it's not necessary no ?

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

ADD REPLY
0
Entering edit mode

It is really useful for new repeat libraries. You do not need to perform anything for Repbase.

2.3) Blast proteome against RepeatModeler library

Maybe it is the term proteome that is misleading. It is not the proteome from the species you try to annotate. It is more a protein DB (fasta file here is needed) you use. You remove TE from it (step 2.1 2.2), then you use it (step 2.3 2.4) to remove from your de-novo Repeat library, the false positive (what have been classified as repeat while they are actualy protein coding genes that are e.g low complexity or/and highly duplicated in your genome e.g MHC genes in human).

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

Interesting, I never paid attention to it. Could you report it to the RepeatMasker team?

ADD REPLY
0
Entering edit mode

Thank you for your instruction, I have two questions:

  1. I notice this message:

Maybe it is the term proteome that is misleading. It is not the proteome from the species you try to annotate.

However, I have processed this section to last part ProtExcluder.pl using transcriptome derived proteome. Is that acceptable to use such dataset?

  1. How can I combine my de novo repeat library with RepBase and input them together to Repeatmasker?

Thank you again!

ADD REPLY
1
Entering edit mode

Sounds acceptable but not optimum, indeed the transcriptome does not represent the whole genes (depending life stage, tissues, etc). Using Swissprot on top of it (transcriptome derived proteome of your species + SwissprotDB) should give you slightly better result I guess.

Yes you can combine the use of both librairies. I do not remember exactly but I guess it is just by concatenating both files.

ADD REPLY
0
Entering edit mode

Thank you for your advice!

ADD REPLY
0
Entering edit mode

Thanks so much for the guide!

In the last step you indicate that the output should be repeats.fanoProtFinal. Is this just a general term like *noProtFinal? My output was consensi.fa.classifiednoProtFinal. Did I go wrong somewhere? Or is this the correct final output?

ADD REPLY
0
Entering edit mode

You are right I will update the tutorial.

ADD REPLY
0
Entering edit mode

transposonPSI requires legacy blast, and itself has not really been maintained in years.

ADD REPLY
1
Entering edit mode

It is why I have packaged it into bioconda, as other tools (excepted ProtExcluder). So with conda you do not need to travel back in time :)

ADD REPLY
0
Entering edit mode

Ah, indeed. That's useful, thank you.

ADD REPLY
0
Entering edit mode

Even using the bioconda environment, on some systems you may need to add a data path for the legacy blastp. In my case (Conda 22 on Ubuntu 20.04) I had to place the file .ncbirc in my home directory, with the following content:

[NCBI] data=/usr/share/ncbi/data/

ADD REPLY
0
Entering edit mode

Are there any alternatives to ProtExcluder? I suddenly started to have problems with v1.2: The files blastx.out.fnolowm50seqm and blastx.out.fnolowm50seq are both the same size, and both contain only the following error message (although I have hmmer 3.3.2 installed correctly):

Usage: esl-sfetch [options] (one seq named ) Usage: esl-sfetch [options] -f (all seqs in ) Usage: esl-sfetch [options] --index (index )

To see more help on available options, do esl-sfetch -h

This is repeated 2429, ie however many sequences I have in the my.fasta file.

At the end of the procedure, I have only 800 out of 2500 interspersed elements left - seems to me that too much is eliminated.

ADD REPLY
0
Entering edit mode

I encountered a similar problem; have you resolved it?

Update1: I saw your post here: Alternatives to ProtExcluder in repeat annotation/MAKER?, I will give it a try.

Update 2: Following your suggested alternative method to filter out transcript-like repeat sequences, I successfully refined the original 2,167 consensi.fa.classified files down to 1,664. This result appears to be more effective than using ProtExcluder.

ADD REPLY

Login before adding your answer.

Traffic: 889 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6