Tutorial:Create de novo repeat library
0
13
Entering edit mode
2.1 years ago
Juke34 ★ 6.5k

Tutorial for de-novo repeat library construction

The RepeatMasker software includes a lot repeat library. You can query them using:

queryTaxonomyDatabase.pl -h
queryRepeatDatabase.pl -h

If there is no repeat library available for your species, you may want to create your own.
Prerequisite

• Lot of time (the repeatmodeler and transposonPSI steps can run for days depending provided resources)
• RepeatModeler (installation might be difficult **/!\ do not use the current recipe (build1) in bioconda, it doesn't work properly** ) Using Conda use repeatmodeler-1.0.11 build pl526_2 or superior (Previous build is bugged).
• transposonPSI
• ProtExcluder
• blastp, blastx
• gaas_fasta_removeSeqFromIDlist.pl from GAAS.
• The fasta genome for which you want to define the repeats

1) De-novo - RepeatModeler:

/!\ RepeatModeler uses RepeatMasker for classification steps at the end. Without a complete installation of RepeatMasker you will end up with the file consensi.fa instead of consensi.fa.classified. So, if you installed RepeatModeler by conda you will get this error Missing ${CONDA_PREFIX}/share/RepeatMasker/Libraries/RepeatMasker.lib.nsq!. Indeed this nucleotide library is not included by default. People tend to use RepBase as DB but it requires a license since last year. SO, if you wish to perform this classification step successfully please add a DB. see here for other details. BuildDatabase –name genome -engine ncbi genome.fa RepeatModeler –database genome -engine ncbi You can use the option –pa to parallelise and speed it up a bit. This step is the longest step. At the end of this step you should have a file called consensi.fa.classified. 2) Filtering repeats: The de-novo identification has a major drawback. Repeats are not always derived from ‘junk’ in the genome, but can also be part of actual protein-coding genes. It is therefore recommended to check the repeats against a comprehensive set of ‘real’ proteins from related organisms. If you are unsure what protein data set to run this against, simple use the one you were going to use for annotation. We call it <proteins.fa> here. 2.1)Mine (Retro-)Transposon protein Homologies. transposonPSI.pl <proteins.fa> prot You should get <proteins.fa>.TPSI.topHits file as output. From the resulting list, a collection of accession numbers with similarities to transposons can be generated. awk '{if($0 ~ /^[^\/\/.*]/) print $5}' <proteins.fa>.TPSI.topHits | sort -u > accessions.list 2.2) Remove TEs from proteome. fasta_removeSeqFromIDlist.pl is from the GAAS repo. fasta_removeSeqFromIDlist.pl -f <proteins.fa> -l accessions.list -o proteins.filtered.fa 2.3) Blast proteome against RepeatModeler library makeblastdb –in proteins.filtered.fa –dbtype prot blastx –db proteins.filtered.fa –query consensi.fa.classified –out blastx.out you can use the –num_threads parameter to speed up the blasts step. 2.4) Remove hits from RepeatModeler library Remark: The ProtExclider Manual says The package was developed using blastx output from ncbi-blast-2.2.28+ but is compatible with ncbi-blast-2.4.0+. I tried with blast 2.9.0+ and I got Illegal division by zero error. I tried blast 2.7.1+ and it works. ProtExcluder.pl blastx.out consensi.fa.classified The result should be a filtered repeat library called consensi.fa.classifiednoProtFinal. You can rename it or symlink it to the name of your choice e.g myrepeatlib.fa. Tutorial repeat annotation de-novo • 4.4k views ADD COMMENT 0 Entering edit mode Great, thanks! Is the result of this as powerful as makers advanced repeat library preparation? Many of the steps are similar, but this is smaller. ADD REPLY 1 Entering edit mode You mean Repeat Library Construction-Advanced? For sure the approach I present here is less advanced, mainly because I don't look for LTR and MITEs elements. Their filtering steps are also more advanced. We can say it is a medium standard approach between the MAKER basic and the MAKER advanced. If you use RepeatModeler version 2 it can also look at LTR. ADD REPLY 0 Entering edit mode I ran RepeatModeler -engine ncbi -database Repeats -pa 8 but I can not find consensi.fa.classified. Furthermore, the output folder contain the following files: ls RM_56831.ThuMar191408352020/ consensi.fa families.stk round-1 round-2 round-3 round-4 round-5 round-6 tmpBlastXResults.out tmpBlastXResults.out.bxsummary tmpConsensi.fa tmpConsensi.fa.masked What did I miss? ADD REPLY 0 Entering edit mode See step1, I will make it clearer ADD REPLY 0 Entering edit mode Thank you. I did: > wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.hmm.gz > cp Dfam.hmm${CONDA_PREFIX}/share/RepeatMasker/Libraries

Where, can I find a free RepBase alternative for academics? Or can I anyhow create it from Dfam?

0
Entering edit mode

actually Dfam.hmm is needed only if you use nhmmer in repeatmasker as search engine. Otherwise it will be the Dfam.embl file to use. Then you need buildRMLibFromEMBL.pl to create the Fasta file from it and then you run makeblastdb -dbtyp nucl ...

0
Entering edit mode

Are the below steps correct?

> wget -c https://www.dfam.org/releases/Dfam_3.1/families/Dfam.embl.gz
> gunzip Dfam.embl.gz
> mv Dfam.embl ${CONDA_PREFIX}/share/RepeatMasker/Libraries/ > cd${CONDA_PREFIX}/share/RepeatMasker/Libraries/
> ../util/buildRMLibFromEMBL.pl Dfam.embl > Dfam.lib
> makeblastdb -dbtype nucl -in Dfam.lib

How to let RepeatModeler know to use Dfam?

0
Entering edit mode

The lib must be called RepeatMasker.lib

0
Entering edit mode

Thanks @Juke-34 for the detailed steps, I ran repeatmodeler on my genome but because am using a local computer I stopped the run at round 5 then used repeat classifier to generate the consensi.fa.classified. Now I have tried to use transposonPSI.pl on the protein.fasta file obtained from uniprot but I ket getting this error message Error, formatdb -i transposonPSI.9003.../contig_1_pilon_pilon_pilon/contig_1_pilon_pilon_pilon.seq -p F (ret -1) at .../TransposonPSI_08222010/transposonPSI.pl line 115, <\$filehandle> line 1. I don't seem to know what this error message means and how to solve it Thanks

0
Entering edit mode

Never seen this error before, sorry

0
Entering edit mode

@eennadi, I am having this error message while trying to run TransposonPSI. How do you manage to resolve it? I know it has been a long time you do this, however, your input would be much appreciated.

0
Entering edit mode

Thanks for this guide @Juke34,

1) I am a bit confused about what to exclude from the proteome: I made an annotation with maker and should I exclude all genes that are more or less close to TEs ? (like transposase, gag, pol, endonuclease, reverse transcriptase)

2) I would like also to combine the RepeatModeler sequences with the database provided by RepeatMasker (Dfam 3.1). Do you think the Dfam databse need also to be filtered (with ProtExcluder) or only the RepeatModeler sequences ?

0
Entering edit mode

1) To clean you repeat library (before using it to mask your genome) you must be sure the protein set you will use is free of TE.

I made an annotation with maker ...

I a bit confuse and don't get what you try to do. You should make your repeat library before masking your genome. MAKER first mask the genome with the libraries you provide before going into the gene annotation step. So everything related to preparing the repeat library has be done before running any thing with MAKER.

2) I don't know how good is DFAM (I always used RepBase). You could apply the same filtering (step 2) to it. I would use the reviewed proteins from Uniprot for it. You should report the result it to the community, that could help others to know if it is needed or not.

0
Entering edit mode

Ok I see.

I have also the Repbase database. Do you think it's useful to perform the step

2.3) Blast proteome against RepeatModeler library

on it ? Because they are curated one so it's not necessary no ?

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

0
Entering edit mode

It is really useful for new repeat libraries. You do not need to perform anything for Repbase.

2.3) Blast proteome against RepeatModeler library

Maybe it is the term proteome that is misleading. It is not the proteome from the species you try to annotate. It is more a protein DB (fasta file here is needed) you use. You remove TE from it (step 2.1 2.2), then you use it (step 2.3 2.4) to remove from your de-novo Repeat library, the false positive (what have been classified as repeat while they are actualy protein coding genes that are e.g low complexity or/and highly duplicated in your genome e.g MHC genes in human).

The problem with Repbase is that the header is not formatted properly (ex: >LINE1-11_SBi# ) and then I lose the classification on the .tbl file (they are flagged as 'Unspecified'): Do you have the same issue ?

Interesting, I never paid attention to it. Could you report it to the RepeatMasker team?

0
Entering edit mode

Thanks so much for the guide!

In the last step you indicate that the output should be repeats.fanoProtFinal. Is this just a general term like *noProtFinal? My output was consensi.fa.classifiednoProtFinal. Did I go wrong somewhere? Or is this the correct final output?

0
Entering edit mode

You are right I will update the tutorial.

Traffic: 2015 users visited in the last hour
FAQ
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.