Question: How do I run RepeatMasker?
1
2.9 years ago, ksi216 (50) wrote:

Hello, I'm new to Unix. I installed RepeatMasker, but I'm clueless as to which commands I need to enter to run it. Thanks.

forum • 5.9k views
modified 2.9 years ago • written 2.9 years ago by ksi216 (50)
2

Why would you want to run RepeatMasker?  I'm asking this honestly, because I have no idea why people do this.  It generally seems like a bad idea.  So, what are you trying to accomplish?

written 2.9 years ago by Brian Bushnell (16k)
8

Repeat identification and masking is usually the first step in genome annotation. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations. Worse still, many transposon open reading frames (ORFs) look like true host genes to gene predictors (e.g. FGENESH, Augustus, GENSCAN, SNAP), causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Good repeat masking is crucial for the accurate annotation of protein-coding genes.
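As a concrete picture of what "masking" does to a sequence, here is a minimal sketch. It assumes the usual convention that a soft-masked FASTA file marks repeats in lowercase and a hard-masked one replaces them with N. The record is a made-up toy sequence, not RepeatMasker output, and the conversion uses plain sed rather than RepeatMasker itself.

```sh
# Toy soft-masked record: lowercase = repeat (made-up sequence for illustration).
soft=$(printf '>seq1\nACGTacgtacgtTTGA\n')

# Hard-mask: on every line that is NOT a header, replace lowercase bases with N.
hard=$(printf '%s\n' "$soft" | sed '/^>/!s/[acgtn]/N/g')

printf '%s\n' "$hard"
```

Soft masking keeps the underlying bases available to downstream tools, while hard masking hides them completely; which one an annotation pipeline expects depends on the pipeline.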

written 2.9 years ago by a.zielezinski (8.5k)
2

Thanks for the explanation.  That's roughly the same as what the JGI annotation group tells me.  But I still don't understand it.  What do you mean by "repeats can seed millions of spurious BLAST alignments"?  In my opinion...  that actually means, there are millions of legitimate alignments that you wish to ignore, because that would be convenient for your publication.  How do you decide an alignment is something you want to ignore?

As far as I can tell, repeat-masking is something people do so that current inadequate software produces sort-of-reasonable output. At JGI, we transitioned from Illumina fungal assemblies to PacBio assemblies.  The PacBio assemblies are vastly more accurate because they can correctly resolve long repeats.  Initially, the fungal annotation group hated these new PacBio assemblies, because they contain repeats, and broke their current software.  But now, they are adjusting, because they finally understand that the PacBio assemblies are actually the truth (or, at least, closer to the truth) compared to assemblies based on short Illumina reads.

I believe that masking is very useful for conservative contaminant removal, to ensure that there is no possibility of false-positive contaminant identification.  But running RepeatMasker is asinine.  It sounds like people want to run it to speed up their BLAST searches, or use it to filter out legitimate hits that are inconvenient.

If you are a legitimate researcher, you need to examine all hits.  If you publish a paper saying "The top hit was X, therefore X has the greatest effect on Y", great!  But if you state that based on mapping to repeat-masked genomes, then your results may be valid, or they may not be valid.  It depends on things outside of your control.

Personally, I think RepeatMasker is a piece of crap.  Normally, when I feel this way, I write a superior alternative.  But in this case I feel that RepeatMasker is a detriment to humanity and should be extinguished.  It would certainly be nice if genomes contained no repeats. But, they do; repeats are important and need to be dealt with, rather than ignored or masked.  There are a lot of people who annotate genomes, and obviously, it would be easier for them if all genomes were repeat-free, and had no transposons, etc. But that's not the real world.  In the real world, people have to annotate actual assemblies, that contain actual repeats.  It's nice to live in an imaginary world of Illumina assemblies that have a maximum read length of 300bp.  But the modern world has PacBio 30kbp sequences, and can correctly assemble organisms containing very long repeats.

So - in my opinion, RepeatMasker is a great tool for people with no bioinformatics knowledge, who want to publish massive amounts of crap, and could not care less about directing future scientists.  If you actually care about the real world, please use real unmasked data.

written 2.9 years ago by Brian Bushnell (16k)

You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place, then you develop yet another undocumented, untested tool that you alone deem as being "superior" while in reality, it is of no actual use to biology. I'm not making a personal statement, so don't be offended. This is my experience in working at numerous institutes. A lot of people say things like this and write tools that are faster, but the rationale behind the approach is complete nonsense. I really worry about this issue because biologists often don't think about the tools they are using.

In most plants, the largest ORFs are from transposons, which may carry their own internal promoters and have numerous coding domains. I've spent my entire academic career studying transposons and I can tell you that it is very difficult to distinguish host genes from transposons, especially in non-model systems. If you write a tool that is superior for this purpose, I'll be the first to use it. I'll add that I don't use RM for repeat identification, but the masking approach is sound.

edit: Think about it this way: RM is probably the most ubiquitous tool in bioinformatics behind BLAST. Is that because everyone in biology doesn't have an idea what they are doing, or could it be that you don't understand? I'm not trying to be argumentative, and I agree with you that RM could be better, but the approach is obviously robust and supported by decades of experimentation.

modified 2.9 years ago • written 2.9 years ago by SES (8.1k)
1
"You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place,"

That is absolutely correct, which is why I clearly stated that I don't understand the point of repeat-masking.  And I agree, I don't understand the biology behind it, either.  Is there a good reason for repeat-masking?  Maybe!  But I have yet to hear it, and it certainly has not been described in this thread.

I have talked to people who like to do masking prior to annotation, and they were unable to provide an informative description of what they want to mask, or why they want to mask it.  Why is that?  Basically, they use some broken software that gives incorrect results when it's run on a good assembly.

If you want correct answers, the solution is not to use random pieces of ancient software that mask a huge portion of your reference...  but, rather, to map to everything, and see what your sequences map to best.

It is not prudent to unquestioningly use some protocol or software just because lots of other people use it.

modified 2.9 years ago • written 2.9 years ago by Brian Bushnell (16k)

Consider the first genome papers that came out around 15 years ago, which were Arabidopsis and human. Both predicted about 100k genes in each species. This was a gross overestimate due to the lack of computational tools to identify transposons and to mask them prior to gene prediction. There are many papers on this subject. Yes, long reads will help resolve repeats, but they will not solve the problem, nor will they help with the current assemblies we have to study now. I'm committed to working on this problem but it is very challenging.

The major complication in annotation is that many TEs insert into genes, and in fact, all human genes have Alu insertions. It is very difficult to identify these events, and because TEs make up the majority of DNA on the planet, that makes masking a very important task. Thus, not surprisingly, RM is an important tool. As I said before, I'm not a huge advocate of this specific software, but it is maintained by a team of developers, has great documentation and it works on every OS. To say it is ancient and is 'crap' is off-base and makes you sound kind of ridiculous. Repeat identification, and masking, is far from a solved problem and being disparaging about current approaches doesn't really help. I'd be happy to discuss approaches for better tools, or explain the limitations of the current tools, because that is what I work on.

modified 2.9 years ago • written 2.9 years ago by SES (8.1k)
"all human genes have Alu insertions"

Oh, come on.  I've worked with human genetics, and that's not true.  Unless you mean "there exists a human somewhere with this mutation in a specific gene", which is irrelevant to the human genome.

Even if it were true, gene annotation software should simply deal with repeated elements, rather than requiring them to be masked.

modified 2.9 years ago • written 2.9 years ago by Brian Bushnell (16k)

Please do a basic web search of the topics mentioned above.

modified 2.9 years ago • written 2.9 years ago by SES (8.1k)

No matter how pervasive repeats may be, that does not excuse masking them prior to annotation.  If an annotation program cannot handle repeated sequences, then the program needs to be improved.

modified 2.9 years ago • written 2.9 years ago by Brian Bushnell (16k)

Are you referring to a tool that could predict TEs and genes? That would be great in theory, but the complexity of this task is enormous.

It is a bit puzzling why you are clinging to this idea about repeat masking. You asked a legitimate question about masking, and you got very clear answers. But it sounds like you have convinced yourself of this opinion on the subject and have chosen not to believe anyone. My suggestion would be to try to keep an open mind. There are decades of research to support these approaches, which is what I was trying to express in my last comment (though it was a bit terse and could have been stated better). If you can find any evidence to support your view, that would be justification in my mind for discussing alternative approaches. I'd be happy to discuss this further if there is some tangible reason you can provide for not masking, but otherwise this would not be productive. Cheers.

written 2.9 years ago by SES (8.1k)

I thought I would provide some links for the sake of discussion and try to explain the issue better, since this is what I study. Here is a good paper on the subject: "Consistent over-estimation of gene number in complex plant genomes". The section on repeat masking in the Maker documentation also describes some of the reasons mentioned above about the need to mask genomes prior to gene annotation. In addition to being transcribed in many species, TEs have many hallmark features of host genes, they may contain gene fragments, and they insert into genes and other TEs. This creates a complex landscape in the genome, which is far from being random but presents enormous computational and biological challenges. The main issue with gene annotation is not "repeat" sequences in a mathematical sense. The issue is with biological features that appear unique and contain protein and transcriptome support, ORFs, promoters, etc. The result is that gene number is going to be over-estimated by a long shot if these factors are not taken into account.

modified 2.9 years ago • written 2.9 years ago by SES (8.1k)
5
2.9 years ago, a.zielezinski (8.5k) wrote:

Running RepeatMasker is pretty straightforward:

RepeatMasker --species arabidopsis yoursequence.fasta

To see a full list of options run: RepeatMasker -h

NAME
    RepeatMasker - Mask repetitive DNA

SYNOPSIS
      RepeatMasker [-options] <seqfiles(s) in fasta format>

DESCRIPTION
    The options are:

    -h(elp)
        Detailed help

    Default settings are for masking all type of repeats in a primate
    sequence.

    -e(ngine) [crossmatch|wublast|abblast|ncbi|hmmer|decypher]
        Use an alternate search engine to the default.

    -pa(rallel) [number]
        The number of processors to use in parallel (only works for batch
        files or sequences over 50 kb)

    -s  Slow search; 0-5% more sensitive, 2-3 times slower than default

    -q  Quick search; 5-10% less sensitive, 2-5 times faster than default

    -qq Rush job; about 10% less sensitive, 4->10 times faster than default
        (quick searches are fine under most circumstances) repeat options

    -nolow /-low
        Does not mask low_complexity DNA or simple repeats

    -noint /-int
        Only masks low complex/simple repeats (no interspersed repeats)

    -norna
        Does not mask small RNA (pseudo) genes

    -alu
        Only masks Alus (and 7SLRNA, SVA and LTR5)(only for primate DNA)

    -div [number]
        Masks only those repeats < x percent diverged from consensus seq

    -lib [filename]
        Allows use of a custom library (e.g. from another species)

    -cutoff [number]
        Sets cutoff score for masking repeats when using -lib (default 225)

    -species <query species>
        Specify the species or clade of the input sequence. The species name
        must be a valid NCBI Taxonomy Database species name and be contained
        in the RepeatMasker repeat database. Some examples are:

          -species human
          -species mouse
          -species rattus
          -species "ciona savignyi"
          -species arabidopsis

        Other commonly used species:

        mammal, carnivore, rodentia, rat, cow, pig, cat, dog, chicken, fugu,
        danio, "ciona intestinalis" drosophila, anopheles, elegans,
        diatoaea, artiodactyl, arabidopsis, rice, wheat, and maize

    Contamination options

    -is_only
        Only clips E coli insertion elements out of fasta and .qual files

    -is_clip
        Clips IS elements before analysis (default: IS only reported)

    -no_is
        Skips bacterial insertion element check

    Running options

    -gc [number]
        Use matrices calculated for 'number' percentage background GC level

    -gccalc
        RepeatMasker calculates the GC content even for batch files/small
        seqs

    -frag [number]
        Maximum sequence length masked without fragmenting (default 60000,
        300000 for DeCypher)

    -nocut
        Skips the steps in which repeats are excised

    -noisy
        Prints search engine progress report to screen (defaults to .stderr
        file)

    -nopost
        Do not postprocess the results of the run ( i.e. call ProcessRepeats
        ). NOTE: This options should only be used when ProcessRepeats will
        be run manually on the results.

    output options

    -dir [directory name]
        Writes output to this directory (default is query file directory,
        "-dir ." will write to current directory).

    -a(lignments)
        Writes alignments in .align output file

    -inv
        Alignments are presented in the orientation of the repeat (with
        option -a)

    -lcambig
        Outputs ambiguous DNA transposon fragments using a lower case name.
        All other repeats are listed in upper case. Ambiguous fragments
        match multiple repeat elements and can only be called based on
        flanking repeat information.

    -small
        Returns complete .masked sequence in lower case

    -xsmall
        Returns repetitive regions in lowercase (rest capitals) rather than
        masked

    -x  Returns repetitive regions masked with Xs rather than Ns

    -poly
        Reports simple repeats that may be polymorphic (in file.poly)

    -source
        Includes for each annotation the HSP "evidence". Currently this
        option is only available with the "-html" output format listed
        below.

    -html
        Creates an additional output file in xhtml format.

    -ace
        Creates an additional output file in ACeDB format

    -gff
        Creates an additional Gene Feature Finding format output

    -u  Creates an additional annotation file not processed by
        ProcessRepeats

    -xm Creates an additional output file in cross_match format (for
        parsing)

    -fixed
        Creates an (old style) annotation file with fixed width columns

    -no_id
        Leaves out final column with unique ID for each element (was
        default)

    -e(xcln)
        Calculates repeat densities (in .tbl) excluding runs of >=20 N/Xs in
        the query

SEE ALSO
        Crossmatch, ProcessRepeats

COPYRIGHT
    Copyright 2007-2012 Arian Smit, Institute for Systems Biology

AUTHORS
    Arian Smit <asmit@systemsbiology.org>

    Robert Hubley <rhubley@systemsbiology.org>
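As a rough illustration of how a few of the options listed above combine, the sketch below assembles one command line and prints it without running anything. The file name, the output directory `rm_out`, and the choice of 4 parallel jobs are placeholders, not recommendations; only flags that appear in the help text are used.

```sh
# Placeholder input and output locations; replace with your own.
genome="yoursequence.fasta"
outdir="rm_out"

# Flags as described in the help text:
#   -species  repeat library to search with
#   -pa       number of processors to use in parallel
#   -xsmall   return repeats in lowercase instead of masking them with N
#   -gff      write an additional GFF annotation file
#   -dir      directory that receives the output files
set -- RepeatMasker -species arabidopsis -pa 4 -xsmall -gff -dir "$outdir" "$genome"

# Dry run: show the command so it can be checked before it is executed.
cmdline="$*"
echo "Would run: $cmdline"

# To actually run it, uncomment the next line:
# mkdir -p "$outdir" && "$@"
```

Printing first is only a convenience for beginners; once the command looks right, the last line runs exactly what was printed.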
written 2.9 years ago by a.zielezinski (8.5k)

It says: RepeatMasker : Command not found

written 2.9 years ago by ksi216 (50)
1

It seems the RepeatMasker directory is not on your PATH. Try the full path instead: /usr/local/RepeatMasker/RepeatMasker -h
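"Command not found" generally means the shell looked through the directories listed in PATH and found no executable with that name. A minimal sketch of checking and fixing this for the current session follows. The install directory here is a stand-in (a temporary directory holding a dummy script) so the example can run anywhere; on a real system it would be wherever RepeatMasker was unpacked.

```sh
# Stand-in for the real install directory, so the sketch runs anywhere.
# In practice, set rm_dir to wherever RepeatMasker was unpacked.
rm_dir=$(mktemp -d)
printf '#!/bin/sh\necho "dummy RepeatMasker"\n' > "$rm_dir/RepeatMasker"
chmod +x "$rm_dir/RepeatMasker"

# Add the directory to PATH for the current shell session only.
PATH="$rm_dir:$PATH"
export PATH

# command -v prints the full path the shell will use, or nothing if not found.
found=$(command -v RepeatMasker)
echo "RepeatMasker resolves to: $found"
```

This syntax is for sh and bash. The capitalised "Command not found" wording suggests the shell may be csh or tcsh, where the equivalent would be `setenv PATH /path/to/RepeatMasker:${PATH}`. To make the change permanent, the same line would go in the shell's startup file.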

written 2.9 years ago by a.zielezinski (8.5k)

Now it says "No such file or directory". Did I install it wrong?

written 2.9 years ago by ksi216 (50)

http://postimg.org/image/us5q6444t/

written 2.9 years ago by ksi216 (50)
2

Okay, stay in ~/BI7533/RepeatMasker and run it like this: ./RepeatMasker -h
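When it is not clear where the program ended up, `find` can locate it. The sketch below searches a throwaway directory tree that mimics the layout mentioned here; the tree is an assumption made so the example is runnable, and on a real system the search would start from "$HOME" instead.

```sh
# Throwaway directory tree standing in for a home directory (hypothetical layout).
home_like=$(mktemp -d)
mkdir -p "$home_like/BI7533/RepeatMasker"
touch "$home_like/BI7533/RepeatMasker/RepeatMasker"

# Search for a regular file named exactly RepeatMasker.
# On a real system, replace "$home_like" with "$HOME".
located=$(find "$home_like" -type f -name RepeatMasker)
echo "$located"

# The directory that holds the script is where ./RepeatMasker can be run from.
script_dir=$(dirname "$located")
echo "cd into: $script_dir"
```

The `-type f` test skips the directory that shares the program's name, so only the script itself is reported.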

written 2.9 years ago by a.zielezinski (8.5k)
0
2.9 years ago, ksi216 (50) wrote:

Thanks, I got it to work.

written 2.9 years ago by ksi216 (50)