Question

how do I run repeat masker

4

8.4 years ago

ksi216 ▴ 80

Hello, I'm new to Unix and I installed RepeatMasker, but I'm clueless as to which commands I need to enter to run it. Thanks

repeatmasker • 21k views

ADD COMMENT • link updated 3 months ago by Andrzej Zielezinski 11k • written 8.4 years ago by ksi216 ▴ 80

2

Why would you want to run RepeatMasker? I'm asking this honestly, because I have no idea why people do this. It generally seems like a bad idea. So, what are you trying to accomplish?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k

8

Repeat identification and masking is usually the first step in genome annotation. Left unmasked, repeats can seed millions of spurious BLAST alignments, producing false evidence for gene annotations. Worse still, many transposon open reading frames (ORFs) look like true host genes to gene predictors (e.g. FGENESH, Augustus, GENSCAN, SNAP), causing portions of transposon ORFs to be added as additional exons to gene predictions, completely corrupting the final gene annotations. Good repeat masking is crucial for the accurate annotation of protein-coding genes.

ADD REPLY • link 8.4 years ago by Andrzej Zielezinski 11k
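
For readers new to the term, here is a minimal sketch of what "masking" does to a sequence. It is not RepeatMasker's implementation; the function name `mask_sequence` and the repeat coordinates are hypothetical, chosen only for illustration. Soft masking lowercases the repeat bases so that downstream tools can choose to skip them; hard masking replaces them with N.

```python
def mask_sequence(seq, intervals, hard=False):
    """Mask 0-based, half-open (start, end) intervals in seq.

    Soft masking lowercases the masked bases; hard masking replaces them
    with N. The input is uppercased first, so any existing soft masking
    is discarded (an assumption made to keep the sketch simple).
    """
    chars = list(seq.upper())
    for start, end in intervals:
        # Clamp the interval to the sequence bounds before masking.
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = "N" if hard else chars[i].lower()
    return "".join(chars)


if __name__ == "__main__":
    seq = "ATGCGTACGTTAGC"
    repeats = [(3, 7)]  # hypothetical repeat coordinates
    print(mask_sequence(seq, repeats))             # soft-masked
    print(mask_sequence(seq, repeats, hard=True))  # hard-masked
```

In a real workflow the intervals would come from a repeat finder rather than being written by hand.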

2

Thanks for the explanation. That's roughly the same as what the JGI annotation group tells me. But I still don't understand it. What do you mean by "repeats can seed millions of spurious BLAST alignments"? In my opinion... that actually means, there are millions of legitimate alignments that you wish to ignore, because that would be convenient for your publication. How do you decide an alignment is something you want to ignore?

As far as I can tell, repeat-masking is something people do so that current inadequate software produces sort-of-reasonable output. At JGI, we transitioned from Illumina fungal assemblies to PacBio assemblies. The PacBio assemblies are vastly more accurate because they can correctly resolve long repeats. Initially, the fungal annotation group hated these new PacBio assemblies, because they contain repeats, and broke their current software. But now, they are adjusting, because they finally understand that the PacBio assemblies are actually the truth (or, at least, closer to the truth) compared to assemblies based on short Illumina reads.

I believe that masking is very useful for conservative contaminant removal, to ensure that there is no possibility of false-positive contaminant identification. But running RepeatMasker is asinine. It sounds like people want to run it to speed up their BLAST searches, or use it to filter out legitimate hits that are inconvenient.

If you are a legitimate researcher, you need to examine all hits. If you publish a paper saying "The top hit was X, therefore X has the greatest effect on Y", great! But, if you state that, based on mapping to repeat-masked genomes, then your results may be valid, and they may not be valid. It depends on things outside of your control.

Personally, I think RepeatMasker is a piece of crap. Normally, when I feel this way, I write a superior alternative. But in this case I feel that RepeatMasker is a detriment to humanity and should be extinguished. It would certainly be nice if genomes contained no repeats. But, they do; repeats are important and need to be dealt with, rather than ignored or masked. There are a lot of people who annotate genomes, and obviously, it would be easier for them if all genomes were repeat-free, and had no transposons, etc. But that's not the real world. In the real world, people have to annotate actual assemblies, that contain actual repeats. It's nice to live in an imaginary world of Illumina assemblies that have a maximum read length of 300bp. But the modern world has PacBio 30kbp sequences, and can correctly assemble organisms containing very long repeats.

So - in my opinion, RepeatMasker is a great tool for people with no bioinformatics knowledge, who want to publish massive amounts of crap, and could not care less about directing future scientists. If you actually care about the real world, please use real unmasked data.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k

1

You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place, then you develop yet another undocumented, untested tool that you alone deem as being "superior" while in reality, it is of no actual use to biology. I'm not making a personal statement, so don't be offended. This is my experience in working at numerous institutes. A lot of people say things like this and write tools that are faster, but the rationale behind the approach is complete nonsense. I really worry about this issue because biologists often don't think about the tools they are using.

In most plants, the largest ORFs are from transposons, which may carry their own internal promoters and have numerous coding domains. I've spent my entire academic career studying transposons and I can tell you that it is very difficult to distinguish host genes from transposons, especially in non-model systems. If you write a tool that is superior for this purpose, I'll be the first to use it. I'll add that I don't use RM for repeat identification, but the masking approach is sound.

edit: Think about it this way: RM is probably the most ubiquitous tool in bioinformatics behind BLAST. Is that because nobody in biology has any idea what they are doing, or could it be that you don't understand? I'm not trying to be argumentative, and I agree with you that RM could be better, but the approach is obviously robust and supported by decades of experimentation.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.4 years ago by SES 8.6k

1

"You don't seem to understand the comment you responded to at all or the point of repeat-masking for gene annotation. The real problem with this type of viewpoint is that you don't understand the biology in the first place,"

That is absolutely correct, which is why I clearly stated that I don't understand the point of repeat-masking. And I agree, I don't understand the biology behind it, either. Is there a good reason for repeat-masking? Maybe! But I have yet to hear it, and it certainly has not been described in this thread.

I have talked to people who like to do masking prior to annotation, and they were unable to provide an informative description of what they want to mask, or why they want to mask it. Why is that? Basically, they use some broken software that gives incorrect results when it's run on a good assembly.

If you want correct answers, the solution is not to use random pieces of ancient software that mask a huge portion of your reference... but, rather, to map to everything, and see what your sequences map to best.

It is not prudent to unquestioningly use some protocol or software just because lots of other people use it.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k

0

Consider the first genome papers that came out around 15 years ago, which were Arabidopsis and human. Both predicted about 100k genes in each species. This was a gross overestimate due to the lack of computational tools to identify transposons and to mask them prior to gene prediction. There are many papers on this subject. Yes, long reads will help resolve repeats, but they will not solve the problem, nor will they help the current assemblies we have to study now. I'm committed to working on this problem but it is very challenging.

The major complication in annotation is that many TEs insert into genes, and in fact, all human genes have Alu insertions. It is very difficult to identify these events, and because TEs make up the majority of DNA on the planet, that makes masking a very important task. Thus, RM is not surprisingly an important tool. As I said before, I'm not a huge advocate of this specific software, but it is maintained by a team of developers, has great documentation and it works on every OS. To say it is ancient and is 'crap' is off-base and makes you sound kind of ridiculous. Repeat identification, and masking, is far from a solved problem and being disparaging about current approaches doesn't really help. I'd be happy to discuss approaches for better tools, or explain the limitations of the current tools, because that is what I work on.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.4 years ago by SES 8.6k

0

"all human genes have Alu insertions"

Oh, come on. I've worked with human genetics, and that's not true. Unless you mean "there exists a human somewhere with this mutation in a specific gene", which is irrelevant to the human genome.

Even if it was true, gene annotation software should simply deal with repeated elements, rather than requiring them to be masked.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k

0

Please do a basic web search of the topics mentioned above.

ADD REPLY • link 8.4 years ago by SES 8.6k

0

No matter how pervasive repeats may be, that does not excuse masking them prior to annotation. If an annotation program cannot handle repeated sequences, then the program needs to be improved.

ADD REPLY • link updated 4.6 years ago by Ram 43k • written 8.4 years ago by Brian Bushnell 20k

0

Are you referring to a tool that could predict TEs and genes? That would be great in theory, but the complexity of this task is enormous.

It is a bit puzzling why you are clinging to this idea about repeat masking. You asked a legitimate question about masking, and you got very clear answers. But it sounds like you have convinced yourself of this opinion on the subject and you have chosen not to believe anyone. My suggestion would be to try to keep an open mind. There are decades of research to support these approaches, which is what I was trying to express in my last comment (though it was a bit terse and could have been stated better). If you can find any evidence to support your view, that would be justification in my mind for discussing alternative approaches. I'd be happy to discuss this further if there is some tangible reason you can provide for not masking, but otherwise this would not be productive. Cheers.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.4 years ago by SES 8.6k

0

I thought I would provide some links for the sake of discussion and try to explain the issue better, since this is what I study. Here is a good paper on the subject: Consistent over-estimation of gene number in complex plant genomes. The section on repeat masking in the Maker documentation also describes some of the reasons mentioned above about the need to mask genomes prior to gene annotation. In addition to being transcribed in many species, TEs have many hallmark features of host genes, they may contain gene fragments, and they insert into genes and other TEs. This creates a complex landscape in the genome, which is far from being random but it presents enormous computational and biological challenges. The main issue with gene annotation is not "repeat" sequences in a mathematical sense. The issue is with biological features that appear unique and contain protein and transcriptome support, ORFs, promoters, etc. The result is that gene number is going to be over-estimated by a long shot if these factors are not taken into account.

ADD REPLY • link updated 21 months ago by Ram 43k • written 8.4 years ago by SES 8.6k

0

thanks I got it to work

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by ksi216 ▴ 80

Ram · Answer 1 · 2015-12-18

7

8.4 years ago

Andrzej Zielezinski 11k

Running RepeatMasker is pretty straightforward:

RepeatMasker --species arabidopsis yoursequence.fasta

To see a full list of options run: RepeatMasker -h

ADD COMMENT • link updated 4.6 years ago by Ram 43k • written 8.4 years ago by Andrzej Zielezinski 11k
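
For anyone who wants to script the command above, the following is a minimal sketch that assembles a RepeatMasker command line in Python. The helper name `build_repeatmasker_command` and the file and directory names are hypothetical. The flags used (`-species`, `-pa`, `-xsmall`, `-gff`, `-dir`) are written with a single dash as in RepeatMasker's own help output; check `RepeatMasker -h` for the options your installed version supports.

```python
def build_repeatmasker_command(fasta, species, threads=1, out_dir=None,
                               soft_mask=False, gff=False,
                               executable="RepeatMasker"):
    """Return the RepeatMasker command as a list of arguments.

    Nothing is executed here; the list can be passed to subprocess.run().
    """
    cmd = [executable, "-species", species, "-pa", str(threads)]
    if soft_mask:
        cmd.append("-xsmall")          # lowercase repeats instead of N
    if gff:
        cmd.append("-gff")             # also write a GFF file of the hits
    if out_dir is not None:
        cmd.extend(["-dir", out_dir])  # write output files to this directory
    cmd.append(fasta)                  # input FASTA goes last
    return cmd


if __name__ == "__main__":
    cmd = build_repeatmasker_command(
        "yoursequence.fasta", "arabidopsis",
        threads=4, soft_mask=True, gff=True, out_dir="rm_out",
    )
    # Print the command; to run it, pass the list to
    # subprocess.run(cmd, check=True).
    print(" ".join(cmd))
```

Building the command as a list rather than a single string avoids shell quoting problems when file names contain spaces.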

0

it says RepeatMasker : Command not found

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by ksi216 ▴ 80

1

It seems that RepeatMasker is not on your PATH. Try: /usr/local/RepeatMasker/RepeatMasker -h

ADD REPLY • link 8.4 years ago by Andrzej Zielezinski 11k

0

now it says no such file or directory. did I install it wrong?

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by ksi216 ▴ 80

0

http://postimg.org/image/us5q6444t/

ADD REPLY • link 8.4 years ago by ksi216 ▴ 80

2

Okay, stay in ~/BI7533/RepeatMasker and run it like this: ./RepeatMasker -h

ADD REPLY • link 8.4 years ago by Andrzej Zielezinski 11k
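
The "Command not found" and "no such file or directory" messages in this exchange both come down to where the executable actually lives. A minimal sketch of that lookup, assuming Python is available: search PATH first, then fall back to a list of candidate directories. The helper name `find_executable` is hypothetical, and the two directories are simply the ones mentioned in this thread.

```python
import os
import shutil


def find_executable(name="RepeatMasker", extra_dirs=()):
    """Return the full path to `name`, or None if it cannot be found.

    PATH is searched first (this is what the shell does when you type a
    bare command name); the extra directories are checked afterwards.
    """
    found = shutil.which(name)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(os.path.expanduser(directory), name)
        # The file must exist and be marked executable to be usable.
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return os.path.abspath(candidate)
    return None


if __name__ == "__main__":
    path = find_executable(
        extra_dirs=["~/BI7533/RepeatMasker", "/usr/local/RepeatMasker"]
    )
    print(path or "RepeatMasker not found; check where it was unpacked")
```

If the function returns a path, that full path can be used in place of the bare command name, or its directory can be added to PATH.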

0

Hello and thanks for the nice software (I assume you are the developer of it?).

Can I run RepeatMasker on an already masked FASTA sequence file? It was, presumably, masked by RepeatMasker before by the ENSEMBL people. I only want to do this because I want to obtain a gff/gtf file of these masked sequences, which would normally have been produced by the ENSEMBL people when they ran RepeatMasker on the toplevel genome assembly, but sadly they don't provide it on their FTP servers.

Or do I have to re-do it from the very beginning: from the vanilla toplevel genome assembly??

ADD REPLY • link 3 months ago by e.r.zakiev ▴ 200

0

Hi! I am not the developer of RepeatMasker. Unfortunately, ENSEMBL does not provide information on the masked sequences. To get it you would need to re-run RepeatMasker on the top-level assembly genome.

ADD REPLY • link 3 months ago by Andrzej Zielezinski 11k
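
If only the coordinates of the masked regions are needed, they can be read directly from a masked FASTA file; a minimal sketch follows. This does not recover repeat names or classes, which is why re-running RepeatMasker is needed for a full annotation. The function names and the `masked_fasta` source label are hypothetical. The sketch assumes soft masking means lowercase bases and hard masking means runs of N; note that N runs can also be assembly gaps, and that lowercase regions in a downloaded genome may come from more than one masking tool.

```python
import re


def read_fasta(lines):
    """Yield (name, sequence) pairs from an iterable of FASTA lines."""
    name, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            header = line[1:].split()
            # Keep only the first word of the header as the sequence name.
            name, chunks = (header[0] if header else ""), []
        else:
            chunks.append(line)
    if name is not None:
        yield name, "".join(chunks)


def masked_intervals(seq, hard=False):
    """Return 1-based, inclusive (start, end) runs of masked bases."""
    pattern = r"[Nn]+" if hard else r"[a-z]+"
    return [(m.start() + 1, m.end()) for m in re.finditer(pattern, seq)]


def to_gff_lines(name, intervals, source="masked_fasta"):
    """Format intervals as nine-column, GFF3-style lines."""
    return [
        "\t".join([name, source, "repeat_region",
                   str(start), str(end), ".", ".", ".", "."])
        for start, end in intervals
    ]


if __name__ == "__main__":
    # Tiny inline example instead of a real genome file.
    demo = [">chr1 example", "ACGTacgt", "TTnnAA"]
    for seq_name, seq in read_fasta(demo):
        for gff_line in to_gff_lines(seq_name, masked_intervals(seq)):
            print(gff_line)
```

For a real genome, pass an open file handle to `read_fasta` instead of the inline list. Each sequence is held in memory in full, so very large chromosomes may need a streaming approach.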