1
0
7 days ago
Shri hari ▴ 20

Hi,

I tried to repeat-mask a plant genome for an analysis using the following command:

./RepeatMasker -species <name of species> -s -a -poly -dir <out_file path> <path_to>/genomefile.fna


When I annotate the masked and unmasked genomes I get the same number of genes, so I guess that the repeat masking is not being carried out properly.
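For anyone wanting to reproduce the comparison described above, here is a minimal sketch of counting gene features in a GFF-style annotation. It assumes tab-separated rows with the feature type in the third column; the helper name `count_genes` and the example rows are made up for illustration and are not output from the poster's run.

```python
def count_genes(gff_lines):
    """Count rows whose feature type (third column) is 'gene'."""
    n = 0
    for line in gff_lines:
        # Skip blank lines and comment/header lines.
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and fields[2] == "gene":
            n += 1
    return n


# Made-up example rows, only to show the expected layout.
example = [
    "# made-up example\n",
    "chr1\tAUGUSTUS\tgene\t100\t900\t0.9\t+\t.\tg1\n",
    "chr1\tAUGUSTUS\ttranscript\t100\t900\t0.9\t+\t.\tg1.t1\n",
    "chr1\tAUGUSTUS\tgene\t2000\t2600\t0.8\t-\t.\tg2\n",
]
print(count_genes(example))
```

Running the same function on the annotation of the unmasked and of the masked genome (for example via `count_genes(open(path))`) gives the two numbers being compared in the question.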

2
7 days ago
colindaven ★ 2.7k

Repeat masking will not remove genes, to my knowledge. For that, you'll need to reannotate the masked file (or did you do this?).

Generally, repeat masking will annotate repetitive DNA regions, and either soft-mask them (replace with lower-case letters by some conventions, e.g. atgc instead of ATGC) or hard-mask them (replace ATGC with NNNN).
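To make the difference concrete, here is a small sketch of the two masking styles applied to a toy sequence. The function names and the 0-based, end-exclusive interval convention are assumptions for this illustration; this is not how RepeatMasker itself is implemented.

```python
def soft_mask(seq, intervals):
    """Lower-case the bases inside each (start, end) interval."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = chars[i].lower()
    return "".join(chars)


def hard_mask(seq, intervals):
    """Replace the bases inside each (start, end) interval with N."""
    chars = list(seq)
    for start, end in intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = "N"
    return "".join(chars)


seq = "ATGCATGCATGC"
repeats = [(4, 8)]  # toy repeat annotation, 0-based and end-exclusive
print(soft_mask(seq, repeats))  # ATGCatgcATGC
print(hard_mask(seq, repeats))  # ATGCNNNNATGC
```

Soft masking keeps the original bases recoverable, whereas hard masking discards them, which is why downstream tools can treat the two differently.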

Try running GMAP with a transcript set on your original and hard-masked outputs if you want to do a comparison. I wouldn't expect the number of genes/transcripts to change much, since the intergenic DNA is more likely to be repeat-masked.

0

Thank you for the reply. Sorry for not framing my question properly. I was trying to reduce the number of genes in the genome by masking the repeat regions. I used the complete Dfam library, not the small stock one. But to my surprise, the annotated files before and after masking show the same number of genes. I wonder if I'm missing a step, or whether there is a mistake in the way I'm doing it.

0

Which tool are you using for gene prediction? It could be that it does not take the masking into account, or that you need a different kind of masking (for instance soft vs. hard masking, as @colindaven has said).

0

Thank you for the quick response. I'm using Augustus for gene prediction.

2

Hmmm, that one should be able to take masking into account when doing prediction. What kind of masking did you do in your RepeatMasker run? I assume hard masking, as that is the default for RepeatMasker.

Did you provide the .masked version of your FASTA file as input for the gene-prediction run (after masking)?

0

Yes, I went for the default one, and I used the masked version as input for gene prediction.

0

Is there any info in the runtime output of Augustus that it is taking the repeat info into account?

What DB did you use to mask the genome? What does the RepeatMasker output say about how much sequence it could identify as repeat?
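One quick way to answer the "how much was masked" question independently of the report files is to count masked characters in the FASTA directly. The sketch below is only an illustration: the function name `masked_fraction` is made up, it treats N/n as hard-masked and other lower-case letters as soft-masked, and it cannot tell masking Ns apart from assembly-gap Ns, so the result should be compared against the same count on the unmasked file.

```python
def masked_fraction(fasta_lines):
    """Return (hard_fraction, soft_fraction) over all sequence characters."""
    total = hard = soft = 0
    for line in fasta_lines:
        if line.startswith(">"):
            continue  # skip FASTA header lines
        for base in line.strip():
            total += 1
            if base in "Nn":
                hard += 1  # N: hard-masked (or an assembly gap)
            elif base.islower():
                soft += 1  # lower case: soft-masked
    if total == 0:
        return 0.0, 0.0
    return hard / total, soft / total


# Toy record: 16 bases, 4 of them N and 4 lower case.
toy = [">toy\n", "ATGCNNNN\n", "atgcATGC\n"]
print(masked_fraction(toy))  # (0.25, 0.25)
```

If both fractions are essentially the same for the masked and unmasked files, almost nothing was masked, and identical gene counts would be expected.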

0

I used the complete Dfam database. I'm not aware of the runtime info; I will check if you can tell me which file that info would be in.

0

There should be some report file or similar created by RepeatMasker in the folder where you ran the command line (or what is printed to the screen when you run it?).

0

The path to dfam database was given while configuring repeatmasker

The files created had the extensions .align, .masked, .polyout, .genomic.fasta.tbl and genomic.fasta.out. The .tbl file looked like a report, but all of the repetitive element classes listed in it showed 0 elements (0%), except one: unclassified, 264 bp.
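The per-class totals in the .tbl report can be cross-checked against the .out file. The sketch below assumes the usual whitespace-separated .out layout (score, three percentage columns, query name, begin, end, bases left, strand, repeat name, class/family, ...) preceded by header lines; the function name and the two example rows are made up, and overlapping hits are counted twice.

```python
from collections import defaultdict


def repeat_bp_by_class(out_lines):
    """Sum annotated base pairs per repeat class/family from .out-style rows."""
    totals = defaultdict(int)
    for line in out_lines:
        fields = line.split()
        # Data rows start with an integer score; header and blank lines do not.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        begin, end = int(fields[5]), int(fields[6])
        totals[fields[10]] += end - begin + 1
    return dict(totals)


# Made-up rows in the assumed layout, not real output.
example_out = [
    "   SW   perc perc perc  query      position in query   matching  repeat\n",
    "score   div. del. ins.  sequence   begin end  (left)   repeat    class/family\n",
    "\n",
    "  463  1.3  0.6  1.7  scaffold_1  1001  1264  (50000)  +  rnd-1_family-1  Unknown  1  264  (0)  1\n",
    "   18  0.0  0.0  0.0  scaffold_2    10    45    (900)  +  (AT)n  Simple_repeat  1  36  (0)  2\n",
]
print(repeat_bp_by_class(example_out))
```

A result with only a few hundred base pairs in total, as in the report described above, would point to the library having almost no matching families for the genome rather than to a problem in the gene-prediction step.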

1

I usually create a custom-species-specific lib to do the screening with

0

It's probably not a real solution, but I created a quick and dirty alternative to RepeatMasker for hard masking reference genomes here:

https://github.com/colindaven/blacklister

You could try that with the Dfam database ... at least for testing.

0

I'm thinking of making a custom TE library using RepeatModeler and using that for running RepeatMasker.

Hope that works.
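As a rough outline of that plan, the sketch below only assembles the three command lines (it does not run anything). It assumes the commonly documented RepeatModeler workflow, in which `BuildDatabase -name` creates a database, `RepeatModeler -database` writes a `<name>-families.fa` consensus library, and `RepeatMasker -lib` screens with it; the function name, file names and the choice of `-xsmall` (soft masking) are illustrative assumptions, and options should be checked against the installed versions.

```python
import shlex


def custom_library_commands(genome, db_name, out_dir):
    """Build (but do not run) the commands for a custom-library masking run."""
    library = f"{db_name}-families.fa"  # assumed RepeatModeler output name
    return [
        ["BuildDatabase", "-name", db_name, genome],
        ["RepeatModeler", "-database", db_name],
        # -xsmall asks for lower-case (soft) masking instead of Ns.
        ["RepeatMasker", "-lib", library, "-xsmall", "-dir", out_dir, genome],
    ]


# Hypothetical file and directory names, for illustration only.
for cmd in custom_library_commands("genomefile.fna", "mygenome", "masked_out"):
    print(shlex.join(cmd))
```

Keeping the commands as lists makes it easy to pass them to `subprocess.run` one at a time once the paths and options have been verified.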

2

What I was able to understand is that the current version of Dfam (3.3) doesn't have a large library of plant TEs. RepBase has a much larger selection of plant TE families, but it needs a subscription (for those who can get one, it will be an easy solution).