Build consensus sequences from repeat masker output
9 months ago

Hi,

I have a RepeatMasker output file for a new organism (a crustacean), and I want to use the transposable elements in this species to analyse piRNAs (using my own short-read sequencing data).

The problem is that I would like consensus sequences for the transposable elements in this species, rather than every position in the genome where a transposon occurs. If the same transposon exists in 100 copies in the genome, it appears 100 times in the RepeatMasker output.
Ideally I would like to end up with a multi-FASTA file like the ones in Repbase, but I am a bit lost about how to use the RepeatMasker output to achieve this.

Any suggestions would be very helpful! Thanks

repeatmasker Transposable elements
I think the easiest way would be to convert the coordinates to a BED file and then use bedtools to extract the sequences from the genome FASTA. Once you have the sequences, you can build a consensus.
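As a rough illustration of the coordinate step, here is a minimal Python sketch that turns RepeatMasker `.out` rows into BED6 rows. It assumes the standard whitespace-separated `.out` layout (score, divergence columns, query sequence, begin, end, left, strand, repeat name, class/family, ...); the function name `rmsk_out_to_bed` and the `name#class` label are my own choices, not part of RepeatMasker or bedtools.

```python
def rmsk_out_to_bed(lines):
    """Yield BED6 tuples (chrom, start, end, name, score, strand)
    from the lines of a RepeatMasker .out file (layout assumed, see above)."""
    for line in lines:
        fields = line.split()
        # Header and blank lines are skipped: data rows start with the
        # integer Smith-Waterman score and have at least 11 columns.
        if len(fields) < 11 or not fields[0].isdigit():
            continue
        score, chrom = fields[0], fields[4]
        begin, end = int(fields[5]), int(fields[6])
        # RepeatMasker marks the reverse strand with "C" (complement).
        strand = "-" if fields[8] == "C" else "+"
        # Hypothetical naming choice: repeat name plus class/family.
        name = f"{fields[9]}#{fields[10]}"
        # .out coordinates are 1-based inclusive; BED starts are 0-based.
        yield (chrom, begin - 1, end, name, score, strand)


def write_bed(out_path, bed_path):
    """Convert a .out file on disk to a tab-separated BED file."""
    with open(out_path) as src, open(bed_path, "w") as dst:
        for row in rmsk_out_to_bed(src):
            dst.write("\t".join(str(value) for value in row) + "\n")
```

The resulting BED file can then be passed to `bedtools getfasta` together with the genome FASTA (the `-s` option makes it respect the strand column, so reverse-strand copies are reverse-complemented).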

Thanks for the comment. I have already extracted the FASTA sequences. I guess the way forward would be to do some sort of clustering on the sequences, but I am just not sure about that.

You should have the name of each repeat. You can start by grouping the sequences on that and then build a consensus for each group.
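To make the grouping idea concrete, here is a small Python sketch. It assumes each FASTA header starts with the repeat name, optionally followed by `::` and coordinates (a hypothetical convention, so adjust the split to whatever your headers look like), and that the copies in a group have already been aligned to equal length with a multiple-alignment tool. The helper names are mine, and a simple majority vote is only one of several ways to call a consensus.

```python
from collections import Counter, defaultdict


def group_by_repeat(records):
    """Group (header, sequence) pairs by repeat name, taken here as the
    header text before the first '::' (an assumed header convention)."""
    groups = defaultdict(list)
    for header, seq in records:
        groups[header.split("::")[0]].append(seq)
    return dict(groups)


def majority_consensus(aligned, gap="-"):
    """Column-wise majority consensus of pre-aligned sequences.
    Columns where the gap character is the most common symbol are dropped."""
    if not aligned or len({len(seq) for seq in aligned}) != 1:
        raise ValueError("need at least one sequence, all of equal aligned length")
    consensus = []
    for column in zip(*aligned):
        # Most frequent symbol in this alignment column (case-insensitive).
        symbol, _ = Counter(base.upper() for base in column).most_common(1)[0]
        if symbol != gap:
            consensus.append(symbol)
    return "".join(consensus)
```

With highly diverged or fragmented copies a plain majority vote can give a poor consensus, so it is worth inspecting the alignment for each group before trusting the result.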

6 months ago
bioinfo • 0

There is a script shipped in the RepeatMasker directory that I assume will solve your problem; it can be found here

What you can do is retrieve from the database all the repeats for the taxon you are interested in.

Run it as below:

util/queryRepeatDatabase.pl -species YourSpecies  > YourSpecies_repetitions.lib