Question: Remove or mask repeat regions from .fasta file
4.1 years ago by
United States
aberry81440 wrote:

I have a genome in a single FASTA file: about 40 Mbp spread across ~30,000 separate sequences (contigs). About half of it is expected to be repetitive DNA. I am looking for a tool to either:

1) cut repeats from the original file and paste to a new fasta file

2) delete repeat regions from file

or 3) mask repeat regions (replace all repetitive sequences with N)

The first option is ideal, but for any of the three choices I want to be as liberal as possible with the definition of “repetitive DNA”. I want to avoid any potential repeat at all costs. Losing good data is better than keeping repeat data in this scenario.

Note that I don't want to reduce the number of times a sequence is repeated, but I want to delete or mask every instance of that repeat so that it is not found a single time in my genome file. 

Any suggestions for tools that will perform any of these tasks? Thanks!

4.1 years ago by
Brice Sarver2.6k
United States
Brice Sarver2.6k wrote:

The R package Biostrings, from Bioconductor, will accomplish this. Here's a tutorial, including specific examples on masking. It's fast.

Edit: the example uses regions identified by RepeatMasker, which is a place to start if you haven't identified your repeats yet.
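For readers who would rather not use R, here is a minimal Python sketch of the same idea. It is not the tutorial's code. It assumes you already have repeat coordinates for each contig as 0-based, half-open (start, end) pairs; RepeatMasker's .out table reports 1-based inclusive positions, so subtract 1 from each start when converting. The function name `split_repeats` and the toy contig are made up for illustration.

```python
def split_repeats(seq, intervals, mask_char="N"):
    """Return (masked, deleted, repeat_pieces) for one contig.

    intervals: iterable of 0-based, half-open (start, end) pairs.
    """
    masked = list(seq)
    kept = []      # non-repeat stretches, for the "delete" option
    repeats = []   # repeat stretches, for the "cut and paste" option
    cursor = 0
    for start, end in sorted(intervals):
        # Clamp to the contig so bad coordinates cannot run off the end.
        start, end = max(0, start), min(len(seq), end)
        if start >= end:
            continue
        repeats.append(seq[start:end])
        masked[start:end] = mask_char * (end - start)
        if start > cursor:
            kept.append(seq[cursor:start])
        cursor = max(cursor, end)
    kept.append(seq[cursor:])
    return "".join(masked), "".join(kept), repeats


# Toy contig with one annotated repeat at positions 4..9 (hypothetical data).
masked, deleted, repeats = split_repeats("ACGTTTTTTTACGT", [(4, 10)])
print(masked)   # ACGTNNNNNNACGT
print(deleted)  # ACGTACGT
print(repeats)  # ['TTTTTT']
```

The three return values correspond to the three options in the question: write `repeats` to a new FASTA file, keep `deleted`, or keep `masked`. Overlapping intervals are masked and deleted correctly, but each one is still reported as its own piece in `repeats`.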


Thanks, this looks good. 

Reply written 4.1 years ago by aberry814
4.1 years ago by
State College, PA, USA
Biomonika (Noolean)3.1k wrote:

Which repeats do you have in mind? RepeatMasker works well for interspersed repeats, but if repetition of small k-mers also bothers you (for example, a 32-mer present in multiple places in the genome), you could create a GEM mappability track. It shows which regions are unique and which are not, given the k-mer length you are interested in and the number of mismatches allowed (
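To make the k-mer idea concrete, here is a rough Python sketch. It is not GEM and does not reproduce its output: it only counts exact k-mers on the forward strand, ignores mismatches and reverse complements, and holds every k-mer in memory, so treat it as an illustration on small inputs rather than something to run on a 40 Mbp assembly. The function name `mask_repeated_kmers` and the toy contigs are made up.

```python
from collections import Counter


def mask_repeated_kmers(contigs, k, mask_char="N"):
    """Mask every base covered by a k-mer that occurs more than once.

    contigs: dict mapping contig name -> sequence string.
    Counts are pooled across all contigs (exact matches, forward strand only).
    """
    counts = Counter()
    for seq in contigs.values():
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1

    masked = {}
    for name, seq in contigs.items():
        chars = list(seq)
        for i in range(len(seq) - k + 1):
            # Look up k-mers in the original sequence, not the masked copy.
            if counts[seq[i:i + k]] > 1:
                chars[i:i + k] = mask_char * k
        masked[name] = "".join(chars)
    return masked


# Two toy contigs sharing the 4-mers ACGT and TACG (hypothetical data).
result = mask_repeated_kmers({"a": "ACGTACGA", "b": "TTACGTCC"}, k=4)
print(result["a"])  # NNNNNNNA
print(result["b"])  # TNNNNNCC
```

Every copy of a repeated k-mer is masked, not just the extra ones, which matches the requirement in the question that no instance of a repeat should remain.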


Thanks. I should have specified that I'm interested in interspersed repeats, but I'll keep this in mind in case RepeatMasker isn't sufficient.

Reply written 4.1 years ago by aberry814