Question

Remove or mask repeat regions from .fasta file

2

8.9 years ago

aberry814 ▴ 80

I have a single FASTA file containing a genome of about 40 Mbp spread across ~30,000 separate sequences (contigs). About half of it is expected to be repetitive DNA. I am looking for a tool to do one of the following:

1) cut repeats from the original file and paste to a new fasta file

2) delete repeat regions from file

or 3) mask repeat regions (replace all repetitive sequences with N)

The first option is ideal, but for any of the three choices I want to be as liberal as possible with the definition of "repetitive DNA". I want to avoid any potential repeat at all costs. Losing good data is better than keeping repeat data in this scenario.

Note that I don't want to reduce the number of times a sequence is repeated, but I want to delete or mask every instance of that repeat so that it is not found a single time in my genome file.

Any suggestions for tools that will perform any of these tasks? Thanks!

genome sequence • 5.6k views

updated 16 months ago by Ram 43k • written 8.9 years ago by aberry814 ▴ 80

1

8.9 years ago

Biomonika (Noolean) 3.2k

Which repeats do you have in mind? RepeatMasker works well for interspersed repeats. If repetition of short k-mers also bothers you (for example, a 32-mer present in multiple places in the genome), you could create a GEM mappability track. Given the k-mer length you are interested in and the number of mismatches to allow, it shows which regions are unique and which are not (http://sourceforge.net/projects/gemlibrary/, http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377).
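To make the k-mer idea concrete, here is a minimal Python sketch. It is not GEM and not how GEM works internally: it only counts exact k-mers on the forward strand (no mismatches, no reverse complement) and flags every position covered by a k-mer seen more than once. The function name `repeated_kmer_mask` and the toy sequences are made up for illustration, and a plain `Counter` like this would likely need too much memory for a 40 Mbp genome with large k.

```python
from collections import Counter


def repeated_kmer_mask(sequences, k):
    """Flag positions covered by any k-mer that occurs more than once.

    sequences: dict mapping contig name -> sequence string
    Returns a dict mapping contig name -> list of booleans
    (True = covered by a repeated k-mer; exact matches only).
    """
    # First pass: count every k-mer across all contigs.
    counts = Counter()
    for seq in sequences.values():
        s = seq.upper()
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += 1

    # Second pass: mark positions covered by k-mers seen more than once.
    flags = {}
    for name, seq in sequences.items():
        s = seq.upper()
        covered = [False] * len(s)
        for i in range(len(s) - k + 1):
            if counts[s[i:i + k]] > 1:
                for j in range(i, i + k):
                    covered[j] = True
        flags[name] = covered
    return flags


# Toy example: the 4-mer ACGT occurs once in each contig.
toy = {"a": "ACGTTTGCA", "b": "GGACGTCC"}
flags = repeated_kmer_mask(toy, 4)
```

Positions flagged `True` could then be replaced with N, which is a much more aggressive definition of "repeat" than interspersed-repeat annotation.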

updated 4.6 years ago by Ram 43k • written 8.9 years ago by Biomonika (Noolean) 3.2k

0

Thanks. I should have specified that I'm interested in interspersed repeats, but I'll keep this in mind in case RepeatMasker isn't sufficient.

8.9 years ago by aberry814 ▴ 80

Ram · Accepted Answer · 2015-06-03

3

8.9 years ago

Brice Sarver ★ 3.8k

The R package Biostrings, from Bioconductor, will accomplish this. Here's a tutorial, including specific examples on masking. It's fast.

Edit: the example uses regions identified by RepeatMasker, so run RepeatMasker first if you haven't identified them yet.
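For anyone who prefers not to use R, the same idea can be sketched in plain Python. This is not the Biostrings approach from the tutorial, just an illustration of options 1 and 3 from the question: given repeat coordinates, replace them with N and also collect the removed pieces as separate records. The interval format (contig name, 0-based start, end-exclusive end) is an assumption; coordinates from RepeatMasker output would need converting to it first. The names `read_fasta` and `mask_and_extract` are hypothetical.

```python
from collections import defaultdict


def read_fasta(lines):
    """Yield (header, sequence) pairs from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def mask_and_extract(records, intervals):
    """Hard-mask repeat intervals with N and collect the masked pieces.

    records:   iterable of (header, sequence)
    intervals: iterable of (contig, start, end); 0-based, end-exclusive
               (an assumed format, not RepeatMasker's native output)
    Returns (masked_records, repeat_records).
    """
    by_contig = defaultdict(list)
    for name, start, end in intervals:
        by_contig[name].append((start, end))

    masked, repeats = [], []
    for header, seq in records:
        name = header.split()[0]  # contig ID = first word of the header
        chars = list(seq)
        for start, end in sorted(by_contig.get(name, [])):
            # Clip intervals to the contig so bad coordinates cannot grow it.
            start, end = max(0, start), min(len(seq), end)
            if start >= end:
                continue
            repeats.append((f"{name}:{start}-{end}", seq[start:end]))
            chars[start:end] = "N" * (end - start)
        masked.append((header, "".join(chars)))
    return masked, repeats


# Toy input standing in for a genome file and a repeat annotation.
fasta_lines = [">contig1 demo", "ACGTACGTAC", "GT", ">contig2", "TTTTGGGG"]
repeat_intervals = [("contig1", 2, 6), ("contig2", 0, 4)]
masked, repeats = mask_and_extract(read_fasta(fasta_lines), repeat_intervals)
```

Writing `repeats` to a second FASTA file gives option 1, while `masked` alone gives option 3. Deleting the regions outright (option 2) would shift all downstream coordinates, so masking is usually the safer choice.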

updated 4.6 years ago by Ram 43k • written 8.9 years ago by Brice Sarver ★ 3.8k

0

Thanks, this looks good.

updated 16 months ago by Ram 43k • written 8.9 years ago by aberry814 ▴ 80