Remove or mask repeat regions from .fasta file
8.9 years ago
aberry814 ▴ 80

I have a genome in a single FASTA file: about 40 Mbp spread across ~30,000 separate sequences (contigs). About half of it is expected to be repetitive DNA. I am looking for a tool that will do one of the following:

1) cut the repeats out of the original file and write them to a new FASTA file

2) delete the repeat regions from the file

or 3) mask the repeat regions (replace all repetitive sequence with N)

The first option is ideal, but for any of the three choices I want to be as liberal as possible with the definition of "repetitive DNA". I want to avoid any potential repeat at all costs. Losing good data is better than keeping repeat data in this scenario.

Note that I don't want to merely reduce the number of copies of a repeated sequence; I want to delete or mask every instance, so that the repeat does not appear even once in my genome file.

Any suggestions for tools that will perform any of these tasks? Thanks!

genome sequence • 5.6k views
8.9 years ago
Brice Sarver ★ 3.8k

The R package Biostrings, from Bioconductor, will accomplish this. Here's a tutorial, including specific examples on masking. It's fast.

Edit: the example uses regions identified by RepeatMasker, in case you haven't identified your repeats yet.
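
If R is not convenient, the same idea can be sketched in plain Python. This is only an illustration, not the Biostrings tutorial's code: the function name `split_repeats` and the toy contig are made up, and the intervals are assumed to be 0-based and end-exclusive (coordinates from a repeat finder would need converting to that convention first). It covers all three options from the question: a hard-masked copy, a repeat-free copy, and the excised repeat fragments.

```python
def split_repeats(seq, intervals):
    """Return (hard_masked_seq, repeat_free_seq, repeat_fragments).

    intervals: iterable of (start, end) pairs, assumed 0-based and
    end-exclusive (hypothetical convention chosen for this sketch).
    """
    # Merge overlapping or adjacent intervals so each base is handled once.
    merged = []
    for start, end in sorted(intervals):
        start, end = max(start, 0), min(end, len(seq))
        if start >= end:
            continue
        if merged and start <= merged[-1][1]:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])

    masked, kept, fragments = [], [], []
    pos = 0
    for start, end in merged:
        unique = seq[pos:start]
        masked.append(unique)
        kept.append(unique)
        masked.append("N" * (end - start))   # option 3: mask with N
        fragments.append(seq[start:end])     # option 1: repeats for a new file
        pos = end
    tail = seq[pos:]
    masked.append(tail)
    kept.append(tail)                        # option 2: repeats deleted
    return "".join(masked), "".join(kept), fragments


# Toy contig with two overlapping repeat annotations.
contig = "ACGTACGTTTTTGGGCCC"
hard_masked, repeat_free, repeats = split_repeats(contig, [(4, 8), (6, 12)])
print(hard_masked)
print(repeat_free)
print(repeats)
```

To be as liberal as the question asks, the intervals could be padded by some number of bases on each side before calling the function; merging takes care of any overlaps that creates.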

Thanks, this looks good.

8.9 years ago

Which repeats do you have in mind? RepeatMasker is great for interspersed repeats, but if repetition of short k-mers also bothers you (for example, a 32-mer present in multiple places in the genome), you could create a GEM mappability track. Given the k-mer length you are interested in and the number of mismatches allowed, it shows which regions are unique and which are not (http://sourceforge.net/projects/gemlibrary/, http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0030377 )
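
To make the k-mer idea concrete, here is a rough Python sketch. It is not GEM and does not reproduce its mappability score: it only handles exact matches (zero mismatches), looks at the forward strand only, and keeps every k-mer in memory, so for a 40 Mbp genome a dedicated k-mer counter would probably be more realistic. The function name `mask_repeated_kmers` and the toy contigs are made up for illustration.

```python
from collections import Counter


def mask_repeated_kmers(contigs, k):
    """Mask every base covered by a k-mer that occurs more than once.

    contigs: dict mapping contig name -> sequence.
    Exact matches only, forward strand only (simplifications of this sketch).
    """
    # First pass: count every k-mer across all contigs.
    counts = Counter()
    for seq in contigs.values():
        upper = seq.upper()
        for i in range(len(upper) - k + 1):
            counts[upper[i:i + k]] += 1

    # Second pass: replace each window whose k-mer is not unique with N.
    masked = {}
    for name, seq in contigs.items():
        upper = seq.upper()
        chars = list(seq)
        for i in range(len(upper) - k + 1):
            if counts[upper[i:i + k]] > 1:
                chars[i:i + k] = "N" * k
        masked[name] = "".join(chars)
    return masked


# Toy example: the 4-mer ACGT is shared between the two contigs.
toy = {"c1": "AAACGTAC", "c2": "GGACGTTT"}
result = mask_repeated_kmers(toy, 4)
print(result)
```

A smaller k masks more aggressively, which fits the "lose good data rather than keep a repeat" goal, at the cost of masking short sequences that recur by chance.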


Thanks. I should have specified that I'm interested in interspersed repeats, but I'll keep this in mind in case RepeatMasker isn't sufficient.
