Coverage Estimates for Masked Genome
7.5 years ago
Magpie101 • 0

Does anyone know how to remove repetitive elements from a genome masked with RepeatMasker? I'm trying to estimate the fraction of the reference genome covered by Illumina reads from two BWA mapping runs: the first against the full genome, ignoring repetitive elements, and the second with repetitive elements removed, so that I get an estimate of the fraction covered within the 'mappable' regions of the genome.

The masked regions are in lower case while the rest is in upper case. I thought I could just use 'find and replace' to remove the lower-case bases, but I can't open genome-sized files in my text editor.

Hope this makes sense and thanks in advance.

masked repeats Assembly • 1.7k views
7.5 years ago

You can likely just download a hardmasked version of the genome from UCSC.

If you need to hardmask the file yourself, use something like tr 'acgt' 'NNNN' < file.fa > hardmasked.fa (note that the chromosome names might get screwed up, since tr also rewrites any lower-case a, c, g or t in the header lines).
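One way to avoid mangling the chromosome names is to skip the header lines while masking. The following is only a minimal sketch, not a tested pipeline: the function name hardmask_fasta is made up for illustration, and it assumes the soft-masked bases are limited to lower-case a, c, g, t and n.

```python
import io

# Lower-case (soft-masked) bases become N; every other character is unchanged.
# Assumption: only a, c, g, t and n occur in lower case.
_SOFT_TO_HARD = str.maketrans("acgtn", "NNNNN")


def hardmask_fasta(src, dst):
    """Copy FASTA text from src to dst, hard-masking lower-case bases."""
    for line in src:
        if line.startswith(">"):
            # Header line: write it out untouched so names stay intact.
            dst.write(line)
        else:
            dst.write(line.translate(_SOFT_TO_HARD))


if __name__ == "__main__":
    demo = io.StringIO(">chr1 test contig\nGAATCggactTTAC\n")
    out = io.StringIO()
    hardmask_fasta(demo, out)
    print(out.getvalue(), end="")
```

Because it reads one line at a time, the same function should work on real files opened with open("file.fa") and open("hardmasked.fa", "w") without loading the whole genome into memory.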


Hi Devon, thanks for your reply. What I'm trying to do is completely remove the repetitive elements from the sequences. So, for example, the soft-masked sequence GAATCggactTTAC becomes GAATCTTAC. With a hardmasked genome, if I remove all the Ns then I'll also remove the missing data.
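To make the distinction concrete, here is a small illustrative sketch on a toy sequence (the function names are hypothetical). It compares deleting only the lower-case bases with hard-masking first and then deleting every N, which also throws away Ns that were already in the assembly as missing data.

```python
def drop_soft_masked(seq):
    """Delete lower-case (soft-masked) bases, keeping pre-existing Ns."""
    return "".join(base for base in seq if not base.islower())


def hardmask_then_drop_n(seq):
    """Hard-mask lower-case bases to N, then delete every N."""
    hard = "".join("N" if base.islower() else base for base in seq)
    return hard.replace("N", "")


if __name__ == "__main__":
    # Toy sequence: a soft-masked repeat followed by two Ns of missing data.
    seq = "GAATCggactNNTTAC"
    print(drop_soft_masked(seq))      # missing-data Ns survive
    print(hardmask_then_drop_n(seq))  # missing-data Ns are lost too
```

Either way the remaining bases shift position, so coordinates in the edited sequence no longer correspond to the original reference.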


Eek, that's a great way to produce a meaningless metric. I strongly encourage you to only hard-mask (and even that's extreme, since you can at least partially align uniquely to repeat regions). So while I could show you how to do what you want, I won't.


Hi Devon, I did manage to work it out, simply using find and replace on the Ubuntu Linux system I'm using. I then came to the same conclusion as you :)

What I'm actually after is the effective genome size (or 'mappability' of the reference genome). I'm going to try out GEM. Unfortunately we have no bioinformaticians in my research group and rarely work with genome-scale datasets, so it's all pretty new (and complex) to me.

Cheers
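As a rough first look before running a proper mappability tool, the bases in a soft-masked FASTA can be tallied by class without deleting anything. This is only a crude sketch under stated assumptions: the function name count_base_classes is invented here, masked bases are assumed to be lower-case a, c, g or t, and the unmasked count is not a substitute for a k-mer based mappability estimate such as the one GEM produces.

```python
import io
from collections import Counter


def count_base_classes(src):
    """Tally unmasked, soft-masked and N bases in FASTA text."""
    counts = Counter(unmasked=0, soft_masked=0, gap=0)
    for line in src:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        gaps = seq.count("N") + seq.count("n")
        # Assumption: soft-masked bases are only lower-case a, c, g, t.
        masked = sum(seq.count(base) for base in "acgt")
        counts["gap"] += gaps
        counts["soft_masked"] += masked
        counts["unmasked"] += len(seq) - gaps - masked
    return counts


if __name__ == "__main__":
    demo = io.StringIO(">chr1\nGAATCggactNNTTAC\n")
    counts = count_base_classes(demo)
    total = sum(counts.values())
    for name, n in counts.items():
        print(f"{name}\t{n}\t{n / total:.3f}")
```

The unmasked count could then serve as a denominator when reporting the covered fraction outside annotated repeats, while the sequence and its coordinates stay exactly as in the reference.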
