Question: Coverage Estimates for Masked Genome
Magpie101 wrote (18 months ago):

Does anyone know how to remove repetitive elements from a genome masked with RepeatMasker? What I'm trying to do is get an estimate of the fraction of the reference genome covered with Illumina reads from two mapping runs with BWA: the first run not taking into account repetitive elements and the second with repetitive elements removed so that I have an estimate of the fraction covered of the 'mappable' regions of the genome.

The masked regions are in lower case while the rest is in upper case. I thought I could just use 'find and replace' to remove the lower-case bases, but I can't open genome-sized files in my text editor.

Hope this makes sense and thanks in advance.

masked repeats assembly • 583 views
Devon Ryan (Freiburg, Germany) wrote (18 months ago):

You can likely just download a hardmasked version of the genome from UCSC.

If you need to hardmask the file yourself, use tr 'acgt' 'NNNN' < file.fa > hardmasked.fa or something like that (note that this also rewrites any lowercase a, c, g or t in the header lines, so the chromosome names might get screwed up).

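A minimal Python sketch of the same hard-masking idea that leaves header lines alone, so the chromosome names survive. It is not from the thread: the function name `hardmask_lines` is made up for illustration, and it assumes a plain (uncompressed) FASTA file in which soft-masked bases are lowercase a, c, g, t or n.

```python
# Translation table: lowercase (soft-masked) bases become N, everything else is unchanged.
_MASK = str.maketrans("acgtn", "NNNNN")


def hardmask_lines(lines):
    """Yield FASTA lines with lowercase bases replaced by N; headers pass through."""
    for line in lines:
        if line.startswith(">"):
            # Header line: keep the chromosome name exactly as it is.
            yield line
        else:
            yield line.translate(_MASK)


# Small in-memory example; a real file object can be passed the same way,
# because iterating over an open file yields one line at a time.
example = [">chr1 example\n", "GAATCggactTTAC\n"]
masked = list(hardmask_lines(example))
```

Since the generator reads one line at a time, something like `out.writelines(hardmask_lines(inp))` on two open file handles should cope with genome-sized files that a text editor cannot open.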

Hi Devon, thanks for your reply. What I'm trying to do is completely remove the repetitive elements from the sequences. So for example in a soft-masked sequence GAATCggactTTAC becomes GAATCTTAC. With a hardmasked genome if I remove all N's then I'll also remove missing data.

written 18 months ago by Magpie101
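For reference, the transformation described in this reply (GAATCggactTTAC becoming GAATCTTAC) could be sketched in Python roughly as below. The function names are hypothetical and the input is assumed to be a plain soft-masked FASTA file. Deleting bases shifts every downstream coordinate, so positions in the output no longer correspond to the original reference.

```python
def strip_softmasked(seq):
    """Return seq with every lowercase (soft-masked) character removed."""
    return "".join(base for base in seq if not base.islower())


def strip_softmasked_fasta(lines):
    """Yield FASTA lines with lowercase bases dropped; header lines pass through."""
    for line in lines:
        if line.startswith(">"):
            yield line
        else:
            # The trailing newline is not lowercase, so it is kept;
            # a fully masked line therefore becomes an empty line.
            yield strip_softmasked(line)


stripped = strip_softmasked("GAATCggactTTAC")
```

Uppercase N characters are kept, so missing data stays distinguishable from removed repeats.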

Eek, that's a great way to produce a meaningless metric. I strongly encourage you to only hard-mask (and even that's extreme, since you can at least partially align uniquely to repeat regions). So while I could show you how to do what you want, I won't.

written 18 months ago by Devon Ryan

Hi Devon, I did manage to work it out simply using find and replace on the Ubuntu Linux system I'm using. I then came to the same conclusion as you :)

What I'm actually after is the effective genome size (or 'mappability' of the reference genome). I'm going to try out GEM. Unfortunately we have no bioinformaticians in my research group and rarely work on genome-scale datasets, so it's all pretty new (and complex) to me.

Cheers

written 18 months ago by Magpie101
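As a rough first look before running a mappability tool, the bases in a soft-masked FASTA can simply be counted by class without editing the sequence. The sketch below is an illustration under assumptions, not the GEM method: `count_base_classes` is a made-up name, and the unmasked fraction it gives is only a crude proxy for effective genome size, since repeats can still be partly mappable and unmasked sequence is not guaranteed to be unique.

```python
def count_base_classes(lines):
    """Count unmasked (uppercase), masked (lowercase) and gap (N/n) bases in FASTA lines."""
    counts = {"unmasked": 0, "masked": 0, "gap": 0}
    for line in lines:
        if line.startswith(">"):
            continue  # skip header lines
        seq = line.strip()
        gaps = seq.count("N") + seq.count("n")
        # Lowercase characters other than 'n' are treated as soft-masked repeats.
        masked = sum(1 for base in seq if base.islower()) - seq.count("n")
        counts["gap"] += gaps
        counts["masked"] += masked
        counts["unmasked"] += len(seq) - gaps - masked
    return counts


counts = count_base_classes([">chr1\n", "GAATCggactTTACNNnn\n"])
total = sum(counts.values())
unmasked_fraction = counts["unmasked"] / total
```

Dividing the number of covered positions by `counts["unmasked"]` instead of by the full genome length gives a coverage fraction over non-repeat, non-gap sequence while keeping the original coordinates intact.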