Question: Lastz Genome Alignment Tool: How Does Softmasking Work?
Asked 6.7 years ago by Sujai Kumar (United Kingdom):

I need to do a whole-genome alignment on a largish metazoan (~1.8 Gbp), and the sensible approach seems to be to mask the repetitive elements first, so that the number of all-versus-all seed matches is reduced.

The Lastz whole genome alignment tool has a softmasking option where masked regions aren't used for finding seeds (but can be used when extending seeds to build alignments). That sounds perfect to me.

However, I can't figure out how to specify the softmasked regions. Does someone know how to do this?

The Lastz documentation at http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_mask says

Sequence Masking File

This file is used with the xmask and nmask actions in a sequence specifier. It consists of one interval per line, without sequence names. Lines beginning with a # are considered to be comments and are ignored, as are blank lines. Only the first two whitespace-delimited words in any line are interpreted as the interval; the rest of the line is ignored.
Each interval describes a region to be masked, and consists of
    <start> <end>
Locations are one-based and inclusive on both ends (i.e., they use the origin-one, closed position numbering system). Note that the masking intervals are counted along the forward strand, even if we are only aligning to the reverse complement of the query specifier (i.e. for ‑‑strand=minus).
Here is an example. If the target sequence is hg18.chr1, this would mask the 5' UTRs from several genes. Note that the third column is neither required nor interpreted by LASTZ, and acts as a comment.
     884484  884542  NM_015658
     885830  885936  NM_198317
     891740  891774  NM_032129
     925217  925333  NM_021170
     938742  938816  NM_005101
     945366  945415  NM_198576
    1016787 1016808  NM_001114103
    1017234 1017346  NM_001114103
    1041303 1041486  NM_001114103
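
Editor's sketch: the format rules quoted above (comments, blank lines, only the first two words used, one-based inclusive coordinates) can be illustrated with a small Python reader. This is not LASTZ code; the function names `parse_mask_lines` and `to_zero_based_half_open` are made up for illustration, and the handling of indented `#` lines is an assumption.

```python
def parse_mask_lines(lines):
    """Return (start, end) pairs from LASTZ-style masking lines.

    Coordinates are kept exactly as written: one-based, inclusive on both ends.
    """
    intervals = []
    for line in lines:
        stripped = line.strip()
        # Skip blank lines and comments. Treating an indented '#' as a
        # comment is an assumption of this sketch.
        if not stripped or stripped.startswith("#"):
            continue
        words = stripped.split()
        # Only the first two whitespace-delimited words are the interval;
        # anything after them acts as a comment.
        intervals.append((int(words[0]), int(words[1])))
    return intervals


def to_zero_based_half_open(start, end):
    """Convert a one-based inclusive interval to a Python-style slice range."""
    return start - 1, end


example = """\
# 5' UTRs on hg18.chr1
 884484  884542  NM_015658
 885830  885936  NM_198317

1016787 1016808  NM_001114103
""".splitlines()

intervals = parse_mask_lines(example)
print(intervals)
```

The conversion helper matters when the intervals are applied to a Python string: the interval 884484 to 884542 corresponds to the slice `seq[884483:884542]`.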

My question is: if the target and query files (for alignment) each contain multiple sequences, how do you specify the masked regions when each line only allows <start> <end>? You would also need a <sequenceid> specifier.

Do you know if there is a way around this or a different tool that will do the same job with masked sequences?

Answer written 6.7 years ago by Istvan Albert (University Park, USA):

I think this has to do with the way lastz works: it was designed to work with a single target.

When you give it multiple targets they get concatenated. So you may need to create masking for concatenated sequences.
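
Editor's sketch: if you go the concatenation route, the per-sequence mask coordinates have to be shifted by the total length of the sequences that come before them. The Python below shows only that arithmetic. It assumes you build the concatenated sequence yourself, joining the sequences end to end in file order with no separator; the thread does not confirm that this matches what lastz does internally, and the function name is hypothetical.

```python
def concatenated_intervals(seq_lengths, per_seq_intervals):
    """Shift per-sequence intervals into coordinates of one joined sequence.

    seq_lengths: list of (name, length) in the order the sequences are joined.
    per_seq_intervals: dict mapping name -> list of (start, end), one-based
    inclusive, relative to that sequence.
    Assumes the sequences are joined end to end with no separator.
    """
    shifted = []
    offset = 0  # total length of all sequences placed before the current one
    for name, length in seq_lengths:
        for start, end in per_seq_intervals.get(name, []):
            if not (1 <= start <= end <= length):
                raise ValueError(f"interval {start}-{end} does not fit {name}")
            shifted.append((start + offset, end + offset))
        offset += length
    return shifted


lengths = [("scaf1", 100), ("scaf2", 50)]
masks = {"scaf1": [(10, 20)], "scaf2": [(1, 5)]}
print(concatenated_intervals(lengths, masks))
```

Here the interval 1 to 5 on scaf2 becomes 101 to 105, because scaf1 occupies positions 1 to 100 of the joined sequence.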

I would double check with the author on this though. You could use the request help via email link and enter his email address: rsharris@bx.psu.edu


Lastz author here; I came across this thread while searching for something else. For the sake of completeness, here's the scoop on softmasking and lastz.

If you do nothing else, any lowercase nucleotide in your sequence file will be considered as soft-masked.

Additionally, lastz can perform some operations on the sequence after loading it. One of these is to apply a file of intervals, changing all the indicated bases to lowercase.

As for the masking file not recognizing sequence names, this does come from the design history of originally working only with a single sequence.  Recognizing three-column masking files was on my todo list for a while.  But at this point, I'm not actively making changes other than bug fixes.

The simplest workaround, rather than concatenating the whole input sequence, would be to pipe the sequence through something that does the softmasking, and then pipe it into lastz (or save it in a temporary file).
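
Editor's sketch of such a softmasking step, in Python. It is one possible way to do it, not part of lastz: it lowercases the given intervals in every sequence of a multi-FASTA stream, so the result can be saved to a temporary file or piped onward. The function names, the dictionary of intervals keyed by sequence name, and the 60-column line width are all assumptions of this sketch.

```python
import io


def read_fasta(handle):
    """Yield (name, sequence) pairs; the name is the first word of the header."""
    name, chunks = None, []
    for line in handle:
        line = line.rstrip("\n")
        if line.startswith(">"):
            if name is not None:
                yield name, "".join(chunks)
            parts = line[1:].split()
            name, chunks = (parts[0] if parts else ""), []
        elif name is not None:
            chunks.append(line.strip())
    if name is not None:
        yield name, "".join(chunks)


def soft_mask(seq, intervals):
    """Lowercase one-based, inclusive intervals; out-of-range ends are clipped."""
    bases = list(seq)
    for start, end in intervals:
        lo = max(start, 1) - 1
        hi = min(end, len(bases))
        bases[lo:hi] = [b.lower() for b in bases[lo:hi]]
    return "".join(bases)


def soft_mask_fasta(fasta_handle, intervals_by_name, out_handle, width=60):
    """Write a softmasked copy of the FASTA (header descriptions are dropped)."""
    for name, seq in read_fasta(fasta_handle):
        masked = soft_mask(seq, intervals_by_name.get(name, []))
        out_handle.write(f">{name}\n")
        for i in range(0, len(masked), width):
            out_handle.write(masked[i:i + width] + "\n")


# Small in-memory demonstration with made-up sequences.
fasta = io.StringIO(">scaf1 demo\nACGTACGT\nACGT\n>scaf2\nGGGGCCCC\n")
masks = {"scaf1": [(3, 6)], "scaf2": [(1, 2), (7, 8)]}
out = io.StringIO()
soft_mask_fasta(fasta, masks, out)
result = out.getvalue()
print(result)
```

Passing `sys.stdout` as `out_handle` would turn this into a filter whose output can be redirected to a file or a pipe.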

Bob H

Reply written 6.0 years ago by rsharris

Thanks for the (potential) confirmation and the suggestion to contact the author. I'll do that.

Reply written 6.7 years ago by Sujai Kumar

Answer written 6.7 years ago by SES (Vancouver, BC):

Soft masking refers to converting repeats to lower-case, while hard masking refers to replacing those bases with Xs (or Ns since some software won't recognize X). If you take a look at the RepeatMasker Annotation request form you can see there are options for specifying whether you want your genome lower-case masked, or hard masked with Xs or Ns. If your species is not listed there it may be necessary to do the masking yourself.
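
Editor's sketch: the difference between the two kinds of masking is easy to see on a toy string. The Python helpers below are illustrative only (the names are made up); they turn an already soft-masked sequence into a hard-masked one and report how much of it is masked.

```python
def hard_mask_from_soft(seq, mask_char="N"):
    """Replace every lowercase (soft-masked) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)


def soft_masked_fraction(seq):
    """Fraction of bases that are lowercase, i.e. soft-masked."""
    if not seq:
        return 0.0
    return sum(base.islower() for base in seq) / len(seq)


soft = "ACgtacGT"  # toy sequence: positions 3-6 are soft-masked
print(hard_mask_from_soft(soft))       # same length, masked bases become N
print(hard_mask_from_soft(soft, "X"))  # X instead of N
print(soft_masked_fraction(soft))
```

Soft masking keeps the original bases, so a tool can still read them when it chooses to; hard masking discards them.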

As for the input to LASTZ, it's not clear to me whether it needs the soft-masked genome or the file of coordinates. Either way, it should not be difficult to create the coordinates file if that is what is required.


Thanks for this. I already have masking info (from RepeatMasker). From what I can tell LASTZ wants the coordinate files if you want to do soft masking. I can do hard masking on my own by replacing with Ns but what I really want is to avoid masked regions during seed finding, and allow masked regions during seed extensions. But the coordinate file only has Start-End, no sequenceID, so I may have to concatenate the original sequence (and the masking coordinates) as Istvan Albert suggested.

Reply written 6.7 years ago by Sujai Kumar

I see, basically LASTZ is expecting one target to be represented in that mask file, not numerous sequences. Unless you could add sequences to the alignment iteratively somehow, I'm not sure what the best approach would be with LASTZ, sorry. Hopefully, the author will be more helpful.

Reply written 6.7 years ago by SES

Answer written 2.7 years ago by Martin Stervander (University of Oregon, US):

To do this, one can convert the coordinate information in the RepeatMasker output file into BED format (e.g. using awk) and then give the BED file and the assembly FASTA file to bedtools maskfasta, applying the -soft flag. E.g.:

bedtools maskfasta -soft -fi assembly.fasta -bed mask_coordinates.bed -fo assembly_softmasked.fasta
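
Editor's sketch of the conversion step, in Python rather than awk. It assumes the usual RepeatMasker .out layout (header lines, then whitespace-separated rows with the query sequence name, begin and end in the 5th, 6th and 7th columns, begin and end being one-based inclusive); check this against your own file. The function name is hypothetical. BED starts are zero-based, so the begin coordinate is reduced by one.

```python
def repeatmasker_out_to_bed(lines):
    """Yield three-column BED lines from RepeatMasker .out rows.

    Assumed columns (0-based index): 4 = query name, 5 = begin, 6 = end.
    Header and blank lines are skipped because their first field is not
    an integer score.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        if not (fields[5].isdigit() and fields[6].isdigit()):
            continue
        name = fields[4]
        begin, end = int(fields[5]), int(fields[6])
        # One-based inclusive -> zero-based half-open (BED convention).
        yield f"{name}\t{begin - 1}\t{end}"


# Made-up rows in the assumed layout.
sample = [
    "   SW   perc perc perc  query     position in query    matching   repeat        position in repeat",
    "score   div. del. ins.  sequence  begin end   (left)   repeat     class/family  begin end (left)  ID",
    "",
    "  463   1.3  0.6  1.7  scaf1    1   100  (900) +  (TTAGGG)n  Simple_repeat    1  100   (0)  1",
    "  239  20.1  3.2  0.0  scaf2  251   300  (700) C  L2a        LINE/L2       (50) 3300  3251  2",
]
bed_lines = list(repeatmasker_out_to_bed(sample))
print("\n".join(bed_lines))
```

Writing these lines to a file gives something that could serve as the mask_coordinates.bed input in the command above.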

Cheers, Martin
