Lastz Genome Alignment Tool: How Does Softmasking Work?
10.2 years ago
Sujai Kumar ▴ 270

I need to do a whole-genome alignment for a largish metazoan (~1.8 Gbp), and the sensible approach seems to be to mask the repetitive elements first so that the number of all-versus-all seed matches is reduced.

The Lastz whole genome alignment tool has a softmasking option where masked regions aren't used for finding seeds (but can be used when extending seeds to build alignments). That sounds perfect to me.

However, I can't figure out how to specify the softmasked regions. Does someone know how to do this?

The Lastz documentation at http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_mask says

Sequence Masking File

This file is used with the xmask and nmask actions in a sequence specifier. It consists of one interval per line, without sequence names. Lines beginning with a # are considered to be comments and are ignored, as are blank lines. Only the first two whitespace-delimited words in any line are interpreted as the interval; the rest of the line is ignored.
Each interval describes a region to be masked, and consists of
    <start> <end>
Locations are one-based and inclusive on both ends (i.e., they use the origin-one, closed position numbering system). Note that the masking intervals are counted along the forward strand, even if we are only aligning to the reverse complement of the query specifier (i.e. for ‑‑strand=minus).
Here is an example. If the target sequence is hg18.chr1, this would mask the 5' UTRs from several genes. Note that the third column is neither required nor interpreted by LASTZ, and acts as a comment.
     884484  884542  NM_015658
     885830  885936  NM_198317
     891740  891774  NM_032129
     925217  925333  NM_021170
     938742  938816  NM_005101
     945366  945415  NM_198576
    1016787 1016808  NM_001114103
    1017234 1017346  NM_001114103
    1041303 1041486  NM_001114103

My question: if the target and query files (for alignment) contain multiple sequences, how do you specify the masked regions when each line holds only <start> <end>? (You would also need a <sequenceid> column.)

Do you know if there is a way around this or a different tool that will do the same job with masked sequences?

10.2 years ago

I think this has to do with the way lastz works: it was designed to work with a single target.

When you give it multiple targets they get concatenated. So you may need to create masking for concatenated sequences.

I would double-check with the author on this, though. You could use the "request help via email" link and enter his email address: rsharris@bx.psu.edu


Lastz author here; I came across this thread while searching for something else. For the sake of completeness, here's the scoop on softmasking and lastz.

If you do nothing else, any lowercase nucleotide in your sequence file will be considered as soft-masked.

Additionally, lastz can perform some operations on the sequence after it loads it into memory. One of these is to apply a file of intervals, changing all the indicated bases to lowercase.

As for the masking file not recognizing sequence names, this does come from the design history of originally working only with a single sequence. Recognizing three-column masking files was on my todo list for a while. But at this point, I'm not actively making changes other than bug fixes.

The simplest workaround, rather than concatenating the whole input sequence, would be to pipe the sequence through something that does the softmasking, and then pipe it into lastz (or save it in a temporary file).

Bob H
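For later readers, here is a minimal sketch of the "pipe the sequence through something that does the softmasking" idea, not anything that ships with lastz. It assumes a hypothetical three-column interval file (`<seqid> <start> <end>`, one-based and inclusive, like the two-column lastz format with a sequence name added in front) and lowercases those intervals in a multi-sequence FASTA. The function names are made up for illustration.

```python
from collections import defaultdict
from itertools import chain


def read_intervals(lines):
    """Parse '<seqid> <start> <end>' lines (1-based, inclusive).

    Blank lines and lines starting with '#' are skipped; anything after
    the third column is ignored, mirroring the two-column lastz format.
    """
    intervals = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        seqid, start, end = line.split()[:3]
        intervals[seqid].append((int(start), int(end)))
    return intervals


def soft_mask(seq, spans):
    """Return seq with every 1-based inclusive (start, end) span lowercased."""
    bases = list(seq)
    for start, end in spans:
        # Clamp to the sequence so out-of-range intervals are harmless.
        for i in range(max(start, 1) - 1, min(end, len(bases))):
            bases[i] = bases[i].lower()
    return "".join(bases)


def mask_fasta(fasta_lines, intervals):
    """Yield header and sequence lines (one line per sequence), soft-masked."""
    header, chunks = None, []
    # The trailing ">" is a sentinel that flushes the last record.
    for line in chain(fasta_lines, [">"]):
        line = line.rstrip("\n")
        if line.startswith(">"):
            if header is not None:
                words = header.split()
                seqid = words[0] if words else ""
                yield ">" + header
                yield soft_mask("".join(chunks), intervals.get(seqid, []))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line.strip())
```

To use it as a filter, one could read the interval file with `read_intervals(open(path))`, pass `sys.stdin` to `mask_fasta`, and print each yielded line; the resulting FASTA can then be saved or fed to lastz. Each sequence is held in memory as a list of characters, which is simple but not frugal for very long scaffolds.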


Thanks for the (potential) confirmation and the suggestion to contact the author. I'll do that.

10.2 years ago
SES 8.6k

Soft masking refers to converting repeats to lower-case, while hard masking refers to replacing those bases with Xs (or Ns since some software won't recognize X). If you take a look at the RepeatMasker Annotation request form you can see there are options for specifying whether you want your genome lower-case masked, or hard masked with Xs or Ns. If your species is not listed there it may be necessary to do the masking yourself.

As for the input to LASTZ, it's not clear to me whether it requires the soft-masked genome or the file of coordinates. Either way, it should not be difficult to create the coordinates file if that is what is needed.
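To make the soft/hard distinction concrete, here is a small sketch (the function name is hypothetical) that turns a soft-masked sequence into a hard-masked one by replacing every lowercase base with N, or with X if preferred. Going the other way is not possible, since hard masking discards the original bases.

```python
def hard_mask(seq, mask_char="N"):
    """Replace every lowercase (soft-masked) base with mask_char."""
    return "".join(mask_char if base.islower() else base for base in seq)


# Soft-masked input: the repeat in the middle is lowercase.
print(hard_mask("ACGTacgtACGT"))
```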


Thanks for this. I already have masking info (from RepeatMasker). From what I can tell LASTZ wants the coordinate files if you want to do soft masking. I can do hard masking on my own by replacing with Ns but what I really want is to avoid masked regions during seed finding, and allow masked regions during seed extensions. But the coordinate file only has Start-End, no sequenceID, so I may have to concatenate the original sequence (and the masking coordinates) as Istvan Albert suggested.


I see, basically LASTZ is expecting one target to be represented in that mask file, not numerous sequences. Unless you could add sequences to the alignment iteratively somehow, I'm not sure what the best approach would be with LASTZ, sorry. Hopefully, the author will be more helpful.

6.3 years ago

To do this, one can convert the coordinate information in the RepeatMasker .out file into BED format (e.g. using awk) and then pass the BED file and the assembly FASTA file to bedtools maskfasta with the -soft flag. E.g.:

bedtools maskfasta -soft -fi assembly.fasta -bed mask_coordinates.bed -fo assembly_softmasked.fasta

Cheers, Martin
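For readers who prefer Python to awk for the conversion step, the following is a rough sketch. It assumes the standard whitespace-delimited RepeatMasker .out layout, in which data lines start with the numeric score and columns 5 to 7 hold the query sequence name and the one-based begin and end; BED uses a zero-based start, hence the subtraction. The function name is invented for illustration.

```python
def rmsk_out_to_bed(lines):
    """Yield 'seqid<TAB>start<TAB>end' BED lines from RepeatMasker .out lines."""
    for line in lines:
        fields = line.split()
        # Header and blank lines do not start with a numeric score.
        if len(fields) < 7 or not fields[0].isdigit():
            continue
        seqid, begin, end = fields[4], int(fields[5]), int(fields[6])
        # .out coordinates are 1-based inclusive; BED is 0-based, half-open.
        yield f"{seqid}\t{begin - 1}\t{end}"
```

Writing the yielded lines to a file such as mask_coordinates.bed gives the input expected by the bedtools command above.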
