I need to do a whole genome alignment on a largish metazoan (~1.8 GBp), and the sensible approach seems to be to mask the repetitive alements before doing that so that the number of all versus all seed matches is reduced.
The Lastz whole genome alignment tool has a softmasking option where masked regions aren't used for finding seeds (but can be used when extending seeds to build alignments). That sounds perfect to me.
However, I can't figure out how to specify the softmasked regions. Does someone know how to do this?
The Lastz documentation at http://www.bx.psu.edu/miller_lab/dist/README.lastz-1.02.00/README.lastz-1.02.00a.html#fmt_mask says
Sequence Masking File This file is used with the xmask and nmask actions in a sequence specifier. It consists of one interval per line, without sequence names. Lines beginning with a # are considered to be comments and are ignored, as are blank lines. Only the first two whitespace-delimited words in any line are interpreted as the interval; the rest of the line is ignored. Each interval describes a region to be masked, and consists of <start> <end> Locations are one-based and inclusive on both ends (i.e., they use the origin-one, closed position numbering system). Note that the masking intervals are counted along the forward strand, even if we are only aligning to the reverse complement of the query specifier (i.e. for ‑‑strand=minus). Here is an example. If the target sequence is hg18.chr1, this would mask the 5' UTRs from several genes. Note that the third column is neither required nor interpreted by LASTZ, and acts as a comment. 884484 884542 NM_015658 885830 885936 NM_198317 891740 891774 NM_032129 925217 925333 NM_021170 938742 938816 NM_005101 945366 945415 NM_198576 1016787 1016808 NM_001114103 1017234 1017346 NM_001114103 1041303 1041486 NM_001114103
My question is - if the Target and Query files (for alignment) have multiple sequences, then how do you specify the masked regions if there is only a <start><end> specification? (you'd also need a
Do you know if there is a way around this or a different tool that will do the same job with masked sequences?