How can I convert a MAF of HAL (from cactus-hal2maf) to Multiple FASTA?
1
0
Entering edit mode
11 weeks ago
Mimi33 ▴ 30

(I edited some typo.)

Hi, I am planning to using GERP++ with a whole genome alignment. I want to get a multiple alignment FASTA file from my results from Cactus, a whole-genome aligner (https://github.com/ComparativeGenomicsToolkit/cactus/tree/master). I seems to have a correct HAL file with cactus from multiple genome FASTA files, then I convert it to MAF file with cactus-hal2maf.

singularity exec cactus.sif cactus-hal2maf ./jobStore ${hal} ${maf} \ --refGenome REF --noAncestors --chunkSize 1000000 --batchCores 4 \ --filterGapCausingDupes --refSequence ${chr}

NOTE: Although cactus has hal2fasta, but it only extracts original sequences without alignment information.

Next, I attempted to split the MAF file by chromosome using msa_view, but I encountered an error. Other programs I've tried have not been successful either.

The issue seems to be a mismatch between the reference sequences in my MAF file and the original genome:

msa_view cactus_alignment/Chromosome1.maf -f -G 1 --refseq Chromosome1.fasta > Chromosome1.maf.fasta

I received the following error:

Splitting 1 files by target sequence -- ignoring first argument dummy.bed
cactus_alignment/Chromosome1.maf
ERROR: character 'T' at position 108901 of reference sequence does not match character 'a' given in MAF file.

How can I convert my alignment to a multiple FASTA?

Thanks in advance for your help!

MAF cactus • 520 views
ADD COMMENT
3
Entering edit mode
10 weeks ago
Mimi33 ▴ 30

I solved the problem by myself.

  1. I set --dupemode single in cactus-hal2maf. all and consensus did not worked (I wonder it would caused by how the program treats self-alignment).
  2. msa_view may prefer to read a chromosome FASTA from hal2fasta to an original one. (I don't know why.) Additionally, gerpcol can only read the oneline FASTA.
  3. gerpcol cannot read extra characters which are not alphabets and "-" (ex. msa_view cannot read "-"). So I replaced * into - (I found gerpcol can read the MAF file if without -a option.)

Set dupeMode option as 'single'

The default option is as 'all'

cactus-hal2maf  ./js_${chr} ${hal} ${maf} \
 --refGenome REF --noAncestors --chunkSize 1000000 --batchCores 4 \
 --filterGapCausingDupes --refSequence ${chr} --dupeMode single \
 --rootGenome REF

Extract FASTA from the same hal

hal2fasta  --outFaPath ${chr_fasta} --sequence ${chr} ${hal} REF

Make FASTA per oneline

msa_view ${maf} -f -G 1 --refseq ${chr_fasta} --unmask | seqkit seq -w 0 > ${maf}.fasta

Standardize the gap character to "-"

gerpcol can only read alphabets and "-"

sed -e 's/\*/-/g' ${maf}.fasta > ${maf}.gap-replaced.fasta
ADD COMMENT

Login before adding your answer.

Traffic: 2749 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6