R: apex Multigene alignment problems "DNA sequences in list not of the same length"
5.6 years ago
samche42 • 0


I'm trying to create a single phylogenetic tree from a concatenation of 10 000 individual gene alignments.

I used OMA to find 10 000 orthologous genes from 23 bacterial genomes. I then used muscle to align sequences in multifasta files of each orthologous gene set using the following:

muscle -in ${protein_file}.fa -out ${protein_file}.fasta

where protein_file represents the multifasta file of each orthologous group generated by OMA.

I then tried to input the alignments into R using the following:

library(apex)
x <- read.multiFASTA(dir(pattern=".fasta"))

And was greeted with an error reading:

Error in as.matrix.DNAbin(X[[i]], ...) : DNA sequences in list not of the same length.
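
One way to narrow this down is to check which alignment files actually contain sequences of unequal length before loading them into R. The following is a minimal sketch, assuming the aligned files are plain FASTA (possibly with wrapped sequence lines); check_alignment_lengths is a hypothetical helper name, not part of muscle or apex.

```shell
# Report alignment files whose sequences differ in length.
# check_alignment_lengths is a hypothetical helper, not part of muscle or apex.
check_alignment_lengths() {
    for f in "$@"; do
        awk -v file="$f" '
            # Header line: record the length of the previous sequence, start a new one.
            /^>/ { if (seen) lens[len]++; len = 0; seen = 1; next }
            # Sequence line: strip whitespace and accumulate (handles wrapped lines).
            { gsub(/[ \t\r]/, ""); len += length($0) }
            END {
                if (seen) lens[len]++
                n = 0
                for (l in lens) n++
                # More than one distinct length means the file is not a valid alignment.
                if (n > 1) print file ": sequences of unequal length"
            }
        ' "$f"
    done
}
```

Calling it as check_alignment_lengths *.fasta prints one line per offending file and nothing for files in which every sequence has the same length.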

I tried replacing "-" in the alignments with an X. I also tried trimming the sequences with trimal:

trimal -in ${protein_file}.aln -out ${protein_file}.fasta -fasta -automated1 -resoverlap 0.75 -seqoverlap 80

(I created the .aln files using muscle and used them as input for trimal.)

I also tried variations of the trimming without the -resoverlap 0.75 or -seqoverlap 80 flags, but I was still stuck with the same error.

I then considered just using muscle to create an individual tree for each ortholog and processing them with ASTRAL, but I couldn't figure out how to concatenate all the trees into a single .tree file (the required input for ASTRAL).
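
For the concatenation step, a minimal sketch is shown here. It assumes each per-gene file holds exactly one Newick tree and that ASTRAL accepts a plain text file with one Newick tree per line; combine_trees and the file names in the usage note are hypothetical.

```shell
# Concatenate per-gene Newick tree files into one file, one tree per line.
# combine_trees is a hypothetical helper; it assumes one tree per input file.
combine_trees() {
    out="$1"; shift
    : > "$out"                       # create or truncate the output file
    for t in "$@"; do
        # Remove line breaks inside the tree so it occupies a single line.
        tr -d '\n\r' < "$t" >> "$out"
        # Terminate the tree so the next one starts on a new line.
        printf '\n' >> "$out"
    done
}
```

For example, combine_trees all_genes.tree gene_trees/*.tree would write every tree in gene_trees/ to all_genes.tree. Writing the output outside the input directory avoids the combined file being picked up by the glob on a second run.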

I'm tearing my hair out over what I suspect to be a trivial mistake somewhere but can't figure out where I'm tripping up. Any help would be hugely appreciated!!

Thank you in advance :)

R apex multigene alignment species tree astral • 2.2k views

