Question: R: apex Multigene alignment problems "DNA sequences in list not of the same length"
22 months ago, samche42 wrote:


I'm trying to create a single phylogenetic tree from a concatenation of 10 000 individual gene alignments.

I used OMA to find 10 000 orthologous genes from 23 bacterial genomes. I then used muscle to align sequences in multifasta files of each orthologous gene set using the following:

muscle -in ${protein_file}.fa -out ${protein_file}.fasta

where protein_file represents the multifasta file of each orthologous group generated by OMA.

I then tried to input the alignments into R using the following:

library(apex)
x <- read.multiFASTA(dir(pattern=".fasta"))

And was greeted with an error reading:

Error in as.matrix.DNAbin(X[[i]], ...) : DNA sequences in list not of the same length.

I tried replacing "-" in the alignments with an X. I also tried trimming the sequences with trimal:

trimal -in ${protein_file}.aln -out ${protein_file}.fasta -fasta -automated1 -resoverlap 0.75 -seqoverlap 80

(I created the .aln files with muscle and used them as input for trimal.)

I also tried variations of the trimming without the -resoverlap 0.75 or -seqoverlap 80 flags, but I was still stuck with the same error.
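
Since the error says the sequences in at least one file differ in length, one way to narrow this down is to check each alignment file before loading it into R. The following is a minimal sketch, assuming the alignments are FASTA files ending in .fasta in the current directory; check_alignment is a hypothetical helper name, not part of muscle, trimal, or apex.

```shell
#!/bin/sh
# check_alignment is a hypothetical helper, not part of muscle, trimal, or apex.
# It prints a message and returns 1 when the sequences in a FASTA file differ in length.
check_alignment() {
    awk -v file="$1" '
        # Fold the record that just ended into the running min/max.
        function record() {
            if (!seen) return
            if (n == 0 || len < min) min = len
            if (n == 0 || len > max) max = len
            n++
        }
        /^>/ { record(); seen = 1; len = 0; next }       # header line starts a new record
        { gsub(/[ \t\r]/, ""); len += length($0) }       # sequence line, may be wrapped
        END {
            record()
            if (n > 0 && min != max) {
                printf "%s: %d sequences, lengths range from %d to %d\n", file, n, min, max
                exit 1
            }
        }
    ' "$1"
}

# Check every .fasta file in the current directory (assumed location of the alignments).
bad=0
for f in *.fasta; do
    [ -e "$f" ] || continue
    check_alignment "$f" || bad=$((bad + 1))
done
echo "$bad file(s) with unequal sequence lengths"
```

Any file this reports is a candidate for the one that triggers the error. If no file is reported, the cause probably lies elsewhere, for example in which files the pattern passed to dir() actually matches.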

I then considered just using muscle to create an individual tree for each ortholog and processing the trees with ASTRAL, but I couldn't figure out how to concatenate all the trees into a single .tree file (the required input for ASTRAL).
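
For the ASTRAL route, the input is a plain text file with one Newick tree per line, so the per-gene trees can be joined with standard shell tools. The following is a rough sketch, assuming each gene tree sits in its own file and each file holds exactly one tree; combine_trees, the gene_trees directory, the .nwk extension, and the output name are all hypothetical.

```shell
#!/bin/sh
# combine_trees is a hypothetical helper: first argument is the output file,
# the remaining arguments are Newick files that each hold exactly one tree.
combine_trees() {
    out=$1
    shift
    : > "$out"                              # start with an empty output file
    for t in "$@"; do
        tr -d '\n\r' < "$t" >> "$out"       # remove line breaks inside the tree
        printf '\n' >> "$out"               # end the tree with a single newline
    done
}

# Assumed layout: one tree per file under gene_trees/, each ending in .nwk.
set -- gene_trees/*.nwk
if [ -e "$1" ]; then
    combine_trees all_genes.trees "$@"
fi

# The combined file can then be passed to ASTRAL, for example:
#   java -jar astral.jar -i all_genes.trees -o species.tre
# (the jar file name depends on the installed ASTRAL version)
```

Removing the line breaks first keeps a tree that was written across several lines on a single line in the combined file.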

I'm tearing my hair out over what I suspect to be a trivial mistake somewhere but can't figure out where I'm tripping up. Any help would be hugely appreciated!!

Thank you in advance :)
