R: apex Multigene alignment problems "DNA sequences in list not of the same length"
5.6 years ago
samche42 • 0

Hi!

I'm trying to create a single phylogenetic tree from a concatenation of 10 000 individual gene alignments.

I used OMA to find 10 000 orthologous genes from 23 bacterial genomes. I then used muscle to align sequences in multifasta files of each orthologous gene set using the following:

muscle -in ${protein_file}.fa -out ${protein_file}.fasta

where protein_file represents the multifasta file for each of the orthologous groups generated by OMA.

I then tried to input the alignments into R using the following:

library(apex)
x <- read.multiFASTA(dir(pattern=".fasta"))

And was greeted with an error reading:

Error in as.matrix.DNAbin(X[[i]], ...) : DNA sequences in list not of the same length.
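
Since the error complains about sequence lengths, one way to narrow it down is to check each alignment file directly, outside R. The following is only a minimal sketch: the helper names seq_lengths and is_aligned are made up for illustration, and it assumes standard FASTA files with a .fasta extension in the current directory. It counts every character on the sequence lines (gaps included), so it works the same for protein and nucleotide alignments.

```shell
#!/bin/sh
# seq_lengths FILE: print the length of every sequence in a FASTA file,
# one per line, in file order. Wrapped sequence lines are summed.
seq_lengths() {
  awk '
    /^>/ { if (seen) print len; seen = 1; len = 0; next }
    { gsub(/[[:space:]]/, ""); len += length($0) }
    END { if (seen) print len }
  ' "$1"
}

# is_aligned FILE: succeed only if all sequences in FILE share one length.
is_aligned() {
  [ "$(seq_lengths "$1" | sort -u | wc -l | tr -d ' ')" -le 1 ]
}

# Flag every .fasta file in the current directory with unequal lengths.
for f in *.fasta; do
  [ -e "$f" ] || continue
  is_aligned "$f" || echo "unequal lengths: $f"
done
```

If no file is flagged, the alignments themselves are probably fine and the problem more likely lies in which files are being read or how they are interpreted.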

I tried replacing "-" in the alignments with an X. I also tried trimming the sequences with trimal:

trimal -in ${protein_file}.aln -out ${protein_file}.fasta -fasta -automated1 -resoverlap 0.75 -seqoverlap 80

(I created the .aln files using muscle and used those as input for trimal.)

I also tried variations of the trimming without the -resoverlap 0.75 or -seqoverlap 80 flags, but I was still stuck with the same error.

I then considered just using muscle to create individual trees for each ortholog and processing them with ASTRAL, but I couldn't figure out how to concatenate all the trees into a single .tree file (the required input for ASTRAL).
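
For the tree route, ASTRAL reads a plain text file with one Newick gene tree per line, so the individual tree files can simply be joined. The following is a sketch under assumptions: the function name combine_trees and the file names in the usage comment are hypothetical, and each input file is assumed to hold exactly one Newick tree.

```shell
#!/bin/sh
# combine_trees OUT FILE...: write the tree from each FILE onto its own
# line in OUT. Assumes one Newick tree per input file.
combine_trees() {
  out=$1
  shift
  : > "$out"                          # start with an empty output file
  for t in "$@"; do
    tr -d '\n\r' < "$t" >> "$out"     # join a tree wrapped over several lines
    printf '\n' >> "$out"             # end each tree with a single newline
  done
}

# Example call (hypothetical file names):
#   combine_trees all_genes.tree gene_trees/*.nwk
```

The output file should live outside the set of input files, otherwise the glob would pick it up as a tree.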

I'm tearing my hair out over what I suspect to be a trivial mistake somewhere but can't figure out where I'm tripping up. Any help would be hugely appreciated!!

Thank you in advance :)

R apex multigene alignment species tree astral • 2.2k views