FastTree error while constructing tree
1
1
Entering edit mode
2.2 years ago
Moses ▴ 130

Hey All,

I am trying to infer a phylogeny from a multiple sequence alignment using FastTree program, however the program is giving me an error when I run it over the multiple sequence alignment and I can not figure out what the error is saying (not really that informative).

My original alignment has a total of 3,307 species (sequences), and 95,678 columns. I was able to infer a phylogeny successfully using FastTree over this alignment in about 16 hours. However upon checking my alignment I noticed it is highly fragmented and of very poor quality because I do not do any post processing after I concatenate my individual alignments obtained from MUSCLE.

I tried to post-process the alignment file using Gblocks however I think the default parameters are too conservative and it is resulting in an empty alignment file. I made my own python script to improve the quality of this alignment, and this is where the problem starts I think:

basically what I do in my script is I take in the MSA and check 3 things:

1) I remove all the columns that have gaps in more than half of the sequences.
2) I remove all the columns that dont have a consensus amino acid in at least 25% of the sequences
3) finally I remove all the rows (i.e. the species) that have gaps in more than half of their total sequence length


After these filtering criteria the script outputs a new multiple sequence alignment with the suffix '_clean_columns_clean_rows.fasta'. The new multiple sequence alignment now has 3,294 rows (i.e. species) and 30,634 columns.

Now I issue a FastTree command to infer a phylogeny over this new alignment with the following paramters:

FastTree -lg -pseudo -spr 4 -mlacc 2 -slownni supermat_clean_columns_clean_rows.fasta > supermat_clean_columns_clean_rows_FastTree.tree


I use the -pseudo -spr 4 -mlacc 2 -slownni parameters on purpose to perform a rigorous search and to get a tree as accurate as possible using the heuristic search. The program after running sometime it starts prtinting values for eigen vecters such as this:

eigeninv = c( 1.0797265, -0.1849245, -2.2736955, -1.3477029, ....


Followed by this error which I cant understand what it means:

# Transform into matrices and compute un-rotated vectors for profiles A and B
codeFreq = matrix(codeFreq,nrow=20);
eigeninv = matrix(eigeninv,nrow=20);
unrotA = stat * (eigeninv %*% fA)
unrotB = stat * (eigeninv %*% fB)
# End of R block
FastTree: /usr2/people/mprice/Genomics/html/fasttree/FastTree.c:5383: PairLogLk: Assertion lkAB > 0' failed.
Aborted


To reproduce this result I am sharing links for both the original alignment and the post-processed alignment using my python script:

original alignment file "supermat.fasta" ( ~300 MB): https://iu.box.com/s/xx0mbvgyh50e1isw7nh8lylrejg4pe3x

post-processed alignment file "supermat_clean_columns_clean_rows.fasta" (96 MB) : https://iu.box.com/s/b8i97a39bp8pnna0hhltoa0xm7xqwqz3

I am also pasting the python script that I wrote to perform these post processing here Any help is greatly appreciated (many thanks for your time!!):

from Bio import SeqIO
import numpy as np
import copy
from collections import Counter

msa_in_f = 'allBinsWithArchaea_FastTree_2/supermat.fasta'

prefix_f, suffix_f = msa_in_f.rsplit('.')[0], msa_in_f.rsplit('.')[1]

msa_seqs = list(SeqIO.parse(msa_in_f, 'fasta'))

headers = [stritem.id) for item in msa_seqs]
seqs = [str(item.seq) for item in msa_seqs]

concat_seqs = ''.join(seqs)

clean_mat = copy.deepcopy(matrix)
n_rows, n_cols = matrix.shape

#remove columns having more than 50% gaps
remove_cols = list()
for i in range(n_cols):
col = str(''.join(matrix[:, i]))
if col.count('-') > round(n_rows/2):
remove_cols.append(i)

clean_mat = np.delete(matrix, remove_cols, 1)

#remove columns where the consensus amino acid is less than 25%
remove_cols = list()
for i in range(clean_mat.shape[1]):
col = str(''.join(clean_mat[:, i]))
char_counter = Counter(col)
del char_counter['-']
if max(char_counter.values())/n_rows < 0.25:
remove_cols.append(i)

clean_mat = np.delete(clean_mat, remove_cols, 1)

with open(prefix_f + '_clean_columns.'+ suffix_f, 'w') as out_f:
for i in range(clean_mat.shape[0]):
row = str(''.join(clean_mat[i, :]))
out_f.write(row+'\n')

#remove rows where the the alignment is more than 50% gaps
new_n_rows, new_n_cols = clean_mat.shape
remove_rows = list()
for i in range(clean_mat.shape[0]):
row = str(''.join(clean_mat[i, :]))
if row.count('-') > new_n_cols/2:
remove_rows.append(i)
clean_mat_reduced_rows = np.delete(clean_mat, remove_rows, 0)

with open(prefix_f + '_clean_columns_clean_rows.'+ suffix_f, 'w') as out_f:
for i in range(clean_mat_reduced_rows.shape[0]):
row = str(''.join(clean_mat_reduced_rows[i, :]))
out_f.write(row+'\n')

print('original input matrix shape,', matrix.shape)
print('clean columns matrix shape', clean_mat.shape)
print('clean columns, clean rows matrix shape', clean_mat_reduced_rows.shape)

with open(prefix_f + 'reduction_stats.txt', 'w') as out_f:
out_f.write('original input matrix shape,' + str( matrix.shape)+'\n')
out_f.write('clean columns matrix shape' + str(clean_mat.shape)+'\n')
out_f.write('clean columns, clean rows matrix shape' + str( clean_mat_reduced_rows.shape)+'\n')

FastTree phylogeny Multiple Sequence Alignment • 1.3k views
0
Entering edit mode

It is very likely that the error is caused by your alignment manipulations. I suggest you try trimAl which is a tool for alignment trimming. It can do at least 2 out of 3 things you want.

0
Entering edit mode

thank you for your suggestion. I don't understand it however since after manipulation every row has the same number of columns and hence it is a standard MSA file, what is the actual reason for it breaking the program?

0
Entering edit mode

UPDATES! : I get the same error after using trimAL software to post-process the alignment

0
Entering edit mode

Separately from my previous answer, I don't understand the logic of doing 3000+ species unless you are trying to create some kind of authoritative resource. Removing redundant sequences at 80-90% identity will almost certainly cut it down to no more than 20% of your starting number. Does it really matter to you whether you have 20 Staphylococcus species on the same branch as opposed to 1-2 representative sequences? Not many people will look at the tree with 3000+ species without collapsing it because it is difficult to do so.

Along the same lines, it is unlikely that you will get better tree reconstruction from 90K+ poorly aligned columns than you would from 10-20 K well-aligned columns. I think you will make your life easier if you inspect individual alignments and remove poor ones before concatenation.

0
Entering edit mode

yes I want to keep species as much as possible I am interested in strain level differences as well that is the main reason. I am working on reducing the 90k column alignment, and using those 3 criteria it's reduced to 30k. But I will try and refine the alignment using that software you recommended.

I just doesnt really make much sense to me that FastTree was ran successfully on the original alignment of dimensions 3k x 90k and breaks on the smaller sized matrix.

0
Entering edit mode
6 weeks ago
jscience • 0

I don't fully understand this error, but I see that it occurs during the Maximum Likelihood (ML) step of FastTree's process, regardless of my alignment preprocessing. I was only able to avoid this error by turning off the ML step, by specifying the -noml option. Or, by limiting the ML NNI steps to 4 (just before the error occurs) with the option -mlnni`.

0
Entering edit mode

You may also be able to get past this error by using the WAG matrix instead of LG, and in that case you could leave the other parameters as you originally intended.