I am trying to infer a phylogeny from a multiple sequence alignment using FastTree program, however the program is giving me an error when I run it over the multiple sequence alignment and I can not figure out what the error is saying (not really that informative).
My original alignment has a total of 3,307 species (sequences), and 95,678 columns. I was able to infer a phylogeny successfully using FastTree over this alignment in about 16 hours. However upon checking my alignment I noticed it is highly fragmented and of very poor quality because I do not do any post processing after I concatenate my individual alignments obtained from MUSCLE.
I tried to post-process the alignment file using Gblocks however I think the default parameters are too conservative and it is resulting in an empty alignment file. I made my own python script to improve the quality of this alignment, and this is where the problem starts I think:
basically what I do in my script is I take in the MSA and check 3 things:
1) I remove all the columns that have gaps in more than half of the sequences. 2) I remove all the columns that dont have a consensus amino acid in at least 25% of the sequences 3) finally I remove all the rows (i.e. the species) that have gaps in more than half of their total sequence length
After these filtering criteria the script outputs a new multiple sequence alignment with the suffix '_clean_columns_clean_rows.fasta'. The new multiple sequence alignment now has 3,294 rows (i.e. species) and 30,634 columns.
Now I issue a FastTree command to infer a phylogeny over this new alignment with the following paramters:
FastTree -lg -pseudo -spr 4 -mlacc 2 -slownni supermat_clean_columns_clean_rows.fasta > supermat_clean_columns_clean_rows_FastTree.tree
I use the -pseudo -spr 4 -mlacc 2 -slownni parameters on purpose to perform a rigorous search and to get a tree as accurate as possible using the heuristic search. The program after running sometime it starts prtinting values for eigen vecters such as this:
eigeninv = c( 1.0797265, -0.1849245, -2.2736955, -1.3477029, ....
Followed by this error which I cant understand what it means:
# Transform into matrices and compute un-rotated vectors for profiles A and B codeFreq = matrix(codeFreq,nrow=20); eigeninv = matrix(eigeninv,nrow=20); unrotA = stat * (eigeninv %*% fA) unrotB = stat * (eigeninv %*% fB) # End of R block FastTree: /usr2/people/mprice/Genomics/html/fasttree/FastTree.c:5383: PairLogLk: Assertion `lkAB > 0' failed. Aborted
To reproduce this result I am sharing links for both the original alignment and the post-processed alignment using my python script:
original alignment file "supermat.fasta" ( ~300 MB): https://iu.box.com/s/xx0mbvgyh50e1isw7nh8lylrejg4pe3x
post-processed alignment file "supermat_clean_columns_clean_rows.fasta" (96 MB) : https://iu.box.com/s/b8i97a39bp8pnna0hhltoa0xm7xqwqz3
I am also pasting the python script that I wrote to perform these post processing here Any help is greatly appreciated (many thanks for your time!!):
from Bio import SeqIO import numpy as np import copy from collections import Counter msa_in_f = 'allBinsWithArchaea_FastTree_2/supermat.fasta' prefix_f, suffix_f = msa_in_f.rsplit('.'), msa_in_f.rsplit('.') msa_seqs = list(SeqIO.parse(msa_in_f, 'fasta')) headers = [stritem.id) for item in msa_seqs] seqs = [str(item.seq) for item in msa_seqs] np_headers = np.array(headers) concat_seqs = ''.join(seqs) matrix = np.array(list(concat_seqs), dtype=str).reshape(len(headers), len(seqs)) clean_mat = copy.deepcopy(matrix) n_rows, n_cols = matrix.shape #remove columns having more than 50% gaps remove_cols = list() for i in range(n_cols): col = str(''.join(matrix[:, i])) if col.count('-') > round(n_rows/2): remove_cols.append(i) clean_mat = np.delete(matrix, remove_cols, 1) #remove columns where the consensus amino acid is less than 25% remove_cols = list() for i in range(clean_mat.shape): col = str(''.join(clean_mat[:, i])) char_counter = Counter(col) del char_counter['-'] if max(char_counter.values())/n_rows < 0.25: remove_cols.append(i) clean_mat = np.delete(clean_mat, remove_cols, 1) with open(prefix_f + '_clean_columns.'+ suffix_f, 'w') as out_f: for i in range(clean_mat.shape): row = str(''.join(clean_mat[i, :])) out_f.write('>'+str(headers[i])+'\n') out_f.write(row+'\n') #remove rows where the the alignment is more than 50% gaps new_n_rows, new_n_cols = clean_mat.shape remove_rows = list() for i in range(clean_mat.shape): row = str(''.join(clean_mat[i, :])) if row.count('-') > new_n_cols/2: remove_rows.append(i) clean_mat_reduced_rows = np.delete(clean_mat, remove_rows, 0) np_headers_reduced_rows = np.delete(np_headers, remove_rows) with open(prefix_f + '_clean_columns_clean_rows.'+ suffix_f, 'w') as out_f: for i in range(clean_mat_reduced_rows.shape): row = str(''.join(clean_mat_reduced_rows[i, :])) out_f.write('>'+str(np_headers_reduced_rows[i])+'\n') out_f.write(row+'\n') print('original input matrix shape,', matrix.shape) print('clean columns matrix shape', clean_mat.shape) print('clean columns, clean rows matrix shape', clean_mat_reduced_rows.shape) with open(prefix_f + 'reduction_stats.txt', 'w') as out_f: out_f.write('original input matrix shape,' + str( matrix.shape)+'\n') out_f.write('clean columns matrix shape' + str(clean_mat.shape)+'\n') out_f.write('clean columns, clean rows matrix shape' + str( clean_mat_reduced_rows.shape)+'\n')