Hi stranger,
I have a data set consisting of sequencing results from a phage display experiment. I have done a few things with it; the steps most relevant to my problem are: translating the sequences with NNK codon usage, collecting the translated sequences in a list, and converting this list into a list of ProteinSequence objects from biotite.sequence. With this list I tried to run a multiple sequence alignment with MAFFT, using biotite's biotite.application.mafft.MafftApp. I chose MAFFT because my data set contains many identical sequences, and I read here that MAFFT is suitable for such a task. The sequences are short (7 amino acids long) and the data set contains about 1.1 million sequences in total. You can find the untranslated data set here. For the translated sequences I can only give you a .xlsx file.

Using biotite and MAFFT I get this error: TreeError: The tree's indices are out of range. I am kind of clueless about what to do with this information.

With the MSA I want to create sequence logos and do a bit of clustering, to see which sequences might be related, and that good stuff. I am also a beginner with coding and bioinformatics in general, so please don't be mad if I made obvious mistakes, but it would be cool if you could point out what I can do better :)
Here is the relevant Python code:
This converts every sequence in the sequence list into the biotite sequence type that is needed.
import os

import matplotlib.pyplot as plt
import biotite.sequence as seq
import biotite.sequence.graphics as graphics
from biotite.application.mafft import MafftApp


def list_to_biotite_seq(translated_list):
    """Convert a list of translated sequences into a list of biotite seq.ProteinSequence objects."""
    aa_seqs = []
    for aa_seq in translated_list:
        biotite_seq = seq.ProteinSequence(aa_seq)
        aa_seqs.append(biotite_seq)
    return aa_seqs
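Because the data set contains many identical sequences, the list could be reduced to its unique peptides before the conversion, so that far fewer than 1.1 million sequences reach the aligner. The following is only a minimal sketch under assumptions: it assumes the translated sequences are plain Python strings, the helper name collapse_duplicates is hypothetical (it is not part of biotite), and it is not claimed to resolve the TreeError.

```python
from collections import Counter


def collapse_duplicates(translated_list):
    """Return the unique peptides and a Counter with their occurrences.

    Hypothetical helper for illustration; not part of biotite.
    """
    counts = Counter(translated_list)
    # most_common() orders the peptides from most to least frequent
    unique_peptides = [peptide for peptide, _ in counts.most_common()]
    return unique_peptides, counts


# Toy input standing in for the real list of translated inserts
example = ["ACDEFGH", "ACDEFGH", "KLMNPQR", "ACDEFGH"]
unique_peptides, counts = collapse_duplicates(example)
print(unique_peptides)
print(counts["ACDEFGH"])
```

The unique peptides could then be passed to list_to_biotite_seq, while the counts are kept as weights for later steps such as the sequence logo.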
And this is the part that throws the error. To be more precise, app.join() fails.
try:
    from isolate_translate import translate_nnk_lib
except ImportError as i_error:
    print("\n", i_error)
else:
    save_path = result_directory
    save_file_logo1 = input("\nGive sequence logo a name and save as .png, .jpg or .pdf, "
                            "for example: 'example_logo.pdf'.")
    complete_path_logo = os.path.join(save_path, save_file_logo1)
    # create a list of biotite.ProteinSequences from list of translated inserts
    # this works
    nnk_seq = list_to_biotite_seq(translate_nnk_lib)
    # invoke MSA algorithm
    app = MafftApp(nnk_seq, bin_path="/usr/bin/mafft")
    app.start()
    app.join()
    alignment = app.get_alignment()
    # create sequence logo from alignment
    fig_logo = plt.figure(figsize=(8.0, 1.5))
    ax = fig_logo.add_subplot(111)
    profile = seq.SequenceProfile.from_alignment(alignment)
    graphics.plot_sequence_logo(ax, profile)
    ax.set_title("Sequence logo for library with NNK usage", fontsize=18)
    ax.set_xlabel("Amino acid position in library", fontsize=18)
    ax.set_ylabel("Bits", fontsize=18)
    fig_logo.tight_layout()
    fig_logo.savefig(complete_path_logo, transparent=True)
This is the traceback I get:
Traceback (most recent call last):
  File "/home/user/anaconda3/envs/statistical_analysis_test/lib/python3.9/site-packages/biotite/application/localapp.py", line 239, in join
    self.evaluate()
  File "/home/user/anaconda3/envs/statistical_analysis_test/lib/python3.9/site-packages/biotite/application/mafft/app.py", line 91, in evaluate
    self._tree = Tree.from_newick(newick)
  File "src/biotite/sequence/phylo/tree.pyx", line 324, in biotite.sequence.phylo.tree.Tree.from_newick
  File "src/biotite/sequence/phylo/tree.pyx", line 94, in biotite.sequence.phylo.tree.Tree.__init__
biotite.sequence.phylo.tree.TreeError: The tree's indices are out of range
I did a bit of digging and found that the Tree class in biotite.sequence.phylo.tree raises that error, but I still don't understand what exactly it means. I already tested Clustal Omega, which also failed. Am I doing the right thing, or should I take a different approach?
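One different approach that might be worth considering: since every insert has the same length (7 amino acids), the per-position amino acid counts needed for a sequence logo can be computed without any alignment. The sketch below only illustrates that idea under the assumption that no insertions or deletions have to be modelled. The function name position_counts and its weights parameter are hypothetical, and it does not replace an MSA if the motif can shift between positions.

```python
from collections import Counter


def position_counts(peptides, weights=None):
    """Count amino acids per position for peptides of equal length.

    Hypothetical helper for illustration; `weights` can hold the number
    of times each (unique) peptide was observed.
    """
    lengths = {len(peptide) for peptide in peptides}
    if len(lengths) != 1:
        raise ValueError("all peptides must have the same length")
    length = lengths.pop()
    if weights is None:
        weights = [1] * len(peptides)
    # one Counter per position (column) of the library
    columns = [Counter() for _ in range(length)]
    for peptide, weight in zip(peptides, weights):
        for position, residue in enumerate(peptide):
            columns[position][residue] += weight
    return columns


# Toy 3-mers standing in for the real 7-mer inserts
toy_peptides = ["ACD", "ACE", "GCD"]
columns = position_counts(toy_peptides)
for position, column in enumerate(columns):
    print(position, dict(column))
```

Each Counter in the returned list corresponds to one column of the library, so the result has the same shape as the symbol counts of an ungapped alignment.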