Hi everyone, I recently finished my bachelor's (B.S. in Biochem Molec Bio, minors in CompSci and Chem) and I am working on a research project that might look good as a portfolio piece for a masters/PhD application. I am very new to bioinformatics. The professor I reached out to has several assembled bacteriophage genomes that she is going to give me access to. They all showed highly lytic activity towards several clinical isolates of their respective host species (about 15 phages for one host, 4 phages for another, a few other phages for a third host). She mainly wants me to look for markers of lysogeny in these phages. I'm going to run them through the Sphae pipeline, which will give me annotated gbk files and checks for integrases, virulence factors, AMR genes, defense genes, and CRISPR spacers. (https://academic.oup.com/bioinformaticsadvances/article/5/1/vbaf004/7959522) It also gives me a lot of information from Phynteny about possible functional annotations of hypothetical proteins in tsv files, PHOLD data in tsv files, pharokka annotation, and what I believe to be ProstT5 embeddings in a .pt file.
This is where I'm stuck. My professor says "What you need to do to search for integrase, excisionase and repressor/antirepressor proteins and, if present, they likely are lysogenic ones and we need to drop them from our research."
Sphae automatically looks for integrases and excisionases. However, for bacteriophages the repressor/antirepressor systems related to lysogeny and prophage activation are highly variable, and I am not sure how to go about this. One idea I had was to create a "mini database" of known repressors and antirepressors related to lysogeny for phages of my specific host bacteria, and then BLAST them on the command line against the proteins output by Sphae. I would make this database by reading a ton of research papers and collecting specific sequences related to phages of my host species. But I am not sure if this is common practice or reasonable considering the steps already taken by Sphae, if it would be redundant, or if I am doing this the hard way when there is already a tool available.
I also was wondering if there would be any merit to investigating the ProstT5 embeddings. I was going to also try and see if the embeddings showed any relationships between hypothetical proteins that phynteny marked as "lytic" based on synteny. But again I don't know if this would be worth doing, I'm completely out of my depth here.
Thank you if you read this far. I really appreciate anyone putting their time towards helping me. I would be so thankful for any ideas of research questions I should pursue or opinions on how I should proceed with the data I currently have.
Finding genes by similarity in a few small genomes is an easy task in bioinformatics, but you must make sure that you are asking the right research question first. For example, I cannot make sense of this:
And I think posing a proper research question is the most important thing and warrants a proper discussion with your supervisor. E.g.: Why would you drop genomes that contain those genes? Don't all phages need those lysis proteins? I'm not an expert in phage biology, but if somebody came to me with this question, I'd a priori assume they are asking the wrong question unless they convinced me of the opposite (I'd assume that anyway, out of experience). Usually, in the course of the resulting discussion, we both figure out a proper initial hypothesis and what is feasible.
Strictly lytic phages will not have an integrase or excisionase, as those are only required for the dormant lysogenic cycle. Finding them means the phage is capable of lysogenic activity, so it would not be great for therapeutic use and might even bolster the immunity of its host bacteria against other phages. Sphae already checks for these sequences against the PHROG db. The repressor/antirepressor systems are present in phages that switch between lytic and lysogenic activity, so their presence also indicates a phage that can enter the lysogenic cycle and muddy therapeutic use. These systems are highly variable among phages, however, and Sphae doesn’t check explicitly for these markers. You said that “finding genes by similarity in bioinformatics is an easy task”; I guess I’m more worried about forming a small database of relevant genes based on recent literature regarding my specific bacterial host. I’m not sure if this is redundant considering what Sphae has already checked against (the link in my above post). I’m just wondering if that is something that people normally do when investigating these genomes. Also, I am very new to this, so it might be easy, but I’m really not sure what steps to start with. Would you be able to point me in the right direction?
Ok, thank you for the explanation. As I said, I am no expert in phage biology.
Meticulous literature study is generally good practice for any science project (whether it's common practice shouldn't bother you; shortcuts like ChatGPT are available these days, and too many people will take those). You can also use protein family databases like InterPro to search for candidate sequences. For a few phage sequences, you could also annotate all of their CDS directly with InterProScan.
I'd always start not with building a sequence database, but with a literature database, and check specifically for reviews on this topic. Possibly others have performed similar searches before; in that case, simply replicate their methods.
If nothing else is found, collect the known protein sequences of interest in a FASTA file, create a BLAST database from it, and run BLASTX with the phage genomes directly as queries.
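In case it helps, here is a minimal sketch of those steps as BLAST+ command lines assembled in Python. The file names (`lysogeny_markers.faa`, `phage_genome.fasta`, `hits.tsv`), the database name, and the helper functions are all made up for illustration, and the E-value of 1e-5 is only a common starting point, not a recommendation for your data. I am also assuming translation table 11 (the bacterial code) is appropriate for your phages.

```python
import shutil
import subprocess


def build_blast_commands(protein_fasta, genome_fasta, db_name, out_tsv, evalue=1e-5):
    """Return the makeblastdb and blastx command lines as argument lists.

    All paths and the default E-value are placeholders (assumptions).
    """
    # Build a protein BLAST database from the curated marker sequences.
    makedb = ["makeblastdb", "-in", protein_fasta, "-dbtype", "prot", "-out", db_name]
    # Search the translated phage genome (nucleotide query) against that database.
    # -outfmt 6 gives a 12-column tab-separated table that is easy to filter later.
    search = [
        "blastx",
        "-query", genome_fasta,
        "-db", db_name,
        "-query_gencode", "11",  # assumption: bacterial/phage translation table
        "-evalue", str(evalue),
        "-outfmt", "6",
        "-out", out_tsv,
    ]
    return makedb, search


def run_commands(commands):
    """Run each command in order; requires BLAST+ to be installed and on PATH."""
    for cmd in commands:
        if shutil.which(cmd[0]) is None:
            raise RuntimeError(f"{cmd[0]} not found on PATH; is BLAST+ installed?")
        subprocess.run(cmd, check=True)


# Only build and print the commands here; call run_commands() once the files exist.
makedb_cmd, blastx_cmd = build_blast_commands(
    "lysogeny_markers.faa", "phage_genome.fasta", "lysogeny_db", "hits.tsv"
)
print(" ".join(makedb_cmd))
print(" ".join(blastx_cmd))
```

With real files in place you would call `run_commands([makedb_cmd, blastx_cmd])` once per genome. If you would rather search the proteins predicted by your pipeline instead of the raw genomes, `blastp` takes the same `-db`, `-evalue` and `-outfmt` options with a protein query.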
Even if the pipeline you employ does something similar, you will want full control over the template sequences searched and over the E-value and other filtering cutoffs. I would not worry about whether this function partially overlaps with your pipeline.
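To illustrate what I mean by controlling the cutoffs, here is a rough sketch that filters BLAST tabular output (`-outfmt 6`, the standard 12 columns) by E-value, percent identity and alignment length. The three example rows and the reference names in them are invented, and the default thresholds are arbitrary placeholders that you would have to justify for your own data.

```python
import csv
import io

# Column order of the default BLAST tabular format (-outfmt 6).
FIELDS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def filter_hits(handle, max_evalue=1e-5, min_pident=30.0, min_length=50):
    """Keep hits that pass all three cutoffs (placeholder thresholds)."""
    kept = []
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        if (
            float(hit["evalue"]) <= max_evalue
            and float(hit["pident"]) >= min_pident
            and int(hit["length"]) >= min_length
        ):
            kept.append(hit)
    return kept


# Invented example rows, only to show the expected input shape.
EXAMPLE_TSV = (
    "phage_A\tCI_repressor_ref\t62.5\t180\t60\t2\t1200\t1739\t5\t184\t2e-40\t150\n"
    "phage_B\tantirepressor_ref\t28.0\t40\t28\t1\t300\t419\t10\t49\t0.5\t25\n"
    "phage_C\tCI_repressor_ref\t45.0\t35\t19\t0\t900\t1004\t1\t35\t1e-6\t48\n"
)

kept = filter_hits(io.StringIO(EXAMPLE_TSV))
for hit in kept:
    print(hit["qseqid"], hit["sseqid"], hit["evalue"])
```

On real output you would pass an open file instead, e.g. `with open("hits.tsv") as fh: filter_hits(fh)`. Keeping the cutoffs as explicit parameters makes it easy to report them in your methods and to check how sensitive your conclusions are to them.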
Please let us know if you need help with any specific step.