6.1 years ago
DanielC
Dear Friends,
I am looking to predict the sequences of tail proteins of phage lysin within the phage genome. I have the FASTQ files. Could you please suggest the best way to do this?
Thanks much!
If you have fastq files, you probably do not have an assembled phage genome. What exactly is the data you have? Did you sequence isolated and purified phages?
Assuming you have sequencing files, you should assemble the genome, then blast the resulting contigs for annotation.
Thanks! Yes, I have sequence reads of bacteriophages in fastq files.
I am new to this. Could you please let me know the steps to perform this, and which tools to use?
Thanks for your time.
You can use SPAdes (this is a small genome) to assemble the data. Look at the examples on their page to see how to use this software. Once assembly is done you can use blast searches to do annotation.
Edit: You may also want to look at tadpole.sh from the BBMap suite. It does well with small viral genome assemblies.
You can use a general genome assembler, or a pipeline tailored for virus genomes, e.g. metaViC or IVA. Blast is NCBI BLAST; if you have only a few resulting contigs, you can use the online version.
Thanks for your suggestions. I did the above and got 10 contigs. I would really appreciate your comments on these points:
a) The fastq files I am analyzing were produced on an Ion Torrent sequencer. There are three fastq files, and the same read IDs appear in all three, like this: "@4C7U2:1234:1345". I think these are paired-end reads? But can paired-end reads be in three files? And how can one run SOAPdenovo on these three files at once if they contain paired-end reads? As far as I know, SOAPdenovo takes only two files at a time.
b) I ran SOAPdenovo on two of the fastq files and then ran blastn on the resulting contigs. I got good hits with 100% and 99% identity, like this:
The question is: how can I tell whether a contig belongs to the tail protein of the phage?
I am not familiar with Ion Torrent and its naming conventions, so can't help. I believe generally it is single-end sequencing. Could be paired end, but that would imply two files. Maybe the third are indexes?
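One way to check what the three files actually contain is to compare their read IDs directly. The sketch below is only an illustration: it assumes plain four-lines-per-record FASTQ, and the function names `read_ids` and `count_files_per_id` are made up for this example. It reports how many read IDs occur in one, two, or three of the files.

```python
from collections import Counter


def read_ids(lines):
    """Return read IDs from FASTQ lines (assumes 4 lines per record).

    The ID is the header text after '@', up to the first whitespace.
    """
    ids = []
    for line_number, line in enumerate(lines):
        header = line.strip()
        if line_number % 4 == 0 and header:
            ids.append(header[1:].split()[0])
    return ids


def count_files_per_id(files):
    """Map 'number of files an ID occurs in' to 'number of such IDs'.

    `files` is a dict of label -> iterable of FASTQ lines
    (for example, open file handles).
    """
    files_per_id = Counter()
    for lines in files.values():
        # Use a set so an ID repeated inside one file counts once.
        for read_id in set(read_ids(lines)):
            files_per_id[read_id] += 1
    return Counter(files_per_id.values())
```

With real data you would pass open handles, e.g. `count_files_per_id({"a": open("a.fastq"), "b": open("b.fastq"), "c": open("c.fastq")})` (file names are placeholders). If almost every ID occurs in all three files, it is also worth comparing read lengths: very short reads in one file would be consistent with the index guess above.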
As you keep asking, I am under the impression someone worked on this project ahead of you, and then handed you some files without explaining. Please get acquainted with all steps of the project, including at least some basic level of knowledge about the technology used.
When your blast hit is 99% identical to "16S ribosomal RNA gene" over 99% of the query length, I am pretty sure it is not a phage tail protein. This is what I get when I blast a tail protein:
It may be the case the blast hits are to complete phage genomes, so the header will not say "tail protein", you will have to find it there.
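If you run BLAST+ locally, a quick triage of the hit titles can separate the obvious cases. This is a rough sketch, not a standard tool: it assumes the search was written as tab-separated output with the columns `qseqid sseqid pident qcovs stitle`, and the thresholds, labels, and function name `classify_hits` are arbitrary choices for illustration.

```python
# Assumed column order of the tab-separated BLAST output.
COLUMNS = ["qseqid", "sseqid", "pident", "qcovs", "stitle"]


def classify_hits(lines, min_identity=90.0, min_coverage=80.0):
    """Label strong hits by keywords in the subject title.

    Returns a list of (contig, label, title) tuples. Hits below the
    (arbitrary) identity or coverage thresholds are skipped.
    """
    labelled = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        if float(hit["pident"]) < min_identity:
            continue
        if float(hit["qcovs"]) < min_coverage:
            continue
        title = hit["stitle"].lower()
        if "ribosomal rna" in title:
            label = "ribosomal RNA, not a tail protein"
        elif "tail" in title:
            label = "tail in title"
        elif "phage" in title:
            label = "phage hit, inspect the genome annotation"
        else:
            label = "other"
        labelled.append((hit["qseqid"], label, hit["stitle"]))
    return labelled
```

Contigs labelled as phage hits without "tail" in the title are the ones where you have to open the matching genome record and look at which annotated gene your contig aligns to.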
You can also search for protein domains; there are a number of domains associated with phage tail proteins. That would involve predicting the proteins from the nucleotide sequences, then searching for the domains.
Thanks! I am thinking of running blastx on the contigs from the assembly and then looking up the proteins found by blastx in the Pfam phage tail families. Would this approach be reasonable?
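To make the "predict the proteins" step concrete, here is a deliberately naive six-frame ORF finder. It is a sketch under simplifying assumptions: only ATG is treated as a start codon, the standard codon table is used, and ORFs running off the end of a contig are ignored. A dedicated gene caller such as Prodigal handles alternative starts and partial genes properly; the function names here are hypothetical.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = (
    "FFLLSSSSYY**CC*W"
    "LLLLPPPPHHQQRRRR"
    "IIIMTTTTNNKKSSRR"
    "VVVVAAAADDEEGGGG"
)
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate DNA codon by codon; unknown codons become 'X'."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X")
        for i in range(0, len(seq) - 2, 3)
    )


def find_orfs(contig, min_aa=50):
    """Return proteins (first M up to a stop) from all six frames."""
    contig = contig.upper()
    proteins = []
    for strand in (contig, reverse_complement(contig)):
        for frame in range(3):
            protein = translate(strand[frame:])
            # Drop the last piece: it is not terminated by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    proteins.append(segment[start:])
    return proteins
```

The resulting protein sequences can be written to a FASTA file and searched against Pfam profiles, for example with `hmmscan` from HMMER, to look for tail-associated domains.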