Question

How to predict the sequences of tail proteins of phage lysin within phage genome (fastq format)?

7.3 years ago

DanielC ▴ 210

Dear Friends,

I am looking to predict the sequences of tail proteins of phage lysin within the phage genome. I have the fastq files. Can you please give your suggestions on what could be the best way to perform this?

Thanks much!

phage fatsq predict • 1.9k views

ADD COMMENT • link 7.3 years ago by DanielC ▴ 210

If you have fastq files, you probably do not have an assembled phage genome. What exactly is the data you have? Did you isolate, purify, and sequence the phages?

Assuming you have sequencing files, you should assemble the genome, then blast the resulting contigs for annotation.

ADD REPLY • link 7.3 years ago by h.mon 35k

Thanks! Yes, I have sequence reads of bacteriophages in fastq files.

"you should assemble the genome, then blast the resulting contigs for annotation."

I am new to this. Could you please let me know the steps to perform this, such as which tools to use?

Thanks for your time.

ADD REPLY • link 7.3 years ago by DanielC ▴ 210

You can use SPAdes (this is a small genome) to assemble the data. Look at the examples on their page to see how to use this software. Once assembly is done you can use blast searches to do annotation.

Edit: You may also want to look at tadpole.sh from BBMap suite. It does well with small viral genome assemblies.
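
As a rough illustration of the assembly step, the sketch below only builds the command lines for the two assemblers mentioned above. It assumes a single file of unpaired Ion Torrent reads; the file names are hypothetical, and the only options used are `--iontorrent`, `-s` and `-o` for spades.py and `in=`/`out=` for tadpole.sh. Check each tool's own help for the settings that suit your data.

```python
import shlex


def spades_command(fastq_path, out_dir, ion_torrent=True):
    """Build a spades.py command for one file of unpaired reads."""
    cmd = ["spades.py"]
    if ion_torrent:
        cmd.append("--iontorrent")  # SPAdes flag for Ion Torrent data
    cmd += ["-s", fastq_path, "-o", out_dir]
    return cmd


def tadpole_command(fastq_path, contigs_path):
    """Build a tadpole.sh (BBMap suite) command with default settings."""
    return ["tadpole.sh", f"in={fastq_path}", f"out={contigs_path}"]


if __name__ == "__main__":
    # Hypothetical file names; the commands are printed, not executed.
    print(shlex.join(spades_command("phage_reads.fastq", "spades_out")))
    print(shlex.join(tadpole_command("phage_reads.fastq", "tadpole_contigs.fa")))
```

Once the tools are installed, either list can be passed to `subprocess.run(cmd, check=True)`. SPAdes writes its contigs to `contigs.fasta` inside the output directory, and that file is what you would then blast.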

ADD REPLY • link 7.3 years ago by GenoMax 152k

You can use a general genome assembler, or a pipeline tailored for virus genomes, e.g. metaViC or IVA. Blast here means NCBI BLAST; if you have only a few resulting contigs, you can use the online version.

ADD REPLY • link 7.3 years ago by h.mon 35k

Thanks for your suggestions. I did the above and got 10 contigs. I would really appreciate it if you could comment on these:

a) The fastq files I am analysing were produced on an Ion Torrent sequencer. There are three fastq files, and the same read identifiers appear in all three, like this: "@4C7U2:1234:1345". I think these are paired-end reads? But can paired-end reads be split across three files? And how can one run SOAPdenovo on all three files at once if they contain paired-end reads? As far as I know, SOAPdenovo takes only two files at a time.

b) I ran SOAPdenovo on two of the fastq files, then ran blastn on the resulting contigs and got good hits with 100% and 99% identity, like this:

Uncultured bacterium clone PAE-EN23_12 16S ribosomal RNA gene, partial sequence
    470     470     99%     4e-129  99%     KC238410.1

The question is how to know that "the contigs belong to the tail protein of the phage or in other words is the tail of the phage"?

ADD REPLY • link 7.3 years ago by DanielC ▴ 210

I am not familiar with Ion Torrent and its naming conventions, so I can't help. I believe it is generally single-end sequencing. It could be paired-end, but that would imply two files. Maybe the third file contains indexes?
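
One way to get a first idea of what the three files contain is to compare their read identifiers and read lengths directly. The sketch below assumes plain, uncompressed FASTQ with exactly four lines per record, and takes the text before the first space in each header as the read ID. The function names are made up for this example.

```python
def fastq_summary(path):
    """Return (set of read IDs, list of read lengths) for a 4-line-per-record FASTQ."""
    ids = set()
    lengths = []
    with open(path) as handle:
        for line_no, line in enumerate(handle):
            line = line.rstrip("\n")
            if line_no % 4 == 0:
                # Header line: drop the leading '@', keep text before the first space.
                fields = line[1:].split()
                if fields:
                    ids.add(fields[0])
            elif line_no % 4 == 1:
                # Sequence line.
                lengths.append(len(line))
    return ids, lengths


def compare_fastqs(paths):
    """Return (IDs present in every file, per-file read count and mean read length)."""
    summaries = {path: fastq_summary(path) for path in paths}
    shared = set.intersection(*(ids for ids, _ in summaries.values()))
    report = {}
    for path, (ids, lengths) in summaries.items():
        mean_length = sum(lengths) / len(lengths) if lengths else 0.0
        report[path] = {"reads": len(lengths), "mean_length": mean_length}
    return shared, report
```

If nearly all IDs are shared and one file holds only very short reads, that file is more likely to hold indexes or barcodes than a third set of mates. This is only a heuristic; the sequencing facility can tell you for certain what each file is.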

As you keep asking, I am under the impression someone worked on this project ahead of you, and then handed you some files without explaining. Please get acquainted with all steps of the project, including at least some basic level of knowledge about the technology used.

"The question is how to know that 'the contigs belong to the tail protein of the phage or in other words is the tail of the phage'?"

When your blast hit is 99% identical to a "16S ribosomal RNA gene" over 99% of the query length, I am pretty sure it is not a phage tail protein. This is what I get when I blast a tail protein:

 Uncultured Myoviridae g91 mRNA for putative phage tail sheath protein, partial cds, clone: HirosawanoikePond090915-057
    2065    2065    100%    0.0 100%    AB690520.1

It may be that the blast hits are to complete phage genomes, in which case the header will not say "tail protein" and you will have to find the tail protein within the matched genome.
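
With more than a handful of contigs, the hit titles can be screened in a script rather than by eye. The sketch below assumes command-line BLAST+ was run with `-outfmt "6 qseqid sseqid pident evalue bitscore stitle"`, so that each line carries the subject title. The query names `contig_1` and `contig_2` are hypothetical; the other values are the two hits quoted in this thread.

```python
COLUMNS = ["qseqid", "sseqid", "pident", "evalue", "bitscore", "stitle"]

# Two example rows built from the hits quoted in this thread (query names are made up).
EXAMPLE = (
    "contig_1\tKC238410.1\t99\t4e-129\t470\t"
    "Uncultured bacterium clone PAE-EN23_12 16S ribosomal RNA gene, partial sequence\n"
    "contig_2\tAB690520.1\t100\t0.0\t2065\t"
    "Uncultured Myoviridae g91 mRNA for putative phage tail sheath protein, "
    "partial cds, clone: HirosawanoikePond090915-057\n"
)


def tail_hits(tabular_text, keywords=("tail",), min_identity=0.0):
    """Return hits whose subject title contains any keyword (case-insensitive)."""
    hits = []
    for line in tabular_text.splitlines():
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            continue  # skip blank or malformed lines
        hit = dict(zip(COLUMNS, fields))
        title = hit["stitle"].lower()
        if float(hit["pident"]) >= min_identity and any(k.lower() in title for k in keywords):
            hits.append(hit)
    return hits


if __name__ == "__main__":
    for hit in tail_hits(EXAMPLE):
        print(hit["qseqid"], hit["sseqid"], hit["stitle"])
```

A keyword screen like this only catches hits whose title names the protein. It will miss hits to complete phage genomes, which have to be checked against the annotation of the matched genome.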

You can also search for protein domains; there are a number of domains associated with phage tail proteins. That would involve predicting the proteins from the nucleotide sequences, then searching them for the domains.
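
To show what "predicting the proteins from the nucleotide sequences" means in its simplest form, here is a naive six-frame ORF finder. It is a sketch under simplifying assumptions: only ATG counts as a start codon, the standard codon table is used, and an ORF must end in a stop codon within the contig. A dedicated gene predictor will do better on real phage contigs; the resulting protein sequences are what you would search for domains.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
# Standard genetic code, codons enumerated in TCAG order.
CODON_TABLE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Reverse-complement an upper-case DNA string."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons; codons with ambiguous bases become 'X'."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3))


def find_orfs(dna, min_aa=30):
    """Return (strand, frame, protein) for ATG-to-stop ORFs of at least min_aa residues."""
    dna = dna.upper()
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            protein = translate(seq[frame:])
            # Each piece except the last is followed by a stop codon.
            for segment in protein.split("*")[:-1]:
                start = segment.find("M")
                if start != -1 and len(segment) - start >= min_aa:
                    orfs.append((strand, frame, segment[start:]))
    return orfs
```

For example, `find_orfs("ATGGCTGCTGCTTAA", min_aa=3)` returns `[("+", 0, "MAAA")]`. The default `min_aa=30` is an arbitrary cutoff chosen for this sketch, so adjust it to your needs.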

ADD REPLY • link 7.3 years ago by h.mon 35k

Thanks! I am thinking of running blastx on the contigs from the assembly software and then looking up the proteins hit by blastx in the Pfam phage tail family. Would this approach be reasonable?

ADD REPLY • link 7.3 years ago by DanielC ▴ 210