I am currently working with a bacterial strain (let’s call it Strain A) for which I have a complete genome assembly (StrainA_genome.fasta). From this strain, I obtained two sets of phage-related sequences using different experimental approaches:
1. An assembled phage genome from lab experiments: StrainA_phage_gapfilled.fasta
2. A pair of short-read sequencing files from another phage preparation: StrainA_phage_R1.fasta and StrainA_phage_R2.fasta
Separately, I predicted prophage regions on StrainA_genome.fasta using Cenote-Taker 3 and PhageBoost, and collected all the predicted prophage sequences into a file: StrainA_predicted_phages.fasta.
My goal is to verify whether the experimentally derived phage sequences correspond to any of the predicted prophage regions, i.e., to confirm that the experimentally recovered sequences truly originate from prophages within Strain A.
What I tried
I attempted various alignments, including:
I used bwa mem to align both the raw phage reads and the gap-filled phage assembly against the predicted prophage regions, using commands like:
# Align short reads to predicted phages
bwa mem -t 4 StrainA _predicted_phages.fasta StrainA_phage_R1.fasta StrainA_phage_R2.fasta > phage_reads_vs_predicted.sam
# Align assembled phage genome to predicted phages
bwa mem -t 4 StrainA_predicted_phages.fasta StrainA_phage_gapfilled.fasta > phage_assembly_vs_predicted.sam
none of the reads or assemblies mapped to any of the predicted prophage sequences.
I then aligned the same experimental phage sequences (both reads and the assembled genome) directly to the full bacterial genome (StrainA_genome.fasta), expecting at least partial matches in the prophage regions. However, there were still no alignments.
# Also tried aligning both to the host genome:
bwa mem -t 4 StrainA_genome.fasta StrainA_phage_gapfilled.fasta > phage_vs_host.sam
This was very unexpected — if the prophages are indeed part of the genome, I would expect at least some reads to map to those regions.
I also tried the same comparisons using Minimap2/bowtie2/blast, even tuned parameters to allow more mismatches and gaps. Unfortunately, there were no alignments..
To verify the identity of the experimentally obtained phage sequences, I submitted StrainA_phage_gapfilled.fasta to NCBI ORF Finder, extracted several of the longer predicted protein sequences, and then ran BLASTp against the virus (taxid:10239) database. These searches returned high-confidence hits to known phage proteins, supporting the idea that the experimental sequences are indeed phage-derived.
Additionally, I extracted CRISPR spacer sequences from Strain A using CRISPRCasTyper, and tried to align them (using BLAST) to the gap-filled experimental phage genome. Only one spacer hit was found, with a relatively low score (e-value ~1e-2).
I am now wondering:
1. Could this complete lack of alignment be caused by my alignment strategy?
Are there better tools or methods specifically suited to phage–prophage or phage–host genome comparisons?
2. Could experimental artifacts explain this?
I do not perform the experimental work myself. Could the phage DNA isolation,amplification, or assembly methods introduce chimerism or unrelated sequences that mislead mapping?
3. Or is it genuinely common to see such divergence between experimentally isolated phages and prophage predictions from the same strain genome?
Any thoughts or suggestions on improving the alignment strategy or interpreting these results would be greatly appreciated.
Is this a publicly available reference genome (preferably RefSeq version) for your organism (strain)? Have you compared it yo
StrainA_phage_gapfilled.fasta
. There is a chance that the assembly you put together (was any long read data involved?) has errors in it that are causing some of the issues above. What kind of QC did you do for this assembly?Or are
StrainA_phage_gapfilled.fasta
andStrain_genome.fasta
both derived by you. In that case you should compare them with RefSeq versions of genomes from NCBI. Have you done this?As a sanity chec,k you can take the original sequence (non-assembled) and use
bbduk.sh
from BBMap suite in filter-mode to see if there are a sub-set of reads match those of the virus you expect to be there.Thank you!
The
StrainA_genome.fasta
file is a complete genome assembly generated by our team and has been submitted to NCBI. For all alignments, I used the corresponding RefSeq version, so I think the genome reference itself is not the issue.The
StrainA_phage_gapfilled.fasta
sequence was derived from an experimentally extracted phage sample, which was then processed by a third-party company using their own gap-filling algorithm.I tried to use BBduk to mapped the raw sequencing reads (before assembly) used to generate StrainA_phage_gapfilled.fasta back to our host genome (StrainA_genome.fasta), the mapping rate was 99%, showing that the raw reads are largely derived from the host genome.However, when I re-mapped those same raw reads to the assembled phage genome (StrainA_phage_gapfilled.fasta), the mapping rate was only 0.1%. I’m not entirely sure why this is the case.
One possibility I’m considering is to extract that 0.1% subset of reads that successfully map to the StrainA_phage_gapfilled.fasta, and then see where they map on the host genome. This might allow us to pinpoint their origin and compare those regions with the prophage regions predicted by tools like Cenote-Taker3 and PhageBoost. The goal is to confirm whether the experimentally derived sequences actually overlap with predicted prophage loci.
In addition, the files
StrainA_phage_R1.fasta
andStrainA_phage_R2.fasta
contain only 26 reads in total. These appear to be a small, filtered set of candidate phage reads selected by the company. However, after quality control and primer removal, these reads fail to map to the host genome entirely. Using BLAST, only very short matches (<20 bp) could be found, which are clearly low-confidence.Sounds like this is just the phage genome, correct? Having the
StrainA
in the name is confusing since it seems to indicate that there are host sequences in there.Why is that surprising. You know that 99% of original raw reads are host genome so only the remaining 1% reads are going to be potentially phage.
You know for sure that these are going to be phage and nothing else? If so these reads should align 100% to
StrainA_phage_gapfilled.fasta
(phage genome obtained from the separate experiment). Do they cover the entire phage genome?Have you checked the NCBI viral genome database using
StrainA_phage_gapfilled.fasta
to see if it "hits" the expected phage in the databases.Is
StrainA_genome.fasta
co-linear with publishedRefSeq
version? Does it represent the entire genome?