This is a somewhat of a follow-up from a previous post (https://www.biostars.org/p/397137/). I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. The researchers seem to expect these 4 samples to likely be from the same genre (Helicobacter), although some of the species within this genre are known for having huge genetic diversity.
Tasks they recently requested of me include the following:
1) Determine the number of genes in each sequence.
2) Determine the number of protein coding genes in each sequence.
3) Determine if a certain gene (vacA) is present in each sequence. If present, determine the number and location of this gene.
4) Perform chromosome alignment visualization using MUAVE to determine differences in location and orientation of shared DNA sequences (Similar to here: http://darlinglab.org/mauve/user-guide/viewer.html)
What I have tried so far:
1) I used SPAdes to create contigs for each sample. Then, I blasted several of the contigs. I noticed that most hits came back from the Helicobacter genus. However, there were strong hits for many different species of this genus (which I have read are known to be genetically diverse). I chose one of the species that seemed to have the most strong hits and downloaded it as a reference genome (let's call it h.fasta). Then, I ran QUAST on all four samples, both using the reference genome and without using the reference genome. I found that only Sample 1 showed results in the Contig alignment viewer. Samples 2-4 were blank.
2) Unsure about how to interpret the results from (1), I then decided to try another approach. I aligned each sample to the reference genome h.fasta using BWA. Then, I uploaded the .bam files to Qualimap. I found that Sample 1 had 85% mapping rate, but Samples 2-4 had only 13% mapping rate. A helpful poster in my previous post suggested that >95% mapping rate was indicative of a "good reference". So, this step made me think my reference may not be suitable.
3) Unsure about how to interpret results from (1) and (2), I then decided to try another approach. I decided to try to use an aligner like mummer on the contigs (previously output from SPAdes). As was suggested to me by a helpful poster in my previous post, I would first need to "reorder" the contigs using MUAVE. As described in my previous post, I received an error. Regardless, this process seemed to require a reference genome and given my results from (1) and (2), I do not think my reference genome selection was too reliable in the first place.
My current questions:
1) The scientists who gave me these files also do not seem to have advice on suitable reference genome(s) for these samples. Is there a more effective way for me to determine suitable reference genomes for these samples? Based on QUAST results, the first sample seems different than the rest (larger number of base pairs and smaller GC%). So, I may need to find different reference genomes for different samples (if reference genomes are necessary in the first place).
2) Given the tasks that the scientists have requested from me, is it necessary to find reference genomes in the first place? For instance, can I reliably achieve tasks 1-3 (determine the number of genes, the number of protein coding genes, and the presence of particular genes - like vacA - in these samples) by simply using the original .fasta read files or the contig .fasta files that are output from SPAdes? Or, do I really need to either fully assemble or align the samples in order to complete these three tasks?
3) Given some of the problems that have arisen in what I have tried so far, what next steps may be pursued to work toward completing the assigned tasks?
I realize this post is long. I am grateful for advice, even if only on one component of it. I am really learning a lot of software and ways to think about assembly/alignment of bacterial genomes and thank any shared ideas.