Question: Chromosome alignment visualization and gene elicitation for bacterial samples with unknown reference genomes
gravatar for LRStar
11 days ago by
United States
LRStar20 wrote:

This is a somewhat of a follow-up from a previous post ( I was recently given 4 paired-end .fastq files (where each read has about 150 bases), each from a different strain of bacteria. The researchers seem to expect these 4 samples to likely be from the same genre (Helicobacter), although some of the species within this genre are known for having huge genetic diversity.

Tasks they recently requested of me include the following:

1) Determine the number of genes in each sequence.

2) Determine the number of protein coding genes in each sequence.

3) Determine if a certain gene (vacA) is present in each sequence. If present, determine the number and location of this gene.

4) Perform chromosome alignment visualization using MUAVE to determine differences in location and orientation of shared DNA sequences (Similar to here:

What I have tried so far:

1) I used SPAdes to create contigs for each sample. Then, I blasted several of the contigs. I noticed that most hits came back from the Helicobacter genus. However, there were strong hits for many different species of this genus (which I have read are known to be genetically diverse). I chose one of the species that seemed to have the most strong hits and downloaded it as a reference genome (let's call it h.fasta). Then, I ran QUAST on all four samples, both using the reference genome and without using the reference genome. I found that only Sample 1 showed results in the Contig alignment viewer. Samples 2-4 were blank.

2) Unsure about how to interpret the results from (1), I then decided to try another approach. I aligned each sample to the reference genome h.fasta using BWA. Then, I uploaded the .bam files to Qualimap. I found that Sample 1 had 85% mapping rate, but Samples 2-4 had only 13% mapping rate. A helpful poster in my previous post suggested that >95% mapping rate was indicative of a "good reference". So, this step made me think my reference may not be suitable.

3) Unsure about how to interpret results from (1) and (2), I then decided to try another approach. I decided to try to use an aligner like mummer on the contigs (previously output from SPAdes). As was suggested to me by a helpful poster in my previous post, I would first need to "reorder" the contigs using MUAVE. As described in my previous post, I received an error. Regardless, this process seemed to require a reference genome and given my results from (1) and (2), I do not think my reference genome selection was too reliable in the first place.

My current questions:

1) The scientists who gave me these files also do not seem to have advice on suitable reference genome(s) for these samples. Is there a more effective way for me to determine suitable reference genomes for these samples? Based on QUAST results, the first sample seems different than the rest (larger number of base pairs and smaller GC%). So, I may need to find different reference genomes for different samples (if reference genomes are necessary in the first place).

2) Given the tasks that the scientists have requested from me, is it necessary to find reference genomes in the first place? For instance, can I reliably achieve tasks 1-3 (determine the number of genes, the number of protein coding genes, and the presence of particular genes - like vacA - in these samples) by simply using the original .fasta read files or the contig .fasta files that are output from SPAdes? Or, do I really need to either fully assemble or align the samples in order to complete these three tasks?

3) Given some of the problems that have arisen in what I have tried so far, what next steps may be pursued to work toward completing the assigned tasks?

I realize this post is long. I am grateful for advice, even if only on one component of it. I am really learning a lot of software and ways to think about assembly/alignment of bacterial genomes and thank any shared ideas.

ADD COMMENTlink modified 11 days ago • written 11 days ago by LRStar20
gravatar for Carambakaracho
11 days ago by
Carambakaracho1.6k wrote:
  1. To determine the best reference, you may want to download all Helicobacter genomes, see this post where I asked the same question. You can then use sourmash to compute k-mer signatures and determine relatedness. I recently used this approach and the key step is to compute the signatures for enough k-mer sizes (11 was most discriminative in my case)
  2. No, not necessarily. you can use a procaryotic annotation pipeline, such as Torsten Seeman's prokka or the NCBI's pgap pipeline

In case you still try to find out the reference and sourmash wasn't conclusive either, I'd drop the DNA based approach and try to go with proteins from the predictions

ADD COMMENTlink written 11 days ago by Carambakaracho1.6k

@Carambakaracho Thank you for your suggestions and I am researching them now. Can I clarify what you mean about going "with proteins from the predictions"? Would this approach not require reference genome? What software/procedure would I use to generate these predictions (i.e. predictions of proteins?)

ADD REPLYlink written 11 days ago by LRStar20

Well not exactly. It would need a reference database - I had a specific workflow in mind where you essentially define a core proteome of your genomes (proteins that exist in each genome) and compare it against some reference database. However, I half read their publication at best, but I can't find anymore...

ADD REPLYlink written 11 days ago by Carambakaracho1.6k

Here's the software and the paper I had in mind, it's called ezTree

ADD REPLYlink modified 10 days ago • written 10 days ago by Carambakaracho1.6k

@Carambakaracho. Thank you for your help. I found the sourmash tool to be a useful resource I did not previously know about. I indeed followed your helpful advice and downloaded all Helicobacter genomes (n=211) and then ran various k-mer signatures to determine relatedness. The first sample matched pretty well to certain Helicobacter species. However, the last three samples did not match too well (usually hovering at best around 48%). I had growing suspicions due to other approaches that the last three samples were probably not Helicobacter. I consulted the researchers and it is apparent they really do not know what it is either. They are wet lab biologists and I guess they just know these samples can grow in a Petri dish. They want me to computationally figure out what these samples are. Do you have any suggestions on how I should approach this now, possibly using the same sourmash resource? For example, is it possible for me to extract a subset of the complete genomes that are not just Helicobacter, but are basically all fungi, yeast, bacteria, etc.? I am a bit stuck on how to possibly do this with the assembly_summary_genebank text file. I would be so grateful for any advice.

ADD REPLYlink modified 7 days ago • written 7 days ago by LRStar20

Staying with sourmash you could go with their LCA type example, other than that Kraken2 would be my choice. Or a simple blast against genbank/refseq or (with the proteins) against nr. With the blast executable there's an executable to download full formatted blast indeces for either of the big databases, see here, too

ADD REPLYlink written 7 days ago by Carambakaracho1.6k
Please log in to add an answer.


Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1935 users visited in the last hour