Question

Tools for reference-guided de novo assembly of bacterial genome

1


6 months ago

analyst ▴ 50

Dear all, please suggest which tool I should use for optimal assembly of bacterial fastq reads. Which approach is better when a reference genome is available: de novo assembly or reference-guided de novo assembly? Also, does SPAdes support reference-guided assembly?

I also read threads about the --trusted-contigs parameter of SPAdes but could not understand it clearly, because people say that it merges assemblies rather than guiding the assembly. Doesn't the statement from the SPAdes assembler that "trusted contigs will be used for graph construction, gap closure and repeat resolution" tell us that it guides the assembly?

Many thanks!

reference-guided genome assembly spades Bacterial • 1.1k views

ADD COMMENT • link 6 months ago by analyst ▴ 50

score 2 · Answer 1 · 2023-10-27

2


6 months ago

Brian Bushnell 20k

If you are dealing with very minor strain variation (a handful of SNPs, maybe some short indels, but no structural variations), you might want to use a reference-guided assembly. You can do this, for example, by aligning the reads to the reference and then running a consensus program (e.g. BBTools' consensus.sh) to produce a revised reference. The advantage is that if you start with a single-contig reference, you will end up with a single-contig assembly, and even areas where coverage drops to zero will stay intact.
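As a rough sketch of the align-then-consensus idea described above, something like the following could be used. It assumes bbmap.sh and consensus.sh from BBTools are on your PATH. The function names, the `DRY_RUN` switch, and all file names (including the intermediate `mapped.sam`) are placeholders made up for illustration, so check the parameters against the tools' own help before relying on it.

```shell
#!/bin/sh
# Sketch of a reference-guided consensus workflow with BBTools.
# Assumption: bbmap.sh and consensus.sh are installed and on PATH.
# All file names below are hypothetical placeholders.

# run: execute a command, or only print it when DRY_RUN is set.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# reference_guided_consensus READS REF OUT
# READS: fastq reads, REF: reference fasta, OUT: revised reference fasta.
reference_guided_consensus() {
    reads=$1
    ref=$2
    out=$3
    # Step 1: align the reads to the reference.
    run bbmap.sh in="$reads" ref="$ref" out=mapped.sam || return 1
    # Step 2: rewrite the reference to reflect what the aligned reads show.
    run consensus.sh in=mapped.sam ref="$ref" out="$out"
}
```

Calling `reference_guided_consensus reads.fq ref.fa consensus.fa` with `DRY_RUN=1` set only prints the two commands, which is a cheap way to check them before running on real data.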

If you are dealing with anything more than minor strain variation (which you can easily determine by aligning to the reference, calling variants, and seeing what you get, or by using comparesketch.sh to compare the fastq to the reference and calculate ANI), then I would just run SPAdes in pure de novo mode; that's always the safest. That said, I've never used its --trusted-contigs parameter, so that might be interesting to experiment with.

ADD COMMENT • link 6 months ago by Brian Bushnell 20k

0


Thank you so much, Brian Bushnell, for such a detailed answer! I have to perform annotation and construct a phylogenetic tree for a total of 19 samples. For variant calling I used Snippy on the fastq reads. Do you mean I should use the reconstructed genome, instead of the reference genome, for variant calling of bacterial WGS reads? Please give your valuable suggestions.

And yes, I experimented with the SPAdes --trusted-contigs flag too :) I used the reference bacterial genome (fasta file). There are other tools too, like Unicycler and ABySS. Which tool do you prefer for assembling bacterial reads?

Here is my QUAST report with the --trusted-contigs option:

spades with reference

Spades assembly without reference option:

spades without reference

Also denovo assembly through abyss

Assembly through abyss

Assembly through unicycler

Assembly through unicycler

Please suggest which assembly approach is better here.

ADD REPLY • link 6 months ago by analyst ▴ 50

1


For bacterial genome assembly, JGI (where I work) strictly uses SPAdes in pure de novo mode. Of course we assemble all kinds of bacteria that are closely related to others, and we have whole plates of bacteria that are 99% ANI to each other, but we still just use SPAdes in pure de novo mode because 1) it would take a huge amount of manual effort to figure out which ones have the same structure and thus could use each other's contigs for scaffolding, and 2) it's just not safe to use organisms to scaffold other organisms unless you are absolutely confident that they have no structural variations, and we never know. If you compare two organisms and they are 99.9% ANI, then it's pretty likely that they have no large-scale SVs. There's no guarantee, but in that case I'd definitely try a reference-guided assembly instead of de novo. Then you can map reads and call variants, and if you get lots of closely spaced variants in certain regions, that indicates a structural variation and you need to abandon the reference-guided assembly approach.
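One way to look for the "lots of closely spaced variants" signal is to count variant records per fixed-size window of a VCF. The sketch below does that with awk. The function name, the 1000 bp window, and the cutoff of 5 variants per window are arbitrary assumptions for illustration, not values from any tool or from JGI practice, and a dense window is only a hint that the region deserves a closer look.

```shell
#!/bin/sh
# dense_variant_windows VCF [WINDOW] [MIN_COUNT]
# Prints "contig window_start window_end count" for every fixed-size window
# that holds at least MIN_COUNT variant records.
# Defaults (1000 bp, 5 variants) are illustrative assumptions only.
dense_variant_windows() {
    vcf=$1
    window=${2:-1000}
    min_count=${3:-5}
    awk -v w="$window" -v m="$min_count" '
        /^#/ { next }    # skip VCF header lines
        {
            # 1-based start coordinate of the window containing this position
            start = int(($2 - 1) / w) * w + 1
            key = $1 "\t" start
            # remember windows in first-seen order for stable output
            if (!(key in count)) order[++n] = key
            count[key]++
        }
        END {
            for (i = 1; i <= n; i++) {
                if (count[order[i]] >= m) {
                    split(order[i], f, "\t")
                    print f[1], f[2], f[2] + w - 1, count[order[i]]
                }
            }
        }
    ' "$vcf"
}
```

For example, `dense_variant_windows vars.vcf 1000 5` lists every 1 kb window with five or more calls. An empty result is consistent with scattered SNPs, though it does not prove the absence of structural variation.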

JGI often has plates of bacteria that have 99% ANI to each other, but who knows if the 1% difference is random SNPs or some big structural variation. So we don't do reference-guided assembly since it doesn't work on a large scale. But in individual cases, it can give you a much better assembly if you have a single-contig assembly that you just want to modify to reflect some SNPs or short indels.
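The rule of thumb in this answer can be written down as a small helper, shown here only as a sketch. The function name and the output labels are made up. The 99.9% default is just the figure mentioned above, not a validated cutoff, and a "reference-guided" result still needs the variant-calling check afterwards.

```shell
#!/bin/sh
# choose_assembly_mode ANI_PERCENT [THRESHOLD]
# Prints "reference-guided" when ANI is at or above the threshold,
# otherwise "de-novo". The 99.9 default is a heuristic from this thread.
choose_assembly_mode() {
    ani=$1
    threshold=${2:-99.9}
    # awk does the floating-point comparison that plain sh cannot.
    if awk -v a="$ani" -v t="$threshold" 'BEGIN { exit !(a >= t) }'; then
        echo "reference-guided"
    else
        echo "de-novo"
    fi
}
```

The ANI value itself would come from a separate comparison step, such as the comparesketch.sh run mentioned earlier in the thread.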

For annotation, unless you have a specific pipeline you plan to use, I'd suggest:

callgenes.sh in=assembly.fa out=genes.gff passes=2

Snippy seems like a neat tool and I am going to look into it, but as far as I can tell, it's basically a wrapper for FreeBayes, which is a subpar variant caller. If you want to call variants properly, I would recommend aligning your reads to the reference and then calling variants from that alignment using a traditional variant caller. For example:

bbmap.sh in=reads.fq out=mapped.sam ref=ref.fa
callvariants.sh in=mapped.sam out=vars.vcf ref=ref.fa

Then you get all the advantages of paired reads for properly mapping in repetitive areas, the ability to detect long indels, and accurate variant-calling. Of course you can use Snippy too, but I'd advise you to compare its output to other programs. My experience with FreeBayes was that it generated vast quantities of false-positives.

ADD REPLY • link 6 months ago by Brian Bushnell 20k

0


Thank you so much, Brian Bushnell, for such a detailed answer. I am new to this field. Can you share any helpful material (papers, tutorials, or pipelines) that would help me perform bacterial genomics analysis appropriately, e.g. when to use which assembly approach, as you discussed above?

Many thanks!

ADD REPLY • link 6 months ago by analyst ▴ 50

0


All of the samples passed quality control. Do I still need to perform filtering/cleaning? Which tool is best for filtering bacterial WGS reads?

ADD REPLY • link 6 months ago by analyst ▴ 50