Question

Genome Assembly QC from BAM files

1

7 weeks ago

SomeOne ▴ 260

Hello,

I am currently generating genome assemblies for fungal samples from PacBio HiFi data. What I did so far:

1. Generated assemblies using the Flye assembler and ran QUAST + BUSCO to check assembly stats and completeness.
   - Flye gave me really good assemblies, but some of the chromosomes were still split into 2 scaffolds when compared to a reference genome.
2. Generated assemblies using the hifiasm assembler and ran QUAST + BUSCO to check assembly stats and completeness.
   - Hifiasm also gave me good assemblies, and the split scaffolds were coming up as single chromosomes, but these assemblies had too many extra scaffolds.
3. Ran RagTag scaffold, with the Flye assemblies as query input and the hifiasm assemblies as reference input.
   - This resulted in some really good assemblies, and I got down to a really good number of chromosomes. QUAST and BUSCO stats look really good.

Now I was wondering if there is any other way to evaluate the assemblies, to see if there are any misassemblies, collapsed repeat regions, or anything else that should be checked in the genome assemblies.

I have a vague idea that reads are aligned back to the assembly to generate BAM files (which I have done using minimap2 -x ava-hifi), but I am not sure what to look for in these BAM files, or how to evaluate the assemblies further.
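One common thing to look at in such BAM files is read depth along each contig: windows with roughly double the typical depth can hint at collapsed repeats, and windows with very low depth can hint at misjoins or poorly supported sequence. Below is a minimal Python sketch, assuming per-base depth has been exported as tab-separated `contig, position, depth` lines (the format written by `samtools depth -a`). The function names, the window size, and the 0.5x/1.8x thresholds are illustrative assumptions, not recommendations. As a side note, minimap2 documents `map-hifi` as its preset for mapping HiFi reads to a reference, while the `ava-*` presets are meant for all-vs-all read overlap.

```python
from collections import defaultdict
from statistics import median


def window_depths(lines, window=10_000):
    """Average per-base depth into fixed-size windows.

    `lines` yields 'contig<TAB>pos<TAB>depth' strings (1-based pos),
    as written by `samtools depth -a`.
    Returns {(contig, window_index): mean_depth}.
    """
    sums = defaultdict(int)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        contig, pos, depth = line.split("\t")[:3]
        key = (contig, (int(pos) - 1) // window)
        sums[key] += int(depth)
        counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}


def flag_windows(means, low=0.5, high=1.8):
    """Label windows whose mean depth deviates from the median window depth.

    `low` and `high` are multiples of the median; both are illustrative guesses.
    """
    if not means:
        return {}
    typical = median(means.values())
    flagged = {}
    for key, depth in sorted(means.items()):
        if depth < low * typical:
            flagged[key] = "low"
        elif depth > high * typical:
            flagged[key] = "high"
    return flagged
```

With a depth file on disk this could be called as `flag_windows(window_depths(open("depth.tsv")))`, where `depth.tsv` is a hypothetical file name. Flagged windows are only candidates to inspect in a genome browser; mitochondrial contigs, for example, routinely show high depth without being misassembled.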

Any ideas/hints in this regard will be really helpful.

Regards

QC assembly BAM HiFi • 7.9k views

ADD COMMENT • link updated 7 weeks ago by GenoMax 154k • written 7 weeks ago by SomeOne ▴ 260

0

Since this saga has been ongoing for a long time, it would be helpful to add a comment on how you finally got to this point of what seem to be good assemblies.

It sounds like you only used PacBio HiFi data in this final iteration. It would be helpful for others to know for sure, as they decide what type of data to generate (not everyone will have the means to get Illumina/ONT/HiFi data like you seem to have used over time, if I recall right).

ADD REPLY • link 7 weeks ago by GenoMax 154k

0

Hi,

I am not sure if I can call them good assemblies. Based on QUAST and BUSCO stats everything looks almost too good, but I still have doubts, and that is why I wanted to know how assemblies are evaluated further by aligning raw reads to the assembly. (If you can also point out some hints, that would be great.)

Our initial attempt included ONT + Illumina sequencing for some samples to generate assemblies. This gave us good core chromosomes, but the accessory chromosomes were too fragmented,
so we decided to go for PacBio HiFi, as it was now cheaper.
For my own curiosity, I wanted to do assemblies based on at least HiFi + ONT data, but the results were not so good, as the N50 of the ONT data was much lower than the N50 of the HiFi data, and so were the read lengths (HiFi ~15 kb and ONT ~8-9 kb).

So, just using the HiFi data, I generated assemblies using Flye, which gave 30-50 contigs for the initial assemblies, with BUSCO completeness (against fungi_odb12) of 99.7-99.8%. Hifiasm, on the other hand, gave 100 or more contigs, but BUSCO scores were similar at >99.5% completeness.

I tested quickmerge to merge the assemblies, but it did not work out for me. So I tested RagTag scaffold, which seems to work. BUSCO stats were the same, but in QUAST, #contigs was < 30 and N50 was ~4.4 Mb.
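For readers less familiar with the N50 figure that QUAST reports: it is the length L such that contigs of length at least L together cover at least half of the total assembly. A minimal Python sketch of that definition, using made-up contig lengths rather than the lengths from these assemblies:

```python
def n50(lengths):
    """Return the N50 of a list of contig (or read) lengths."""
    if not lengths:
        raise ValueError("no lengths given")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down until half the total is covered.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


# Hypothetical contig lengths in bp, for illustration only.
contigs = [5_000_000, 4_400_000, 3_000_000, 1_200_000, 400_000]
print(n50(contigs))  # prints 4400000
```

The same function applies to read lengths, which is how a read N50 such as the HiFi and ONT values mentioned above is obtained.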

And now I am at the point of analyzing the assemblies further.

If you can also point out something, that would be great.

ADD REPLY • link 7 weeks ago by SomeOne ▴ 260

score 3 · Accepted Answer · 2025-09-23

3

7 weeks ago

colindaven 8.1k

You can have a look at the tools in the PAQman pipeline (https://github.com/SAMtoBAM/PAQman), and maybe also Inspector (https://github.com/Maggi-Chen/Inspector), to evaluate assembly QC.
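For a rough idea of the kind of signal read-based evaluation can draw on, one can count reads with large clipped ends per window of the assembly: a pile-up of clipped reads at one position may mark a misjoin. The following is a minimal Python sketch that parses SAM text (for example from `samtools view`). It is not how PAQman or Inspector are implemented, and the function names, the window size, and the 500 bp clip threshold are assumptions for illustration.

```python
import re
from collections import Counter

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def end_clips(cigar):
    """Return (left, right) clipped lengths (soft + hard) of a CIGAR string."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    left = right = 0
    i = 0
    while i < len(ops) and ops[i][1] in "SH":
        left += ops[i][0]
        i += 1
    j = len(ops) - 1
    while j >= i and ops[j][1] in "SH":
        right += ops[j][0]
        j -= 1
    return left, right


def clipped_read_windows(sam_lines, window=1000, min_clip=500):
    """Count read ends clipped by >= min_clip bases per (contig, window)."""
    hits = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # header line
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 6:
            continue  # malformed record
        flag, contig, pos, cigar = int(fields[1]), fields[2], int(fields[3]), fields[5]
        if flag & 0x4 or contig == "*":
            continue  # unmapped read
        left, right = end_clips(cigar)
        # Reference bases consumed by the alignment (M, D, N, =, X).
        ref_len = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "MDN=X")
        if left >= min_clip:
            hits[(contig, (pos - 1) // window)] += 1
        if right >= min_clip:
            end = pos + max(ref_len, 1) - 1  # last aligned reference base, 1-based
            hits[(contig, (end - 1) // window)] += 1
    return hits
```

Clipped reads at the very ends of contigs are expected, so windows in the interior of a contig are the more interesting ones to inspect.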

ADD COMMENT • link 7 weeks ago by colindaven 8.1k

0

Thank you. Based on the descriptions, these two look like exactly what I wanted.

ADD REPLY • link 7 weeks ago by SomeOne ▴ 260

0

Hello colindaven,

I ran the Inspector tool you mentioned on the genome assemblies. Below are the stats for one of my samples (the others were similar). Can you comment on these, if they look good?

[Image: Inspector statistics for one sample]

ADD REPLY • link 7 weeks ago by SomeOne ▴ 260

1

Yes, they look excellent.

ADD REPLY • link 7 weeks ago by colindaven 8.1k

0

"Can you comment on these, if they look good?"

More importantly, how do these assemblies compare (contiguity/number of chromosomes/size) to the RefSeq counterparts? Or are these previously unknown/unavailable organisms (unlikely, but possible)?

ADD REPLY • link 7 weeks ago by GenoMax 154k

0

There are some reference assemblies. Some closely related ones are at chromosome level, but the exact references are not chromosome-level assemblies.

Other than that, even with the closest reference, the core chromosomes are conserved and they look really good. The issue was with the accessory chromosomes, which are not complete in the reference genomes (example attached).

[Image: genome-to-genome alignment plot, reference (left) vs. sample (right)]

In this image, on the left is the reference genome, which is a chromosome-level assembly and well annotated. It has 11 core chromosomes and 1 accessory chromosome (CHR7). On the right we have another sample from the same species (verified via sequence type). As you can see, the core is there and conserved, but we have more accessory chromosomes than the reference.

P.S.: The plot is based on genome-to-genome alignment via NUCmer, keeping alignments with length >1000 and identity >95%.
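The filter described in the P.S. can be sketched in Python as follows, assuming the NUCmer alignments have been exported with `show-coords -T -H` (tab-delimited, no header), where the first seven columns are reference start, reference end, query start, query end, reference alignment length, query alignment length, and percent identity, and the last two columns are the reference and query sequence names. The function name and the choice to test the reference-side alignment length are assumptions.

```python
def filter_coords(lines, min_len=1000, min_identity=95.0):
    """Keep alignments with length > min_len and identity > min_identity."""
    kept = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9:
            continue  # skip blank or malformed lines
        ref_len = int(fields[4])      # alignment length on the reference
        identity = float(fields[6])   # percent identity
        if ref_len > min_len and identity > min_identity:
            kept.append({
                "ref": fields[-2],
                "query": fields[-1],
                "ref_start": int(fields[0]),
                "ref_end": int(fields[1]),
                "query_start": int(fields[2]),
                "query_end": int(fields[3]),
                "length": ref_len,
                "identity": identity,
            })
    return kept
```

Grouping the kept rows by `query` and `ref` then shows which contigs map to which reference chromosome, and which contigs (such as extra accessory chromosomes) have little or no counterpart in the reference.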

ADD REPLY • link 7 weeks ago by SomeOne ▴ 260

1

This may be another example where the assembly you obtained is better than an existing one, certainly as far as the accessory chromosomes go.

You should consider submitting the raw data and assembly to NCBI, sooner rather than later.

ADD REPLY • link 7 weeks ago by GenoMax 154k