Hello,
I have some fungal genomes assembled from HiFi reads. Although the assemblies look really good (N50 > 4 Mb and BUSCO completeness > 99.5%), I also wanted to look at their QV scores.
For this I ran Inspector, which calculates QV from structural errors and small-scale errors such as base substitutions. The QV scores were fairly high (QV > 65), but for one sample the score was 49.51.
Note that I used the same pipeline and the same flags in every tool for all 11 samples.
My pipeline was roughly as follows:
HiFi BAM > convert to FASTQ/FASTA > hifiasm > purge_dups > mito-remove > coverage QC and removal of high/low-coverage contigs > polishing (NextPolish2 using HiFi and Illumina reads) > QC with Inspector
The Inspector stats look like this
I am not sure why there are still some small-scale assembly errors, specifically for sample PB02, even after one round of polishing.
I also tried to calculate QV scores using meryl + Merqury v1.3 with the following commands:
## 1. Get the right k size
genome_size=$(awk '/^>/ {next} {n+=length} END{print n}' "$current_asmFASTA")
K=$(best_k.sh "$genome_size" | tail -n1 | awk '{print int($1+0.9999)}')
## 2. Build k-mer dbs with meryl
meryl count k="$K" threads="$threads" output "$merquryOUT/${sample}_reads.meryl" "$readsFASTQ"
## 3. Run Merqury to get QV and spectra
cd "$merquryOUT" || exit 1
merqury.sh "${sample}_reads.meryl" "${sample}.fasta" "${sample}_merqury"
and got the following results in the <sample>_merqury.qv files.
According to the Merqury wiki (https://github.com/marbl/merqury/wiki/2.-Overall-k-mer-evaluation), the columns are:
- Assembly of interest ("both" is the combination of the two assemblies, when two are supplied)
- Total (present) k-mers uniquely found only in the assembly
- Total (present) k-mers found in the assembly
- QV
- Error rate
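For reference, the QV and error rate columns can be recomputed from the two k-mer counts, which is a quick way to confirm that the columns are being read correctly. The sketch below assumes the formula given in the Merqury paper, error = 1 - (1 - asm_only / total)^(1/k) and QV = -10 * log10(error). The helper name `qv_from_counts` is hypothetical and is not part of Merqury.

```shell
# Hypothetical helper (not part of Merqury): recompute a Merqury-style QV
# from one row of a .qv file.
# Usage: qv_from_counts <asm_only_kmers> <total_kmers> <k>
qv_from_counts() {
    awk -v a="$1" -v t="$2" -v k="$3" 'BEGIN {
        if (a == 0) { print "inf"; exit }      # no assembly-only k-mers
        err = 1 - (1 - a / t) ^ (1 / k)        # estimated per-base error rate
        printf "%.2f\n", -10 * log(err) / log(10)
    }'
}
```

For example, `qv_from_counts 21000 1000000000 21` prints `60.00`: 21,000 assembly-only 21-mers out of 1e9 corresponds to roughly one error per million bases.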
QUESTIONS
- Why do the Inspector results show substitution errors even after polishing?
- Am I using Merqury the right way? If not, can you recommend any tutorials or corrections?
Thank you
I will assume that the HiFi sequencing for all strains came from the same batch and that the samples have similar depth, read length, etc.
In that case, I would say the most likely reason is that there is something particular about this strain, especially since the Merqury QV also confirms that there are more 'errors' in it.
My first guess would be some sort of SV, such as a large duplication or aneuploidy with mild heterozygosity, that is collapsed in the assembly. To test this, I would check the coverage across each of the contigs in the assembly.
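One way to run that coverage check is sketched below, assuming minimap2 (2.19 or later, for the `map-hifi` preset) and samtools are installed. The function names, default thread count, and the 0.5x/1.5x cutoffs are illustrative assumptions, not recommendations. A collapsed duplication would be expected to show up as a contig at roughly twice the median depth.

```shell
# Sketch only: function names, thread count and cutoffs are assumptions.

# Map HiFi reads back to the assembly and write per-contig coverage.
# Usage: map_and_cover <assembly.fasta> <reads.fastq> <out_prefix> [threads]
map_and_cover() {
    asm=$1; reads=$2; prefix=$3; threads=${4:-8}
    minimap2 -ax map-hifi -t "$threads" "$asm" "$reads" \
        | samtools sort -@ "$threads" -o "$prefix.bam" -
    samtools index "$prefix.bam"
    samtools coverage "$prefix.bam" > "$prefix.coverage.tsv"
}

# Print contigs whose mean depth (column 7 of `samtools coverage`) is
# below 0.5x or above 1.5x the median contig depth.
# Output columns: contig, mean depth, depth / median.
# Usage: flag_depth_outliers <prefix.coverage.tsv>
flag_depth_outliers() {
    table=$1
    median=$(grep -v '^#' "$table" | cut -f7 | sort -n \
        | awk '{ d[NR] = $1 }
               END { if (NR == 0) exit
                     if (NR % 2) print d[(NR + 1) / 2]
                     else print (d[NR / 2] + d[NR / 2 + 1]) / 2 }')
    awk -F'\t' -v med="$median" '!/^#/ && med > 0 {
        r = $7 / med
        if (r < 0.5 || r > 1.5) printf "%s\t%.1f\t%.2f\n", $1, $7, r
    }' "$table"
}
```

Running `map_and_cover PB02.fasta PB02_hifi.fastq PB02` followed by `flag_depth_outliers PB02.coverage.tsv` (file names hypothetical) lists the contigs worth inspecting more closely, for example in a genome browser or with windowed depth.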
Alternatively, you could check whether the polishing is actually working within all contigs.
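A rough way to do this is to look at QV per contig rather than for the whole assembly. Merqury should also write a per-sequence QV table next to the overall one. The sketch below assumes that table is tab-separated with the columns name, assembly-only k-mers, total k-mers, QV, and error rate, so please check the file name and layout in your own output directory first. The helper name `low_qv_contigs` and the default cutoff of 50 are hypothetical.

```shell
# Hypothetical helper: list contigs whose per-sequence QV is below a cutoff,
# lowest QV first.
# Assumed input columns (tab-separated):
#   name, assembly-only k-mers, total k-mers, QV, error rate
# Usage: low_qv_contigs <per_sequence_qv_table> [cutoff]
low_qv_contigs() {
    per_seq_qv=$1; cutoff=${2:-50}
    # Contigs with no assembly-only k-mers have an infinite QV; skip them.
    awk -F'\t' -v min="$cutoff" 'tolower($4) !~ /inf/ && $4 + 0 < min' "$per_seq_qv" \
        | sort -t "$(printf '\t')" -k4,4n
}
```

If only a handful of contigs fall below the cutoff, the problem is probably local (for example a collapsed or divergent region). If most contigs do, the polishing step itself may not have worked for this sample.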