Entering edit mode
6 days ago
xinguok794
▴
10
Hi, I input the genome purged.fa (QV=52) assembled by hifiasm and purged by haplotig into YAHS, and the resulting scaffolds_final.fa was directly reduced to 23 after passing through Merqury, I want to know what caused this? Below is my code:
#!/bin/bash
set -e
# --- Set variables (please check paths and filenames) ---
ASSEMBLY_FA="purged.fasta"
HIC_R1=/users/jieqyan/workspace/assembly/ref/Hi-C/AT_HiC_R1.clean.fastq.gz
HIC_R2=/users/jieqyan/workspace/assembly/ref/Hi-C/AT_HiC_R2.clean.fastq.gz
# Note: PICARD_JAR variable removed, using 'picard' as the executable command
THREADS=20
echo "--- Stage 1: Hi-C preprocessing begins ---"
# 1. Index the cleaned contigs
echo "1. Indexing the cleaned contigs..."
bwa index $ASSEMBLY_FA
samtools faidx $ASSEMBLY_FA
# 2. Align Hi-C reads and generate sorted BAM
echo "2. Aligning Hi-C reads and generating sorted BAM..."
bwa mem -5SP -t $THREADS $ASSEMBLY_FA $HIC_R1 $HIC_R2 | \
samtools view -@ 4 -bS - | \
samtools sort -@ 4 -o hic.sorted.bam
# 3. Mark and remove PCR duplicates
echo "3. Picard MarkDuplicates removing duplicates..."
# Ensure -Xmx comes before, using 'picard' command
picard -Xmx32g MarkDuplicates \
I=hic.sorted.bam \
O=hic_dedup.bam \
M=metrics.txt \
REMOVE_DUPLICATES=true
# 4. Filter and sort by read name
echo "4. Filtering non-unique mappings and sorting by name..."
samtools view -F 0x904 -h -b hic_dedup.bam | \
samtools sort -@ 4 -n -o hic_clean.namesort.bam
echo "--- Stage 1 completed. Output: hic_clean.namesort.bam ---"
yahs purged.fasta hic_clean.namesort.bam -o yahs_scaffolding_output
I would be very grateful if you could offer me any advice!
Looks like YAHS does an initial error correction step which you can skip with (--no-contig-ec)
Can you see if skipping the correction helps with the merqury quality?
Thank you for your reply! Unfortunately, despite running YAHS with
--no-contig-ec, the QV value of the resulting FASTA file did not improve further; in fact, it decreased further (QV: 19). Besides QV, I also evaluated LAI and busco. The results showed that both of these metrics were quite good (LAI=25, busco=100), but the QV value remained low. Strangely, after polishing the FASTA file again, the QV value rebounded to 52, but the fragmentation of the FASTA became even higher. I suspect that the significant decrease in QV value after HiC scaffolding might be an illusion, perhaps YAHS performed some kind of special masking process that affected Merqury's judgment?I don't believe merqury ignores softmasked (lower case) nucleotides so I don't think that is the case.
From what it sounds like, during the scaffolding, a lot of 'errors' are probably introduced at the scaffolded region. So, the number of variants in these localised positions are decreasing the QV value. You could check if the post-scaffolding polishing is only modifying variants at these locations. In terms of the assembly becoming more fragmented after polishing, this would only be the case if you allow scaffolds/contigs to be broken during this polishing step. You could therefore not allow the breaking of these scaffolds and just error correct bases. Then check again the QV.
Hope that is helpful