Question

Autosome/ sex chromosome assembly error

0

Entering edit mode

5 days ago

Megan • 0

Hi,

I'm doing a coverage analysis to identify the W chromosome in a chromosome level genome assembly from a female snake, (the genome on NCBI does not have chromosome names assigned to scaffolds). I found that there was an assembly error in the reference genome, and a chunk of an autosome scaffold is attached to the W scaffold. My question is, what tools can I use to separate out the W from this scaffold and create a new scaffold with just the W? Thanks!

genomics analysis sexchromosome tools • 278 views

ADD COMMENT • link updated 2 days ago by Kevin Blighe 89k • written 5 days ago by Megan • 0

0

Entering edit mode

If you know what the chunk is (I guess you could try define the edges by alignment to autosomes etc), can you not just remove it manually? Are you also sure that it is incorrect and that you are just not looking at some variation?

ADD REPLY • link 5 days ago by samuel.a.odonnell ▴ 630

score 0 · Answer 1 · 2025-11-06

Hi Megan,

Misassemblies like this are unfortunately common in sex chromosome regions (especially in ZW systems like snakes, where W coverage is often ~half that of autosomes in female data). Good catch via coverage analysis—that's often the smoking gun. Since the NCBI assembly lacks chromosome assignments, you'll want a targeted fix to split the scaffold without re-assembling everything. Below, I'll outline a straightforward workflow to identify the breakpoint and extract the true W portion as a new scaffold. This assumes you have the raw reads (Illumina?) and can align them to the problematic scaffold for precise boundaries.

Step 1: Pinpoint the Misassembly Junction

Use read alignments to confirm the autosome chunk and define exact coordinates (e.g., where coverage drops or mapping switches from autosome-like to W-like patterns).

Align reads to the full scaffold:
If not already done, map your female RNA-seq or DNA-seq reads (whichever you used for coverage) to the assembly with bwa mem or bowtie2:
```
bwa mem -t 8 reference.fa reads_R1.fastq reads_R2.fastq | samtools sort -o aligned.bam
samtools index aligned.bam
```
(Replace reference.fa with your full assembly FASTA.)
Compute per-base coverage:
Use mosdepth (fast and accurate) or samtools depth to generate a coverage track:
```
mosdepth -n -t 8 coverage aligned.bam
```
This outputs a aligned.regions.bed file with mean coverage per 250bp window (adjust with --window-size). Plot it in IGV or R (e.g., ggplot2) alongside BLAST hits of the scaffold against known autosomes— the junction should show a sharp coverage shift (autosome: full depth; W: ~50%).
Validate with BLAST:
Extract suspect regions (e.g., high-coverage chunk) and BLASTn against NCBI nt or your autosome scaffolds:
```
blastn -query suspect_chunk.fa -db nt -outfmt 6 -out blast_hits.txt -num_threads 8
```
This confirms the attached piece is autosomal.

If you have Hi-C data, HiCUP + HiCExplorer can also flag the weak interaction at the junction, but that's overkill if coverage suffices.

Step 2: Split the Scaffold

Once you have coords (e.g., W region: scaffold_123:10000-500000; autosome chunk: scaffold_123:1-9999), extract subsequences. Treat the original scaffold as a "chromosome" for indexing.

Index the assembly:
```
samtools faidx assembly.fa
```
This creates assembly.fa.fai for region extraction.
Extract the true W as a new scaffold:
Use samtools faidx to pull the W portion:
```
samtools faidx assembly.fa scaffold_123:10000-500000 > W_scaffold_new.fa
```
(Adjust coords; this outputs a FASTA entry ready to rename, e.g., >scaffold_W.)
Handle the autosome chunk (optional):
Similarly:
```
samtools faidx assembly.fa scaffold_123:1-9999 > autosome_chunk.fa
```
You can then merge this back into the original autosome scaffold (if known) using cat or a scaffolder like SCAF for proper linking.

For batch processing or scripting:
If you have multiple such errors, use BioPython for flexible slicing:

from Bio import SeqIO
from Bio.SeqRecord import SeqRecord
import sys

# Load full assembly
records = list(SeqIO.parse("assembly.fa", "fasta"))

# For your scaffold (assume index 0 for simplicity)
scaff = records[0]  # or filter by ID
w_start, w_end = 10000, 500000  # 1-based coords, adjust to 0-based for slicing
w_seq = scaff.seq[w_start-1:w_end]

# Create new W record
new_w = SeqRecord(w_seq, id="scaffold_W", description="Extracted W from misassembled scaffold_123")
SeqIO.write(new_w, "W_scaffold_new.fa", "fasta")

# Optionally, update original scaffold by slicing out the chunk
updated_scaff = SeqRecord(scaff.seq[:w_start-1], id=scaff.id, description=scaff.description + " (autosome chunk removed)")
SeqIO.write(updated_scaff, "updated_autosome.fa", "fasta")

Run with python split_scaffold.py.

Step 3: Update and Re-annotate

Replace in assembly: Use seqtk replace or manual sed to swap in the new files, then re-run QUAST or BUSCO for QC.
Re-assign chromosomes: With the clean W, propagate labels using chromosomer or manual based on coverage/syntax.
If automated breaking is preferred: For future-proofing, try REAPR (breaks scaffolds at low read-pair support):
```
reapr pipeline reads_R1.fastq reads_R2.fastq assembly.fa
```
It outputs a breaking.bed file with suggested cuts—great for coverage-discordant junctions.

This should get you a clean W scaffold without much hassle. If the junction is fuzzy (e.g., due to repeats), share a coverage plot or coords for more tailored advice. What snake species/assembly version is this?

Best,
Kevin