Question

Recommendations for extending contigs from denovo assembly to identify SV insertion sites on chromosome

21 months ago

dk0319 ▴ 70

I performed WGS of a new transgenic mouse model generated by random integration of a 10.316Kb transgene, with the goal of locating the insertion site(s) and calculating copy number.

The PromethION flowcell produced 54.86Gb of output across 2.37 million reads, and I received the data as 595 basecalled fastq files.

My workflow thus far:

1) merged 595 fastq into a single merged.fastq.gz

2) generate denovo_assembly.fasta with flye:

 flye --nano-raw merged.fastq.gz --genome-size 2.6g --threads 129 --out-dir ./flye_output

3) align 10.316Kb insert.fasta against denovo_assembly.fasta:

 minimap2 -c -P -L --cs=long --frag=yes --rmq=yes -t129 denovo_assembly.fasta insert.fasta > insert_to_denovo.paf

First record of insert_to_denovo.paf:

 Insert 10316   12  10316   +   contig_1535 4057936 10162   20473   10289   10322   8   NM:i:33 ms:i:19639  AS:i:20448  nn:i:0  tp:A:P  cm:i:1858   s1:i:10218  s2:i:10223  de:f:0.0017 rl:i:0  cg:Z:3911M2I15M4I261M1I248M1I980M1D27M1D1934M1I210M1I755M10D1M2D15M1D4M2D497M1D1185M1I250M  cs:Z:=AGCTNNNNN......

From insert_to_denovo.paf I can see that my 10.316Kb insert.fasta aligns to contig_1535 of denovo_assembly.fasta with an alignment block of 10.322Kb. This is followed by a second alignment of 10.333Kb, a third of 10.206Kb, and a fourth, truncated alignment of 5.137Kb.
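A minimal Python sketch for putting hits like these into contig order, so that the leftmost copy and the number of contig bases in front of it are explicit. It assumes only the 12 mandatory PAF columns; `PafHit`, `parse_paf_line`, and `hits_on_target` are hypothetical helper names, not part of minimap2.

```python
from typing import Iterable, List, NamedTuple


class PafHit(NamedTuple):
    query: str
    query_len: int
    query_start: int
    query_end: int
    strand: str
    target: str
    target_len: int
    target_start: int
    target_end: int
    matches: int
    block_len: int
    mapq: int


def parse_paf_line(line: str) -> PafHit:
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    # PAF is tab-separated, but splitting on any whitespace also copes with
    # records that were pasted with spaces instead of tabs.
    f = line.split()
    return PafHit(f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
                  f[5], int(f[6]), int(f[7]), int(f[8]),
                  int(f[9]), int(f[10]), int(f[11]))


def hits_on_target(lines: Iterable[str], target: str) -> List[PafHit]:
    """Return all hits on one contig, sorted left to right along the contig."""
    hits = [parse_paf_line(l) for l in lines if l.strip()]
    return sorted((h for h in hits if h.target == target),
                  key=lambda h: h.target_start)
```

For the record shown above, `target_start` is 10162, i.e. that particular hit begins 10,162 bases into contig_1535 in contig coordinates; whether it is the leftmost copy depends on the other records in the file.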

I then extracted the fasta sequence for contig_1535 from denovo_assembly.fasta with awk to make contig_1535.fasta.
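The extraction step could also be sketched in Python instead of awk. This is only an illustration, not the command I actually ran; `extract_fasta_record` is a hypothetical helper, and it assumes the contig name is the first whitespace-delimited token of the header line.

```python
from typing import Iterable, Optional


def extract_fasta_record(lines: Iterable[str], name: str) -> Optional[str]:
    """Return the sequence of the record whose ID equals `name`, or None."""
    parts = []
    found = False
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if found:
                break  # reached the next record, so the wanted one is complete
            fields = line[1:].split()
            # Exact match on the ID, so contig_1535 does not match contig_15350.
            found = bool(fields) and fields[0] == name
        elif found:
            parts.append(line)
    return "".join(parts) if found else None
```

Usage would look like `extract_fasta_record(open("denovo_assembly.fasta"), "contig_1535")`, with the result written out under a `>contig_1535` header.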

I uploaded contig_1535.fasta to benchling and auto-annotated based on features generated from insert.fasta.

The results of annotation indicated 3 nearly complete insertions and 1 partial insertion of insert.fasta.

A screenshot of the annotation results (below) shows an insertion pattern of a head-to-head insert (Pink-Green), followed by 3 tail-to-head inserts (Green-Purple-Orange) with the partial insertion sequence appearing last in Orange.

NCBI blast of the Grey portion flanking the Orange partial insert indicates that this sequence corresponds to Chr3 of mm39. However, there are no bases to the left of the Pink insert, so I cannot identify the leftmost insertion site.

Is there a method that I can use to identify sequences that overlap with the leftmost portion of contig_1535 that might have been discarded during assembly?

Alternatively, is there another workflow that might keep the leftmost bases, allowing me to find the insertion site?

Nanopore WGS Long-Read SV Assembly • 1.5k views

Updated 21 months ago by Brian Bushnell 20k • written 21 months ago by dk0319 ▴ 70

score 0 · Answer 1 · 2023-09-29

21 months ago

shelkmike ★ 1.6k

Some time ago I made a tool Elloreas (https://github.com/shelkmike/Elloreas) that iteratively extends a contig using long reads. It can be useful for your task.
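Before running any extender, it may be worth checking whether raw reads actually reach past the contig's left end. The sketch below is not how Elloreas works internally; it is a rough, hypothetical filter over a PAF file of reads mapped to the contig (for example from something like `minimap2 -x map-ont contig_1535.fasta merged.fastq.gz`). The names `left_overhang` and `reads_past_left_end` and the 500 bp default threshold are assumptions made for illustration.

```python
from typing import Iterable, List, Tuple


def left_overhang(fields: List[str]) -> int:
    """Unaligned read bases that would extend beyond the contig's first base."""
    qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
    strand = fields[4]
    tstart = int(fields[7])
    # On '+', the read's unaligned prefix lies left of the alignment.
    # On '-', the read is reverse-complemented, so its unaligned suffix does.
    unaligned = qstart if strand == "+" else qlen - qend
    return unaligned - tstart


def reads_past_left_end(paf_lines: Iterable[str], target: str,
                        min_overhang: int = 500) -> List[Tuple[str, int]]:
    """List (read name, overhang) for reads hanging off the contig's left end."""
    best = {}
    for line in paf_lines:
        fields = line.split()
        if len(fields) < 12 or fields[5] != target:
            continue
        overhang = left_overhang(fields)
        if overhang >= min_overhang:
            best[fields[0]] = max(overhang, best.get(fields[0], 0))
    # Longest overhang first: these reads carry the most sequence beyond the contig.
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)
```

Reads returned by this filter are only candidates: the overhanging part could be genomic flank, further transgene copies, or a chimeric read, so it would still need to be aligned against mm39 and the insert.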

score 0 · Answer 2 · 2023-10-16

BBTools has an assembler, Tadpole, that can extend sequences using k-mers from other reads:

tadpole.sh in=contigs.fa out=extended.fa extendleft=100 extendright=100 mode=extend extra=reads.fq k=75 ibb=f

It won't extend through forward branches in the graph, although it can extend through backward branches if you set "ibb=t" instead of "ibb=f". Tadpole is designed for very accurate reads, but it will work with Nanopore data if the read quality is sufficiently high and k is sufficiently low.
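To get a feel for why k and read quality trade off, here is a back-of-the-envelope sketch. It assumes errors are independent with a uniform per-base error rate, which real Nanopore reads do not strictly satisfy, and `error_free_kmer_fraction` is a hypothetical helper, not part of BBTools.

```python
def error_free_kmer_fraction(k: int, per_base_error: float) -> float:
    """Expected fraction of k-mers containing no errors: (1 - e) ** k."""
    if k < 1 or not 0.0 <= per_base_error <= 1.0:
        raise ValueError("k must be >= 1 and per_base_error within [0, 1]")
    return (1.0 - per_base_error) ** k


# Compare the k used in the command above with a shorter k
# at two illustrative error rates.
for e in (0.05, 0.01):
    for k in (75, 31):
        print(f"e={e:.2f} k={k}: {error_free_kmer_fraction(k, e):.3f}")
```

Under this simple model, only roughly 2% of 75-mers are error-free at a 5% error rate, against about 47% at a 1% error rate, which is why lowering k or using higher-accuracy basecalls matters for k-mer-based extension.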