I performed WGS on a new transgenic mouse model generated by random integration of a 10.316 kb transgene, with the goal of locating the insertion site(s) and calculating the copy number.
The PromethION flow cell produced 54.86 Gb of output across 2.37 million reads, and I received the data as 595 basecalled FASTQ files.
My workflow thus far:
1) Merge the 595 FASTQ files into a single merged.fastq.gz
2) Generate denovo_assembly.fasta with flye:
flye --nano-raw merged.fastq.gz --genome-size 2.6g --threads 129 --out-dir ./flye_output
3) Align the 10.316 kb insert.fasta against denovo_assembly.fasta:
minimap2 -c -P -L --cs=long --frag=yes --rmq=yes -t129 denovo_assembly.fasta insert.fasta > insert_to_denovo.paf
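For reference, step 1 could be sketched in Python roughly as follows. This is only an illustration, not necessarily how the merge was actually done; it assumes the input files are either plain or gzipped FASTQ, and the function name `merge_fastq` is hypothetical.

```python
import gzip
import shutil
from pathlib import Path


def merge_fastq(inputs, output):
    """Concatenate FASTQ files (plain or gzipped) into one gzipped FASTQ.

    Assumption: each input ends with a newline, so records from
    consecutive files do not run together.
    """
    with gzip.open(output, "wb") as out:
        for path in inputs:
            path = Path(path)
            # Choose the opener from the file extension (hypothetical convention).
            opener = gzip.open if path.suffix == ".gz" else open
            with opener(path, "rb") as src:
                shutil.copyfileobj(src, out)


# Example call (directory name is hypothetical):
# merge_fastq(sorted(Path("fastq_pass").glob("*.fastq*")), "merged.fastq.gz")
```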
First record of insert_to_denovo.paf:
Insert 10316 12 10316 + contig_1535 4057936 10162 20473 10289 10322 8 NM:i:33 ms:i:19639 AS:i:20448 nn:i:0 tp:A:P cm:i:1858 s1:i:10218 s2:i:10223 de:f:0.0017 rl:i:0 cg:Z:3911M2I15M4I261M1I248M1I980M1D27M1D1934M1I210M1I755M10D1M2D15M1D4M2D497M1D1185M1I250M cs:Z:=AGCTNNNNN......
From insert_to_denovo.paf, I can see that my 10.316 kb insert.fasta aligns to contig_1535 of denovo_assembly.fasta with an alignment block length of 10.322 kb. This is followed by a second alignment of length 10.333 kb, a third of length 10.206 kb, and a fourth, truncated alignment of length 5.137 kb.
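To make the coordinates of each alignment easier to compare, the mandatory PAF columns can be pulled out with a few lines of Python. This is a minimal sketch: the names `PafHit` and `parse_paf_line` are hypothetical, and the example string is just the first 12 columns of the record shown above. Columns 8 and 9 give the start and end on the contig, and column 11 is the alignment block length (which includes gaps, so it can exceed the span on the contig).

```python
from typing import NamedTuple


class PafHit(NamedTuple):
    query: str
    strand: str
    target: str
    t_start: int    # column 8, 0-based start on the target
    t_end: int      # column 9, end on the target
    matches: int    # column 10, number of matching bases
    block_len: int  # column 11, alignment block length (gaps included)


def parse_paf_line(line: str) -> PafHit:
    """Parse the 12 mandatory PAF columns; optional tags are ignored."""
    # PAF is tab-separated; split() also tolerates spaces from copy-paste.
    f = line.split()
    return PafHit(f[0], f[4], f[5], int(f[7]), int(f[8]), int(f[9]), int(f[10]))


record = "Insert 10316 12 10316 + contig_1535 4057936 10162 20473 10289 10322 8"
hit = parse_paf_line(record)
print(hit.target, hit.strand, hit.t_start, hit.t_end, hit.t_end - hit.t_start)
```

Sorting all hits on `t_start` would list the copies in their order along contig_1535, with the strand column showing their relative orientation.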
I then extracted the sequence of contig_1535 from denovo_assembly.fasta with awk to make contig_1535.fasta.
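The extraction itself was done with awk; an equivalent sketch in Python is shown here only to make the step explicit. The function name `extract_record` is hypothetical, and it assumes the contig name is the first whitespace-delimited token of the FASTA header.

```python
def extract_record(fasta_path, name, out_path):
    """Copy the FASTA record whose header ID equals `name` to out_path.

    Returns True if the record was found.
    """
    found = False
    writing = False
    with open(fasta_path) as src, open(out_path, "w") as out:
        for line in src:
            if line.startswith(">"):
                header = line[1:].split()
                # Exact match on the ID, so contig_1535 does not match contig_15350.
                writing = bool(header) and header[0] == name
                found = found or writing
            if writing:
                out.write(line)
    return found


# Example call using the file names from the post:
# extract_record("denovo_assembly.fasta", "contig_1535", "contig_1535.fasta")
```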
I uploaded contig_1535.fasta to Benchling and auto-annotated it using features generated from insert.fasta.
The annotation indicated 3 nearly complete insertions and 1 partial insertion of insert.fasta.
A screenshot of the annotation results (below) shows an insertion pattern of a head-to-head insert (Pink-Green), followed by 3 tail-to-head inserts (Green-Purple-Orange) with the partial insertion sequence appearing last in Orange.
NCBI BLAST of the Grey portion flanking the Orange partial insert indicates that this sequence corresponds to Chr3 of mm39. However, there are no bases to the left of the Pink insert. As a result, I cannot identify the leftmost insertion site.
Is there a method that I can use to identify sequences that overlap with the leftmost portion of contig_1535 that might have been discarded during assembly?
Alternatively, is there another workflow that might keep the leftmost bases, allowing me to find the insertion site?
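To make the first question concrete, one possible approach is sketched below under stated assumptions. It assumes the raw reads have been aligned back to contig_1535.fasta with minimap2 to give a PAF file (called reads_to_contig.paf here, a hypothetical name), and it picks out reads that align at the very start of the contig while still carrying unaligned bases that would extend past its left edge. The function name and both thresholds are illustrative, not recommended values.

```python
def left_overhang_reads(paf_lines, contig, max_t_start=100, min_overhang=1000):
    """Return reads that hang off the left end of `contig`.

    Each result is (read_name, strand, target_start, overhang_bp),
    sorted with the longest overhang first.
    """
    hits = []
    for line in paf_lines:
        f = line.split()
        if len(f) < 12 or f[5] != contig:
            continue
        q_len, q_start, q_end = int(f[1]), int(f[2]), int(f[3])
        strand, t_start = f[4], int(f[7])
        if t_start > max_t_start:
            continue  # alignment does not reach the left end of the contig
        # Query coordinates are always on the read's original strand, so for
        # '-' hits the bases left of the contig start are at the read's 3' end.
        overhang = q_start if strand == "+" else q_len - q_end
        if overhang >= min_overhang:
            hits.append((f[0], strand, t_start, overhang))
    return sorted(hits, key=lambda h: -h[3])


# Example call (file name is hypothetical):
# with open("reads_to_contig.paf") as fh:
#     candidates = left_overhang_reads(fh, "contig_1535")
```

The unaligned overhang of any read returned this way could then be extracted and searched against mm39, in the same way as the Grey flanking sequence, to see whether it contains genomic sequence or further copies of the insert.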