Question

Recommended workflow for identifying the genomic location and copy number of an insert with a known sequence from WGS Nanopore fastq files

7 months ago

dk0319 ▴ 70

I generated a new transgenic mouse through random multi-copy integration of a 10.316 kb DNA fragment with a known sequence. We performed WGS on a PromethION flow cell. The core sent me 500+ fastq files, which I have since merged.

So far I have performed:

de novo assembly with Flye to produce a denovo_assembly.fasta file with all the haplotypes using

flye --nano-raw \
merged.fastq.gz \
--genome-size 2.6g \
--keep-haplotypes \
--scaffold \
--threads 128 \
--out-dir ./flye_output_haplo

I then aligned my insert.fasta file to the denovo_assembly.fasta using

minimap2 -t 196 denovo_assembly.fasta insert.fasta > alignment.paf

which produced the following output (PAF format, since minimap2 only writes SAM when -a is given)

Insert  10316   2   10311   +   contig_1719 4057986 20474   30784   10230   10329   0   tp:A:P  cm:i:1848   s1:i:10223  s2:i:10218  dv:f:0.0017 rl:i:0
Insert  10316   19  10311   +   contig_1719 4057986 10169   20468   10225   10312   0   tp:A:S  cm:i:1858   s1:i:10218  dv:f:0.0013 rl:i:0
Insert  10316   60  10207   -   contig_1719 4057986 9   10159   10074   10166   0   tp:A:S  cm:i:1827   s1:i:10068  dv:f:0.0014 rl:i:0
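The three records above can also be summarized programmatically rather than read by eye. Below is a minimal Python sketch, assuming the standard PAF column order (query name, query length, query start, query end, strand, target name, target length, target start, target end, ...). The function name `summarize_paf_hits` and the 0.9 coverage cutoff are made up for illustration.

```python
from collections import defaultdict

# The three records from the post (whitespace-separated as pasted;
# real PAF files are tab-separated, which split() also handles).
PAF_TEXT = """\
Insert  10316   2   10311   +   contig_1719 4057986 20474   30784   10230   10329   0   tp:A:P
Insert  10316   19  10311   +   contig_1719 4057986 10169   20468   10225   10312   0   tp:A:S
Insert  10316   60  10207   -   contig_1719 4057986 9   10159   10074   10166   0   tp:A:S
"""


def summarize_paf_hits(paf_text, min_query_fraction=0.9):
    """Group PAF hits by target and flag hits covering most of the query.

    min_query_fraction is an arbitrary illustrative cutoff, not a standard.
    """
    hits = defaultdict(list)
    for line in paf_text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank or truncated lines
        qlen, qstart, qend = int(fields[1]), int(fields[2]), int(fields[3])
        fraction = (qend - qstart) / qlen
        hits[fields[5]].append({
            "strand": fields[4],
            "target_start": int(fields[7]),
            "target_end": int(fields[8]),
            "query_fraction": fraction,
            "full_length": fraction >= min_query_fraction,
        })
    # Order hits along each contig so tandem arrays are easy to spot.
    for target_hits in hits.values():
        target_hits.sort(key=lambda h: h["target_start"])
    return dict(hits)


if __name__ == "__main__":
    for contig, contig_hits in summarize_paf_hits(PAF_TEXT).items():
        n_full = sum(h["full_length"] for h in contig_hits)
        print(f"{contig}: {len(contig_hits)} hits, {n_full} near full length")
        for h in contig_hits:
            print(f"  {h['target_start']}-{h['target_end']} {h['strand']} "
                  f"({h['query_fraction']:.1%} of insert)")
```

On these records the sketch reports three near-full-length hits lying back to back on contig_1719, the first of them in reverse orientation.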

From this I gathered that my insert is on contig_1719 of my de novo assembly.

To identify which chromosome contig_1719 belongs to, I ran

minimap2 -cx asm5 -t196 --cs GRCm39.primary_assembly.genome.fa.gz denovo_assembly.fasta > asm.paf
paftools.js call asm.paf > var.txt

This generated a text file of variant calls for the assembly contigs relative to the reference, such as this line for contig_1719

V   chr3    42255839    42255840    1   60  a   -   contig_1719 3151947 3151947 +
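To make a line like the one above easier to interpret, here is a small Python sketch that names its fields and tallies which reference sequence each contig is called against. It assumes my reading of the `paftools.js call` V-line layout (V, reference name, start, end, query depth, mapping quality, REF allele, ALT allele, query name, query start, query end, query strand). The names `VariantLine`, `parse_v_line` and `chromosomes_per_contig` are hypothetical.

```python
from collections import Counter, defaultdict
from typing import NamedTuple


class VariantLine(NamedTuple):
    # Field names are an assumed reading of the paftools.js call V-line layout.
    ref_name: str
    ref_start: int
    ref_end: int
    query_depth: int
    mapq: int
    ref_allele: str
    alt_allele: str
    query_name: str
    query_start: int
    query_end: int
    query_strand: str


def parse_v_line(line):
    """Parse one whitespace- or tab-separated V line into named fields."""
    fields = line.split()
    if len(fields) != 12 or fields[0] != "V":
        raise ValueError(f"not a 12-column V line: {line!r}")
    return VariantLine(
        fields[1], int(fields[2]), int(fields[3]), int(fields[4]),
        int(fields[5]), fields[6], fields[7], fields[8],
        int(fields[9]), int(fields[10]), fields[11],
    )


def chromosomes_per_contig(lines):
    """Count, for each contig, how many V lines fall on each reference sequence."""
    counts = defaultdict(Counter)
    for line in lines:
        if line.split()[:1] == ["V"]:
            v = parse_v_line(line)
            counts[v.query_name][v.ref_name] += 1
    return counts


# The single line quoted in the post.
V_LINE = "V   chr3    42255839    42255840    1   60  a   -   contig_1719 3151947 3151947 +"

if __name__ == "__main__":
    print(parse_v_line(V_LINE))
    print(chromosomes_per_contig([V_LINE])["contig_1719"].most_common(1))
```

A single V line only anchors one position of the contig. Tallying all V lines for contig_1719, or inspecting its alignments in asm.paf directly, would give a fuller picture of where the contig sits.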

I then printed just the portion of the denovo_assembly.fasta corresponding to contig_1719 using

 awk '/contig_1719/{x=NR+68000}(NR<=x){print}' denovo_assembly.fasta > contig_1719.txt

And uploaded it into Benchling, where I then performed auto-annotation to mark my insert
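One caveat about the awk extraction step: the pattern matches any header line containing contig_1719 (for example contig_17190), and the fixed 68000-line window depends on how the FASTA is wrapped. Below is a minimal Python sketch of a stricter alternative that matches the contig ID exactly and reads up to the next header. The function name `extract_fasta_record` and the toy sequences are made up for illustration.

```python
import io


def extract_fasta_record(handle, name):
    """Return (header, sequence) for the record whose ID is exactly `name`.

    The ID is the first whitespace-delimited token after '>'.
    Returns None if no such record exists.
    """
    header = None
    chunks = []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if header is not None:
                break  # reached the record after the one we wanted
            tokens = line[1:].split()
            if tokens and tokens[0] == name:
                header = line
        elif header is not None:
            chunks.append(line)
    if header is None:
        return None
    return header, "".join(chunks)


# Toy FASTA (hypothetical sequences) with a deliberately similar contig name.
TOY_FASTA = ">contig_17190\nAAAA\nCCCC\n>contig_1719\nACGT\nTTGA\n>contig_2\nGGGG\n"

if __name__ == "__main__":
    print(extract_fasta_record(io.StringIO(TOY_FASTA), "contig_1719"))
```

With a real assembly, the handle would come from `open("denovo_assembly.fasta")`. The same exact-name extraction is also available from the command line with `samtools faidx denovo_assembly.fasta contig_1719`.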

The above workflow was successful in that contig_1719 contains approximately 4 copies of my insert (3 complete and 1 partial), based on searching for pieces of the insert sequence in the contig_1719.txt file.

However, the outlined workflow is a novice attempt to identify the location and copy number of my insert. If someone with more experience with this procedure, or with Nanopore sequencing in general, could recommend improvements or share their own workflow, it would be greatly appreciated.

Nanopore WGS Long-Read • 508 views

updated 7 months ago by Ram 43k • written 7 months ago by dk0319 ▴ 70