Parser for PAF to find structural variants
3.1 years ago
crimsontabaq ▴ 70

I need to evaluate two large (1 Gb) chromosome-level assemblies of the same genome by finding large structural variations between them (duplications, inversions, deletions, etc.). I am trying to use minimap2 to get this sort of statistics (similar to the somewhat classical nucmer + show-diff approach), but I couldn't find any parser for .paf files (only paftools.js from minimap2's creator, which does not produce the desired statistics). Converting .paf to .delta and running dnadiff works only imperfectly.

Do you know of any parser for .paf files that finds structural variations? Or a workaround for the problem of comparing two large assemblies? Many thanks!
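To make the request concrete, here is a minimal sketch of what such a parser could look like. It reads the 12 mandatory PAF columns and flags crude SV candidates from alignment blocks. The function names, thresholds, and the heuristic itself are assumptions for illustration, not a validated caller: a reverse-strand block is reported as a candidate inversion (which would over-report if a whole chromosome is reverse-complemented between the assemblies), and a large gap on one sequence between consecutive same-strand blocks is reported as a candidate deletion or insertion relative to the target.

```python
from collections import namedtuple

# The 12 mandatory PAF columns; coordinates are 0-based, end-exclusive.
PafRecord = namedtuple(
    "PafRecord",
    "qname qlen qstart qend strand tname tlen tstart tend nmatch alen mapq",
)


def parse_paf_line(line):
    """Parse one PAF line, ignoring any optional tag columns."""
    f = line.rstrip("\n").split("\t")
    return PafRecord(
        f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
        f[5], int(f[6]), int(f[7]), int(f[8]),
        int(f[9]), int(f[10]), int(f[11]),
    )


def candidate_svs(records, min_block=10_000, min_gap=10_000, min_mapq=20):
    """Return (kind, tname, start, end) candidates in target coordinates.

    Hypothetical heuristic: thresholds are arbitrary defaults.
    """
    # Keep only long, confidently placed blocks.
    kept = [r for r in records
            if r.mapq >= min_mapq and r.tend - r.tstart >= min_block]
    kept.sort(key=lambda r: (r.tname, r.qname, r.tstart))
    out = []
    for prev, cur in zip([None] + kept, kept):
        if cur.strand == "-":
            out.append(("inversion", cur.tname, cur.tstart, cur.tend))
        # Gap checks only make sense within one query/target pair.
        if prev is None or (prev.tname, prev.qname) != (cur.tname, cur.qname):
            continue
        if prev.strand != "+" or cur.strand != "+":
            continue
        tgap = cur.tstart - prev.tend   # unaligned target between blocks
        qgap = cur.qstart - prev.qend   # unaligned query between blocks
        lo, hi = sorted((prev.tend, cur.tstart))
        if tgap >= min_gap and abs(qgap) < min_gap:
            out.append(("deletion", cur.tname, lo, hi))
        elif qgap >= min_gap and abs(tgap) < min_gap:
            out.append(("insertion", cur.tname, lo, hi))
    return out
```

With a PAF file produced by something like `minimap2 -x asm5 ref.fa query.fa > aln.paf` (file names here are placeholders), the records could be loaded with `[parse_paf_line(l) for l in open("aln.paf") if l.strip()]` and passed to `candidate_svs`. Duplications and translocations would need additional logic that this sketch does not attempt.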

parser assembly evaluation • 2.2k views

Why not use sam/bam files?

Good idea, SAM is much more widely used. Can you suggest a particular way of doing the task with SAM? I am still not sure whether I should do a local alignment or a global whole-genome one. Much obliged!

I would suggest taking a look at this approach: https://github.com/lh3/CHM-eval/tree/master/dip-call

Thanks a lot, great utility. But I was looking for large SV discovery, and what you shared reports only small variants: SNPs and indels.

I have a script that converts PAF to delta. I have not tested variant calling on the resulting delta file, though. https://github.com/gorliver/paf2delta

2.7 years ago

Hi,

I suggest NucDiff or SyRI, which can detect long variants (as long as the alignments allow) based on the assembly alignment. However, I didn't get good results from them. I compared two ~100 Mb chromosomes and didn't even get an alignment result when following the suggested pipeline! That said, SyRI was fast enough on the A. thaliana example data. You can try them and maybe tell me about your runtime, etc.

Best,

Shangzhe

Hi. I am curious to know what the issues with these methods were. It would be great if you could share why the results were not good.

Sorry for the delay. They didn't perform well on large genomes, such as human. They take a lot of resources.

2.7 years ago

Dotplots might be useful if you have long and collinear contigs (Nanopore, PacBio). If Illumina, well, forget it. One pretty good dotplot implementation is in the GUI package UGENE.
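If a PAF file from a whole-genome alignment is already at hand, the line segments a dotplot would draw can be pulled out of it directly. The sketch below is only an illustration under stated assumptions: the function name and the length cutoff are made up, and it returns plain coordinate pairs rather than drawing anything. Because PAF reports query coordinates on the forward strand, a reverse-strand alignment becomes an anti-diagonal segment, which is how inversions show up visually.

```python
def paf_to_segments(lines, min_aln_len=10_000):
    """Turn PAF lines into dotplot segments ((x0, y0), (x1, y1)).

    x is the target coordinate, y is the query coordinate.
    Hypothetical helper; min_aln_len is an arbitrary default.
    """
    segments = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip blank or malformed lines
        qstart, qend = int(f[2]), int(f[3])
        strand = f[4]
        tstart, tend = int(f[7]), int(f[8])
        if tend - tstart < min_aln_len:
            continue  # drop short blocks that only add noise
        if strand == "+":
            segments.append(((tstart, qstart), (tend, qend)))
        else:
            # Reverse strand: target start pairs with query end.
            segments.append(((tstart, qend), (tend, qstart)))
    return segments
```

Each segment can then be drawn with any plotting library as a straight line between its two endpoints. Rising lines are collinear blocks, falling lines are inverted ones, and breaks or offsets between neighbouring lines hint at insertions, deletions, or rearrangements.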

It's not an easy approach though. I believe assembly to assembly genome multimappings are a somewhat poor and unloved area of bioinformatics.