Parser for PAF to find structural variants
23 months ago
crimsontabaq ▴ 70

I need to evaluate two large (1 Gb) chromosome-level assemblies of the same genome by finding large structural variations between them (duplications, inversions, deletions, etc.). I am trying to use minimap2 to get these statistics (similar to the classic nucmer + show-diff approach), but I couldn't find any parser for .paf files other than paftools.js from minimap2's author, and it does not produce the statistics I need. Converting .paf to .delta and running dnadiff is somewhat imperfect.

Do you know of any parser for .paf files that finds structural variations? Or a workaround for comparing two large assemblies? Many thanks!

parser assembly evaluation • 1.3k views

Why not use sam/bam files?

Good idea, SAM is much more widely used. Can you suggest a particular way of doing the task with SAM? I am still not sure whether I should do a local alignment or a global whole-genome one. Much obliged!

I would suggest taking a look at this approach: https://github.com/lh3/CHM-eval/tree/master/dip-call

Thanks a lot, great utility. But I am still looking for large SV discovery, and what you shared reports only small variants (SNPs and indels).

I have a script that converts PAF to delta. I have not tested variant calling on the resulting delta file, though. https://github.com/gorliver/paf2delta
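
If the conversion step is the sticking point, the PAF can also be inspected directly. Below is a minimal sketch, not a tested SV caller: it reads the 12 mandatory PAF columns, flags strand-flipped blocks as inversion candidates, and flags unequal gaps between neighbouring blocks as deletion/insertion candidates relative to the target. The function names, the thresholds (`min_block`, `min_mapq`, `min_sv`) and the `INV`/`DEL`/`INS` labels are my own assumptions, and duplications and translocations are not handled.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class PafRecord(NamedTuple):
    """The 12 mandatory PAF columns (coordinates are 0-based, end-exclusive)."""
    qname: str
    qlen: int
    qstart: int
    qend: int
    strand: str
    tname: str
    tlen: int
    tstart: int
    tend: int
    nmatch: int
    alen: int
    mapq: int


def parse_paf(lines: Iterable[str]) -> list:
    """Parse PAF lines, ignoring optional tags and malformed rows."""
    records = []
    for line in lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue
        records.append(PafRecord(
            f[0], int(f[1]), int(f[2]), int(f[3]), f[4],
            f[5], int(f[6]), int(f[7]), int(f[8]),
            int(f[9]), int(f[10]), int(f[11]),
        ))
    return records


def candidate_svs(records, min_block=10_000, min_mapq=20, min_sv=1_000):
    """Hypothetical heuristic: returns (kind, tname, tstart, tend, size) tuples."""
    by_pair = defaultdict(list)
    for r in records:
        if r.alen >= min_block and r.mapq >= min_mapq:
            by_pair[(r.qname, r.tname)].append(r)

    calls = []
    for (_qname, tname), blocks in by_pair.items():
        blocks.sort(key=lambda r: r.tstart)
        # The strand carrying most aligned bases is treated as the main orientation.
        plus = sum(b.alen for b in blocks if b.strand == "+")
        minus = sum(b.alen for b in blocks if b.strand == "-")
        main = "+" if plus >= minus else "-"
        for b in blocks:
            if b.strand != main:
                calls.append(("INV", tname, b.tstart, b.tend, b.tend - b.tstart))
        # Compare the gap on the target with the gap on the query
        # between neighbouring blocks of the same orientation.
        for a, b in zip(blocks, blocks[1:]):
            if a.strand != b.strand:
                continue
            tgap = b.tstart - a.tend
            qgap = (b.qstart - a.qend) if a.strand == "+" else (a.qstart - b.qend)
            diff = tgap - qgap
            if diff >= min_sv:       # target has extra sequence: deletion in query
                calls.append(("DEL", tname, a.tend, b.tstart, diff))
            elif -diff >= min_sv:    # query has extra sequence: insertion in query
                calls.append(("INS", tname, a.tend, b.tstart, -diff))
    return calls
```

Usage would be something like `candidate_svs(parse_paf(open("aln.paf")))`, where `aln.paf` is a placeholder for your own minimap2 output. The candidates still need checking by eye before being counted as real SVs.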

18 months ago

Hi,

I suggest NucDiff or SyRI, which can detect long structural variants from an assembly-to-assembly alignment. However, I didn't get good results from them: I compared two ~100 Mb chromosomes and didn't even get an alignment result following the suggested pipeline! SyRI was fast enough on the A. thaliana example data, though. You can try them and maybe tell me about your runtime, etc.

Best,

Shangzhe

Hi. I am curious to know what were the issues with these methods. It would be great if you could please share why the results were not good.

Sorry for the delay. They didn't perform well on large genomes such as human; they take a lot of resources.

18 months ago
colindaven ★ 3.0k

Dotplots might be useful if you have long and collinear contigs (Nanopore, PacBio). If Illumina, well, forget it. One pretty good dotplot implementation is in the GUI package UGENE.

It's not an easy approach though. I believe assembly to assembly genome multimappings are a somewhat poor and unloved area of bioinformatics.
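
If you already have a PAF from minimap2, a rough dotplot can also be drawn straight from its coordinates. This is only a sketch under my own assumptions: the function names and the `min_len` cutoff are made up for illustration, it assumes a single query/target pair (e.g. one chromosome against its counterpart), and the plotting part needs matplotlib installed.

```python
def dotplot_segments(paf_lines, min_len=10_000):
    """Turn PAF rows into ((x0, y0), (x1, y1)) line segments plus strand.

    x is the target coordinate and y the query coordinate. Reverse-strand
    blocks run from high to low query positions, so inversions show up as
    segments with a negative slope.
    """
    segments = []
    for line in paf_lines:
        f = line.rstrip("\n").split("\t")
        if len(f) < 12:
            continue  # skip malformed rows
        qstart, qend, strand = int(f[2]), int(f[3]), f[4]
        tstart, tend = int(f[7]), int(f[8])
        if int(f[10]) < min_len:  # column 11: alignment block length
            continue
        if strand == "+":
            seg = ((tstart, qstart), (tend, qend))
        else:
            seg = ((tstart, qend), (tend, qstart))
        segments.append((seg, strand))
    return segments


def plot_segments(segments, out_png="dotplot.png"):
    """Draw the segments; forward blocks in blue, reverse blocks in red."""
    import matplotlib
    matplotlib.use("Agg")  # render to a file without needing a display
    import matplotlib.pyplot as plt

    fig, ax = plt.subplots(figsize=(6, 6))
    for ((x0, y0), (x1, y1)), strand in segments:
        color = "tab:blue" if strand == "+" else "tab:red"
        ax.plot([x0, x1], [y0, y1], color=color)
    ax.set_xlabel("target position (bp)")
    ax.set_ylabel("query position (bp)")
    fig.savefig(out_png, dpi=150)
```

Breaks in the diagonal then hint at insertions or deletions, and red anti-diagonal stretches at inversions, but all of this is still interpretation by eye.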
