Question

How to create structural variants ground truth for alignment of two long-read genome assemblies?

11 months ago

Thomas ▴ 20

Hello everyone,

I'm a student in the area of genomics.

I have two genome assemblies built from long reads (both from haploid genomes). One is the reference for the organism (K. phaffii, a yeast) and represents the wild type. The other (the query) is an assembly of a K. phaffii strain that was derived from the wild type (the reference) and contains a few genomic modifications. I want to use this data to create a ground truth set of structural variants (SVs), i.e. a file containing the "true" structural variants that are present in the query.

I tried this by running two SV callers that accept assemblies as input, SVIM-asm and Assemblytics. I also used NucDiff and the MUMmer dnadiff function to get information about the differences between the two assemblies. My idea was that the consensus of those four tools would give a confident guess about the "real" SVs in the query.

However, these four tools heavily disagree and the consensus between them is very limited. I then tried to visualize the alignment between these two assemblies with tools such as IGV and D-Genies, but I was unable to manually find SVs from that comparison.

Therefore my question: how would you approach creating the ground truth in my situation, given that you have only these two assemblies (reference and query) and cannot perform additional laboratory experiments?

I would be very thankful for recommendations,

Kind regards,

Thomas

yeast assembly structural-variation SV-callers • 871 views

score 1 · Answer 1 · 2023-09-23

Hi,

D-Genies uses minimap2 to align the two genomes, and minimap2 chains local alignments to produce a global one. If the SVs are small or medium-sized insertions or deletions, they may be lost in the chains. You can change this behavior with the -g parameter: https://lh3.github.io/minimap2/minimap2.html
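If the indels end up inside the chains, one way to recover them is to read them back out of the alignment itself. The following is only a minimal sketch, assuming you ran minimap2 with -c so that each PAF line carries a cg:Z: CIGAR tag. The function names and the 50 bp size cutoff are my own choices for illustration, not part of any tool.

```python
import re

# One CIGAR operation: a length followed by an operation letter.
CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def large_indels(cigar, ref_start, min_len=50):
    """Yield (type, ref_pos, length) for each insertion/deletion >= min_len.

    ref_start is the 0-based target start of the alignment (PAF column 8).
    min_len=50 is an assumed cutoff, adjust it to your definition of an SV.
    """
    ref_pos = ref_start
    for length, op in CIGAR_OP.findall(cigar):
        length = int(length)
        if op in "ID" and length >= min_len:
            yield ("INS" if op == "I" else "DEL", ref_pos, length)
        # Only these operations consume reference bases.
        if op in "MDN=X":
            ref_pos += length


def scan_paf(path, min_len=50):
    """Yield (target, ref_pos, type, length) for every large indel in a PAF file."""
    with open(path) as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Optional tags start at column 13; cg:Z: is written when -c is used.
            cigar = next((f[5:] for f in fields[12:] if f.startswith("cg:Z:")), None)
            if cigar is None:
                continue
            target, t_start = fields[5], int(fields[7])
            for sv_type, pos, length in large_indels(cigar, t_start, min_len):
                yield target, pos, sv_type, length
```

For example, `list(large_indels("100M60D20M5I30M70I10M", 1000))` reports the 60 bp deletion and the 70 bp insertion and skips the 5 bp insertion. This only sees indels inside a single alignment record; events that split the alignment into separate records (inversions, translocations, very large indels) need a different approach.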

Another option is to align the long reads of your second assembly, if you have access to them, to the reference and call SVs from that alignment. There are several SV callers, including SVIM and pbsv, which we have used successfully in some of our projects: https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02551-4
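When you then compare read-based calls with your assembly-based calls, keep in mind that different tools rarely report identical coordinates for the same event, so exact matching will understate the agreement. Here is a rough sketch of tolerant matching, assuming you have already parsed each tool's output into simple records. The SV tuple, the function names, and the thresholds (500 bp, size ratio 0.7, support from 2 tools) are hypothetical values for illustration and should be tuned to your data.

```python
from collections import namedtuple

# Minimal call record; fill it from each tool's VCF/BED/GFF output.
SV = namedtuple("SV", "chrom pos svtype length")


def same_event(a, b, max_dist=500, min_size_ratio=0.7):
    """Treat two calls as one event if type matches, positions are close
    and sizes are similar. Both thresholds are assumptions."""
    if a.chrom != b.chrom or a.svtype != b.svtype:
        return False
    if abs(a.pos - b.pos) > max_dist:
        return False
    small, large = sorted((abs(a.length), abs(b.length)))
    return large > 0 and small / large >= min_size_ratio


def supported_calls(callsets, min_support=2, **match_kwargs):
    """callsets maps tool name -> list of SV.

    Returns (representative SV, set of supporting tools) for every event
    reported by at least min_support tools. Greedy: a call joins the first
    cluster whose representative it matches.
    """
    clusters = []  # each entry: [representative SV, set of tool names]
    for tool, calls in callsets.items():
        for call in calls:
            for cluster in clusters:
                if same_event(cluster[0], call, **match_kwargs):
                    cluster[1].add(tool)
                    break
            else:
                clusters.append([call, {tool}])
    return [(rep, tools) for rep, tools in clusters if len(tools) >= min_support]


callsets = {
    "svim": [SV("chr1", 1000, "DEL", 300), SV("chr2", 5000, "INS", 100)],
    "pbsv": [SV("chr1", 1020, "DEL", 310)],
    "svim_asm": [SV("chr1", 990, "DEL", 295), SV("chr2", 9000, "INS", 100)],
}
confident = supported_calls(callsets)
```

In this toy input only the chr1 deletion is supported by more than one tool. Insertions are a case where position tolerance matters most, since callers often place them differently inside repeats.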

Cheers,

Christophe