Question

Construction of a pangenome using multiple genome sequences and read data

3 months ago

shcho • 0

Hello, I'm building a pangenome from the genome sequences and read data of multiple species belonging to the same genus, and I have a question about the process.

Which of the two methods below is more accurate when performing the 'vg construct' step? Also, are there any websites or papers where I can verify the basis for this?

Method 1

Map the read data to each reference sequence (fasta1, fasta2, fasta3, ..., fasta10).
Call genetic variants from the mapping results (using GATK or DeepVariant).
Generate VCF files (vcf1, vcf2, vcf3, ..., vcf10), one for each reference sequence (fasta1, fasta2, fasta3, ..., fasta10).
Perform 'vg construct' using the fasta1, fasta2, fasta3, ..., fasta10 and vcf1, vcf2, vcf3, ..., vcf10 files.

Method 2

Merge the reference sequences (fasta1, fasta2, fasta3, ..., fasta10) into a single merged reference sequence (fasta11).
Map the read data to the merged reference sequence (fasta11).
Create a VCF file (vcf11) based on the merged reference sequence (fasta11).
Perform 'vg construct' using the fasta11 and vcf11 files.

Thank you very much!

vg • 700 views

updated 3 months ago by colindaven 8.1k • written 3 months ago by shcho • 0

score 2 · Answer 1 · 2025-08-12

I would start by using Minigraph-Cactus to create a pangenome from the genome FASTA files alone. Alternatively, use PGGB. Either will already give you a set of variants in VCF format.

Links are here: https://github.com/colindaven/awesome-pangenomes
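As a rough sketch of what the Minigraph-Cactus input looks like: it takes a "seqFile", a plain text file with one genome name and one FASTA path per line. The Python below only writes that text and assembles a command line; it does not run anything. The sample names (species1 ... species10), the .fa extension, the job store path ./js and the output names are all placeholders, and the cactus-pangenome options are as I remember them from the Cactus pangenome documentation, so please check them against `cactus-pangenome --help` for your installed version.

```python
import shlex


def seqfile_text(genomes):
    """Return seqFile content: one 'name<TAB>path' line per genome."""
    lines = []
    for name, fasta in genomes.items():
        # Names are whitespace-delimited in the seqFile, so they cannot contain spaces.
        if any(ch.isspace() for ch in name):
            raise ValueError(f"genome name contains whitespace: {name!r}")
        lines.append(f"{name}\t{fasta}")
    return "\n".join(lines) + "\n"


def cactus_pangenome_command(jobstore, seqfile, out_dir, out_name, reference):
    """Assemble (but do not run) a cactus-pangenome call.

    Assumption: option names follow the Cactus pangenome docs as I recall them.
    """
    return [
        "cactus-pangenome", jobstore, seqfile,
        "--outDir", out_dir,
        "--outName", out_name,
        "--reference", reference,  # genome used as the coordinate backbone
        "--vcf", "--gfa", "--gbz", "--giraffe",
    ]


# Hypothetical names for the ten assemblies from the question.
genomes = {f"species{i}": f"fasta{i}.fa" for i in range(1, 11)}

print(seqfile_text(genomes), end="")
print(shlex.join(cactus_pangenome_command(
    "./js", "seqfile.txt", "pangenome_out", "genus_pg", "species1")))
```

You would save the first part of the output as seqfile.txt and run the printed command yourself once you have checked the options. Whichever genome you pass as the reference is the one the VCF coordinates will be based on.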

If you want to use short or long reads, you can align them to your GFA pangenome using vg giraffe or GraphAligner, respectively. Don't use minigraph alone to build the graph, as it creates a less compatible rGFA pangenome rather than a GFA one.

Then people commonly use vg surject to project the alignments onto a reference sequence and call variants with linear SNP callers such as GATK or DeepVariant. Why? Because there are very few pangenome SNP callers, and the ones that exist are feature-poor and very slow.
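To make the short-read route concrete, here is a minimal sketch that assembles the two command lines (vg giraffe, then vg surject) without executing them. All file names are placeholders, including the index names genus_pg.gbz, genus_pg.min and genus_pg.dist. I am also assuming a reasonably recent vg release in which vg surject accepts a GBZ file via -x; check `vg giraffe --help` and `vg surject --help` for your version before relying on the flags.

```python
import shlex


def giraffe_command(gbz, minimizer, dist, fastq1, fastq2=None, threads=8):
    """Assemble a vg giraffe call; GAM alignments go to stdout."""
    cmd = ["vg", "giraffe", "-Z", gbz, "-m", minimizer, "-d", dist, "-f", fastq1]
    if fastq2:
        cmd += ["-f", fastq2]  # second mate of a paired-end run
    cmd += ["-t", str(threads)]
    return cmd


def surject_command(graph, gam, threads=8):
    """Assemble a vg surject call; -b writes BAM to stdout."""
    return ["vg", "surject", "-x", graph, "-b", "-t", str(threads), gam]


def shell_line(cmd, out_path):
    """Render a command plus a stdout redirect as one shell line."""
    return f"{shlex.join(cmd)} > {shlex.quote(out_path)}"


# Placeholder file names for one sample.
print(shell_line(
    giraffe_command("genus_pg.gbz", "genus_pg.min", "genus_pg.dist",
                    "sample1_R1.fq.gz", "sample1_R2.fq.gz"),
    "sample1.gam"))
print(shell_line(surject_command("genus_pg.gbz", "sample1.gam"), "sample1.bam"))
```

The resulting BAM would then typically be sorted and indexed before being passed to GATK or DeepVariant, with the same reference genome that the alignments were surjected to.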