Hello, I'm building a pangenome from the genome sequences and read data of multiple species belonging to the same genus, and I have a question about the process.
Which of the two methods below is more accurate for the 'vg construct' step? Also, are there any websites or papers where I can verify the basis for the answer?
Method 1
- Map the read data to each reference sequence (fasta1, fasta2, fasta3, ..., fasta10).
- Call genetic variants from the mapping results (using GATK or DeepVariant).
- Generate vcf files (vcf1, vcf2, vcf3, ..., vcf10) for each reference sequence (fasta1, fasta2, fasta3, ..., fasta10).
- Perform 'vg construct' using the fasta1, fasta2, fasta3, ..., fasta10 and vcf1, vcf2, vcf3, ..., vcf10 files.
Method 2
- Merge the ten reference sequences (fasta1, fasta2, fasta3, ..., fasta10) into a single sequence to create a merged reference sequence (fasta11).
- Map the read data to the merged reference sequence (fasta11).
- Create a vcf file (vcf11) based on the merged reference sequence (fasta11).
- Perform 'vg construct' using the fasta11 and vcf11 files.
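To make the final step of each method concrete, here is a rough sketch of the commands I have in mind. It only builds and prints the command lines and does not run vg. The file names and the helper function are placeholders of mine, and I am assuming that the -r (FASTA) and -v (VCF) options can each be given more than once and that the VCFs are bgzip-compressed and tabix-indexed.

```python
import shlex


def vg_construct_command(fastas, vcfs):
    """Build (but do not run) a 'vg construct' argument list.

    Hypothetical helper: adds one -r per FASTA and one -v per VCF.
    """
    args = ["vg", "construct"]
    for fasta in fastas:
        args += ["-r", fasta]
    for vcf in vcfs:
        args += ["-v", vcf]
    return args


# Placeholder file names; VCFs assumed to be bgzipped and tabix-indexed.
# Method 1: ten references, each with its own VCF.
method1 = vg_construct_command(
    [f"fasta{i}.fa" for i in range(1, 11)],
    [f"vcf{i}.vcf.gz" for i in range(1, 11)],
)

# Method 2: one merged reference with one VCF.
method2 = vg_construct_command(["fasta11.fa"], ["vcf11.vcf.gz"])

# The graph is written to standard output, so in practice each command
# would be followed by a redirect such as '> graph.vg'.
print(shlex.join(method1))
print(shlex.join(method2))
```

In Method 1 each VCF only describes variants relative to its own FASTA, so I assume the contig names would have to be unique across the ten references.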
Thank you very much!