I am a recent graduate of Bioinformatics. As a research intern, I have been tasked with creating a pangenome and then call variants using short read data. I came across vg and vg giraffe and found it useful but I have a few doubts regarding the workflow.
Basically, I find it confusing that there is one pipeline using manual indexing, and another using autoindexing. That is, I'm unsure whether to follow the classical manual indexing+gbwt+giraffe+call or follow autoindex+giraffe+call. I gave both a try, but need clarity or confirmation on which to follow. pipeline 1:
delly call -o samples_delly.bcf -g teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa ANP1.sorted.bam ANP4.sorted.bam PB27.sorted.bam TNT17.sorted.bam
bcftools view samples_delly.bcf > samples_delly.vcf
vg construct -r teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa -v samples_delly.vcf.gz -p -a -S > pangenome_graph.vg
vg autoindex --workflow giraffe --prefix pangenome_graph_ai --ref-fasta teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa --vcf samples_delly.vcf.gz --target-mem 12GB --threads 8
vg giraffe -Z pangenome_graph_ai.giraffe.gbz -m pangenome_graph_ai.min -d pangenome_graph_ai.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz > samples_ai.gam -p
vg convert -p -t 8 pangenome_graph_ai.giraffe.gbz > pangenome_graph_gbz_ai.pg
vg validate pangenome_graph_ai.giraffe.gbz -a samples_ai.gam
vg validate pangenome_graph_gbz_ai.pg -a samples_ai.gam
vg augment pangenome_graph_gbz_ai.pg samples_ai.gam -A samples_ai.aug.gam -m 2 -q 10 -v -t 16 > samples_gbz_ai.aug.pg -p
vg snarls -t 8 samples_gbz_ai.aug.pg > samples_gbz_ai.aug.snarls
vg pack -x samples_gbz_ai.aug.pg -g samples_ai.aug.gam -Q 5 -t 16 -o samples_gbz_ai.aug.pack
vg call samples_gbz_ai.aug.pg -r samples_gbz_ai.aug.snarls -k samples_gbz_ai.aug.pack -t 8 > samples_variants_gbz_ai.aug.vcf --progress
Output: snippet
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
Pseudomolecule_01   16422   >2459171>2459175    AGTGTTTGAGCGGTGTGTTCGATGCTGCACCCGGGGCAACTGGACCACACAAATCACACACACCGACCG   A   9.54243 lowad   AT=>2459171>2459172>2459173>2459174>2459175,>2459171>2459175;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.057016,-0.057016,-0.057016:0:-2.197225:0.131285:0
Pseudomolecule_01   21134   >2459320>2459322    TTCTCTCTCTCTCTCTCTCTCTCTCTCTCTCT    T   9.54243 lowad   AT=>2459320>2459321>2459322,>2459320>2459322;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.029361,-0.029361,-0.029361:0:-2.197225:0.067606:0
Pseudomolecule_01   49789   >2460217>2460220    GTGTTGCAAACTTAAGCTTAACTTGGACTTCAATTT    G   9.54243 lowad   AT=>2460217>2460218>2460219>2460220,>2460217>2460220;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.094744,-0.094744,-0.094744:0:-2.197225:0.218156:0
Pseudomolecule_01   62342   >2460611>2460640    TCATCATGTTGGAAAATTGGTCTTGAGAAGTTGAGTTTGATTCAATTGTGACCAAGTTTAAGTGGCATACACTTGATTAATATGGAATCAAGTAGTGTTATTATTATCTACATTTTATGTAGATTGTTGTAGTGTATTTATATTCTCAAAACCAATAATATGAACAAATTAATGCACAAATTTGGTCACTTAAGTATTAAATGGGAGATATATATGACATCATTCCATGTAGAAAAAAAAATGTATACAACAGCCAATAATGGGCACATTAATATTCAATGTGTATGTTATAAACATACGGATGCCCATGTGTCTGGAAAAGAGTCCAGATACGTCATAAGATGTCTAGATTCGATGAACACCTAAATGGCAGTTTCTTTTTCCGTTTAATGGCATTCATGAGGGCCATTCATGACTGGGCCATTAGGGCCCAGTAACCCCCATAAGGAAAGGCCTATATAAGGCCTTCTCCTCTCATTTGCATTTCACACCGACATCGAAGTTTCAAAGTTTCATAACATTCTTTTCTCTAAAATGTTCTCTCATGGCAGCATAATCTGCAAAGTTCTGTGTTTCTGAAATTCGTTGGGATTTGAGATTACAGTGCTCGTATCTCAAATCGAGAGTTGTTGTATCTTGAGAGACAAATTGCCGTTAAACCGTGAGCACTAGTACGGGACAATATTGTCTTAAGGGTAGAGTGTTTCACTCGCCTCAACTCATTAACGTCGGTTAAGTTTGTGCTGACTGTACTCACTGTGTCGGCTAAGTTTGTGCTGACTGTACTTATTGTTGGCTAAGTTTGTGCTAACTGTACTTATTGTATTGGCTAAATTGTGCTAACAATTTGGTGGATACGCCAAGTGCTCAAAATTATTGTTGTTTATCCAACAC  T   9.54243 lowad   AT=>2460611>2460612>2460613>2460614>2460615>2460616>2460617>2460618>2460619>2460620>2460621>2460622>2460623>2460624>2460625>2460626>2460627>2460628>2460629>2460630>2460631>2460632>2460633>2460634>2460635>2460636>2460637>2460638>2460639>2460640,>2460611>2460640;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-0.090614,-0.090614,-0.090614:0:-2.197225:0.208646:0
Pseudomolecule_01   91276   >2461517>2461520    CACCCAGACGCACGCATAACAAATCACACGCGCGCGCACAGACACACACAGATGCAGCAACTA C   9.54243 lowad   AT=>2461517>2461518>2461519>2461520,>2461517>2461520;DP=0   GT:DP:AD:GL:GQ:GP:XD:MAD    1/1:0:0,0:-1.130604,-1.130604,-1.130604:0:-2.197225:2.603306:0
(I tried using delly to check if variants are indeed present in my file, because the same workflow with the original vcf gave empty file with no results like case 2)
pipeline 2:
samtools faidx teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa
vg construct -r teak_tectona_grandis_26Jun2018_7GlFM_fmt_tp.fa -v maf135.vcf.gz -a -S -p > graph.vg
vg convert -f graph.vg > graph.gfa
vg index -x graph.xg -g graph.gcsa -t 8 -p graph.vg
vg gbwt -x graph.xg -E -o ref_paths.gbwt -p
vg gbwt -x graph.xg ref_paths.gbwt --gbz-format -g graph.gbz -p
vg index graph.xg -j graph.dist 
vg minimizer -t 16 -d graph.dist graph.gbz -o graph.min
vg giraffe -Z graph.gbz -m graph.min -d graph.dist -f ANP1_R1_paired_trimmed.fq.gz -f ANP1_R2_paired_trimmed.fq.gz -p > ANP1_sample_giraffe.gam
vg pack -x graph.gbz -t 8 -Q 5 -g ANP1_sample_giraffe.gam -o ANP1_sample.pack
vg snarls -t 8 graph.gbz > ANP1_sample.snarls
vg call graph.gbz -r ANP1_sample.snarls -k ANP1_sample.pack -z -t 8 > ANP1_variants_vg_call.vcf 
Output: empty file
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  SAMPLE
I am also attaching samples of my vcf and fasta VCF:
##fileformat=VCFv4.2
##fileDate=20231208
##source=PLINKv1.90
##contig=<ID=Pseudomolecule_01,length=20658029>
##contig=<ID=Pseudomolecule_02,length=19432826>
##contig=<ID=Pseudomolecule_03,length=18399125>
##INFO=<ID=PR,Number=0,Type=Flag,Description="Provisional reference allele, may not be based on real reference genome">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  1_1 2_1 3_1 4_1 5_1 6_1 7_1 8_1 9_1 10_1    11_1    12_1    13_1    14_1    15_1    16_1    17_1    19_1    20_1    21_1    22_1    23_1    24_1    25_1    26_1    27_1    28_1    29_1    30_1    31_1    32_1    33_1    34_1    35_1    38_1    40_1    41_1    42_1    43_1    44_1    45_1    46_1    47_1    49_1    50_1    51_1    52_1    54_1    55_1    56_1    57_1    58_1    59_1    60_1    61_1    62_1    63_1    64_1    65_1    66_1    67_1    68_1    69_1    70_1    71_1    72_1    74_1    75_1    76_1    77_1    78_1    79_1    81_1    82_1    83_1    84_1    85_1    86_1    87_1    88_1    90_1    91_1    92_1    93_1    94_1    95_1    97_1    98_1    99_1    100_1   101_1   102_1   103_1   104_1   105_1   107_1   110_1   111_1   112_1   113_1   114_1   115_1   117_1   118_1   119_1   120_1   121_1   122_1   123_1   124_1   125_1   127_1   128_1   129_1   130_1   131_1   132_1   133_1   134_1   135_1   136_1   137_1   138_1   139_1   140_1   141_1   142_1   143_1   144_1   145_1   146_1   147_1   148_1   149_1   150_1
Pseudomolecule_01   6175    Pseudomolecule_01_6175  C   T   0   .   PR  GT  0/0 0/0 0/0 0/1 0/0 1/1 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 1/1 0/0 0/0 0/0 0/0 1/1 1/1 0/0 0/0 0/1 1/1 0/0 0/1 1/1 0/0 1/1 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/1 1/1 ./. 0/0 0/0 0/0 0/0 0/1 1/1 0/1 1/1 ./. 1/1 1/1 1/1 1/1 0/0 0/0 0/0 1/1 0/0 0/0 ./. 1/1 0/0 0/0 0/0 0/0 0/1 1/1 0/0 ./. ./. 0/1 1/1 0/0 0/0 0/1 ./. 0/0 1/1 0/0 0/1 0/1 0/0 1/1 0/0 0/0 0/0 0/0 0/0 1/1 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 1/1 1/1 0/0 0/0 0/1 0/0 0/0
Pseudomolecule_01   6205    Pseudomolecule_01_6205  G   A   0   .   PR  GT  0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 ./. 0/0 0/1 0/0 0/0 0/0 0/1 0/0 0/1 ./. ./. 0/0 0/0 0/1 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1
Pseudomolecule_01   6229    Pseudomolecule_01_6229  C   T   0   .   PR  GT  0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 1/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1 0/0 0/1 0/0 0/1 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/0 0/0 ./. 0/1 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/0 0/1 0/1 0/0 0/0 0/1 0/0 0/1 ./. ./. 0/0 0/0 0/1 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 ./. 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/0 0/1 0/0 0/0 0/1
FASTA:
>Pseudomolecule_01
gacaaaggggaagtttgagatattttgattacacaagaacaatgaaacattaactgcatgatattagctacggcaataagaaataataaaggactgactg
ttgaagaacttcctggtcaacttgaggtattttatatgttaaacaaagtcaactttttctgttaactcagattgaacaaacacacacatgaaccatccaa
acgttggcttattaatctttggattattgcgtttacaattcttgcacttcttttcttttcttcaaaaactcattcaaaactgacatagcttcagtgaact
gagcagaatcccaagatgcctgcttattgcgctcatagtgattttcaccatcagagaggactgcagaatgtgatgtatactgactccctcgactatcctc
tggatcaattatatcagttggtaaagcccactccatctttttcttatctactaacttctgtcgcaaattcagctttttagcctcaacaatatcaccttct
tttatcagttccaattgagaaacacaattctccacctaaatcaaacacatgataTtttagctccaataacacactaccagtatgatgagtagaagTccaa
aatgagcacaataagcttatacaaactcacctctgccctgcttgctcgaaactgaaaacaatAaaaacacgttttgttcagaaggctattcagtgtatta
acaagcaatggattgtatgctggCgaaacaaggtcaatgtgaccacaatgacctgtgcaatgataAgcacgttggccacatgattgacatctgcatataa
caagatAagcatgaatacacattaatacactgcatgagatttttatccagattatatatacgtacttggATgcatcatcaaggggtcccatagccgaatc
ataaagtccgccagggattggcttgtccagtatgtcaagcaaattcggattcgtgatcttcacaacgctctggcgcctcacttctacatccgtcatgaac
ccaaaatgcacggcctccacaacctcagtcgcctaaataaaaagtgccaaacatcagaGaatagctagcaaaaaaatttattaccgcatgaaaagaatac
attaatagctaaatttgtttccaattcataatttagggatttaattccaaaaactcattcaagacaataacaaaataaacaatatcatgatgttttaaag
caacaacagcaacaacaactaaagggaaaaATACAATGAACATATTGTTTTACCATGGTTCACCAGTTGGCCCAAGGTCGGAATGGCCTAAGTATCAATA
ACTTCGAATTGCTTAGAAATAAATGACAAAAATTCAACGAGAAAAATATCAATTTTCTGAAAACTGAAGGCCAATTCGGTGcttccagCttcttgaagag
aagcttggttcccagctggtgcggtttgtaaactatacaaactggggTtttggaaaaccctatcagacaAaggtatatttGagaCagagagggggGGagg
cgcagatgggtcgggcccaccataaattagaacagtcattacctgaaaaaatcagcattataataacaaAtattctctttaaactttcctccaaattaaa
caaaaaaaaaaaaaaaatctttaaactttgcttcattttttattcttgaattacctcaaaatacacaatatatattaaataattttacaaATatatatat
atgacatatattaaaaaatattaaaataaaagatgaaatattatttgtaatagatagatggatccactgtttttcaattgtttcttgtttatttaattta
aaattagatttcacattctcgtcgaggagtgtaagagaattttaagagaattttattgattcctggaaacatatatattatgtttggtggAaggaaaaga
aatgtatattgtaaatattttAgaaaaattgtcttttttctctcaactacagtgatggATATTGTAacttctagCaatttcttataagaatatttggcca
aacagagatatatttaccacaaatttaaatattaagggacgtcaatttaaatacacatttatcgagttattaaatttcaattgtaatatcttGgtgtaca
ttggttgattaTaatatcaaatTaatatatatcccGaaacttaggggtgcacattgacacaagacaacttgTttGagggacgttgtattaattgactctt
gatgaactatGacgacttatttaataccaaatgtTaagGgagaatttcatttgactttatgctatataaaattttAatatgtttggatatgaattgttta
agagaaaatttgtagaaaagtgTaatgTattaggttggAAaaaaaatagaaaatataaatgaaaaatgtaatgTaTTaaaaaaaaaTacacacAcacaca
cacaaattaGaaaacataaatgtatttgtttgtgttatgaagtaagaatttggtgggaggaaaaaaCTaaCaTacatggcagtaatacatatgattttta
gtgagatcaacccttgagaaaaattaaataaaaatagttgaagggAaagaggtccaggaccatctgtgctatattatgccctagtattAgggttatttta
gcaCtgcacattacattctaatacttcattattatgtatataaatacattggtgaaataaaatacagaattaaaacaacataaaataaaatatgttcctt
gaagaagaccagaaatcaactggagaaaatatggattttcttggataaatttctccaactacaattatattatacttagttcccccccccccctcacgtt
tattgcagctcttccagagtttaatatccaatttctcattcattcgctattttttcaatttgttttcatacttgtgatcacaattcatcaacttctttag
agacttttcattaaaaatttggcaaattgtgaagcaagattcatgaaattcacacataatttattatttaaatgaaacgtttgtaaatttgttatttgta
ggaagagaagtgcaggtcattgcaatcatataagaaatatcatacttttatcacgaaaacttttgcaaagatgtcatcaacaggttagagatcaatgtac
ctttcagaagtcagaacaactcatacatgcagaaaacaaagcgagagcactgaaaacgtaattctcgttgagcaaagtgaccttcacaaaagtttaatgg
agtaccgagcataatttgcattttccatctactcaaaacattaacagtagttgcaattacagccgttggaaaaacaatcaccccccgttatctcaagtaa
ttgctggaagcaagatatcaattccacactagaaaccttactaccgctggtaaacaaagaaaccgaaatcgagaatcatatcaaggatgattaatacgaa
aatacacgcactacagaatagagcaactgttatgcctatcacaaggtatgatcacctccgaacagggtggcaccacacccctaccaggtgcagctccacg
gccggcagcctgagcctgccaaagagaaattcaagaatcagaagaatcaaccagataaaagttaattccatgagtaaccttctcaaatgcagtactcaga
caaaacatgttattcagcacaagagcaagaacaacttactctagctcgcatggcaacagcacgtccacgtccaactccaagtgccgaacccttgcccttt
aaggtggaaataaaacggatcagcaatatttgagcaggagaagccaaaatgaataaatgggtagcgacagaatgtaccttgatcctagcttccaagcgct
taaacattggagcattcttaagcatatctggaattaccatgaacctggtagaattgagaaattaataacagtccacataaaaattgatgcaaataagcac
tcccaatagggaatgagtatacacctactgactttgctgcctctaataaagacatgctcaagttgcgagaccttcccatcctgtaataaagttgaccaga
ttagcactccaaaatagataagataatgagctttaccattattcactaaaacctgcaaaaaaaactcaaccaccacattgcatctttttttttttttttt
aaatttaattactagatataattatatagcatacagtctgcatttctgataatcactatttgagtcagaacttaacagtatatttacaGCATCAAATTTC
TCATGCCCTATTACTCATCTTCCTTCATGAATAAAGGCTCTCATAACCACTTCCTTTGCCAACACTAAAGAGTTCTATATATATATTTTTGATTACTTTG
ACTCCTGTGCTTATCTTGATTAAGAGAGTGGGCCGAGTGGCAACACTTGCAAAACTAGAACAGTTTTCAGAAATCAGCATTACAATGCCCATGTTCTATG
edit: have attached outputs of these workflows
I proceeded with the
vg autoindexworkflow as follows:I used a different VCF input from delly tool since the older VCF was not having any SVs. Its snippet is given below:
But the final step of calling variants using vg call is giving vcf files where the variations are same throughout difference samples (ANP1, ANP4 etc). The variant VCFs are having different file size but same variation data?
Why is this so? What I actually need is a variant file with all the SVs in single VCF. Since none of the "merging" of intermediary files works, I though of calling individually and then concat or merge that.