Hi everyone, I did create my first graph using vcf from HGDP as follow:
vg construct -r ref.fa -v sub-chrXX.vcf.gz > pXX.vg
vg ids -j $(for i in $(seq 1 22); do echo p${i}.vg; done)
vg index -x all.xg $(for i in $(seq 1 22); do echo p${i}.vg; done)
vg prune -t 8 -k 45 -r p${SLURM_ARRAY_TASK_ID}.vg > pruned${SLURM_ARRAY_TASK_ID}.vg
vg index -t 8 --temp-dir temp -g wg.gcsa pruned{1..22}.vg -p -Z 32768 2>&1
Those steps worked just fine. Then I tried to map reads using the following command:
vg map -x all.xg -g wg.gcsa -f HGDP00607_1.fastq.gz -f HGDP00607_2.fastq.gz > HGDP00607.gam
And I get the following error
error [xg]: multiple hits for $chrUn_JTFH01000458v1_decoy$
However when I map only one set of reads (only forward for example) the mapping step goes just fine
To create the reads I downloaded all the forward and reverse reads available on HGDP for this individual and concatenate them as follow:
cat *1.fastq.gz > concatenated_1.fastq.gz
I've been looking around and cannot find this error reported anywhere. Is there something wrong in my pipeline or is that an issue of the software ?
If I cannot find how to fix it, I was thinking to map forward and reverse reads separately and then merge the 2 gam file together (?). Or maybe should I merge both fastq files into one and map and map the merged file ?
Thank you for your help !
I believe this would happen because you have a duplicate sequence in your FASTA (namely the contig that the error indicates). It's kind of confusing that this error comes during mapping though. Ideally the error would happen early in the indexing process.
In any case, you should be able to resolve it by removing all but one duplicate during the index construction step.
You may also be interested in using
vg autoindex --workflow map
to automate the indexing pipeline, rather than doing all of the steps manually like this.Hi Jordan thank you for your answer, I am still checking those fastq files but they are quite bigs so can't tell you yet what's wrong with them but I literaly only download the fastq from there and concatenate them (12 fastq forwards and 12 reverse). Will let you know if I find something there.
To remove duplicate while indexing should I use the option -o ? And would that be for both xg and gcsa ?
Otherwise I could run
vg autoindex --workflow map pruned{1..22}.vg
but before I wanted to know if using the solution i suggested earlier (merge both GAM after mapping or merge fastq before mapping) would trigger any issue in further steps (surjecting to bam files + calling variants) ? With my ressources at my lab, it took us almost 2 months to run the gcsa (ran for 2 weeks in total but regular HPC outage) so I'd love if I could not re-run that part :/Thanks again
I don't think there's any error in the FASTQ. The duplicate contig would be in the FASTA reference. I don't think there's any built-in functionality to de-duplicate sequences in the reference in
vg
, so you'd have to handle that as a preprocessing step. As a quick check to make sure this is in fact the issue, you could comparegrep ">" ref.fa | wc -l
andgrep ">" ref.fa | sort | uniq | wc -l
.If I'm right about this issue, then unfortunately yes, you would have to re-index. GCSA2 construction tends to be particularly slow on systems with slow disk I/O, so if you can arrange for a machine with faster drives, you might be able to speed it up.
The way to use
autoindex
for this pipeline would be:However, if you have issues with frequent hardware failures, then the manual pipeline might be better.
vg autoindex
isn't restartable if the hardware fails mid-run.Hi Jordan, So turnes out I don't have any issue with the reference fasta sequence, no redundant contigs. The mapping step took a long time but completed without any trouble. I decided to map forward and reverse reads separetly and concatenate GAM using "cat" as suggested here for later steps.
I am now merging, creating pack and calling variant. I am still unsure where that previous error came from but so far it is not causing any more issue. Will keep you updated about this behaviour. Thanks again !!
Hi Jordan M Eizenga ,
I finally run different steps :
1- Surject GAM to BAM
2- Call variant with HaplotypleCaller on the BAM
3- Run vg pack on the gam file
And those were fine. However, I do have an error while I ran
vg call xg.xg -k concat.gam.pack > HGDP00607_gam.vcf
The error message I get is kinda weird as there is multiple lines with the following message:
At the end of the file, I saw a segmentation fault error:
Do you believe that is due to the error I shared with you in the first place ?
Thanks for your help !
That output doesn't give me a lot to work with, but IMO segmentation faults are always bugs, even if there's something wrong with the input. Probably the best thing to do is make a bug report on the VG GitHub. You might need to provide the data to reproduce this crash for us to do anything with it.
Thank you Jordan for your help. Did post the all thing on their github. Here is the post, https://github.com/vgteam/vg/issues/4071
Thanks !