We've been working with HiSAT2
and HiSAT2-build
for mapping NGS data to a reference.
In the next step, we wish to create a new indexed reference from VCF data we've generated.
My PI says to use...
hisat2-build --snp ./MySNPs.vcf.gz --haplotype <something> Grch38.fa OutfilesIndexed
... to generate ...
OutfilesIndexed.1.ht2
OutfilesIndexed.2.ht2
etc.
The question is, based on what I see in the manual, I'm not sure I can use VCF format directly ...
--snp
Provide a list of SNPs (in the HISAT2's own format) as follows (five columns).
SNP ID <tab> snp type (single, deletion, or insertion) <tab> chromosome name <tab> zero-offset based genomic position of a SNP <tab> alternative base (single), the length of SNP (deletion), or insertion sequence (insertion)
For example, rs58784443 single 13 18447947 T
Use hisat2_extract_snps_haplotypes_UCSC.py
...although it DOES create files when I pipe a VCF in directly...but I don't know if they are reliable.
hisat2-build --snp ./MySNPs.vcf.gz Grch38.fa IDXXX
-rw-r--r-- 1 user grpname 8192 Nov 28 11:23 IDXXX.1.ht2
-rw-r--r-- 1 user grpname 0 Nov 28 11:23 IDXXX.2.ht2
-rw-r--r-- 1 user grpname 11294 Nov 28 11:23 IDXXX.3.ht2
-rw-r--r-- 1 user grpname 736462267 Nov 28 11:23 IDXXX.4.ht2
-rw-r--r-- 1 user grpname 0 Nov 28 11:23 IDXXX.7.ht2
-rw-r--r-- 1 user grpname 0 Nov 28 11:23 IDXXX.8.ht2
My understanding from here, is that I should use hisat2_extract_snps_haplotypes_VCF.py
to convert the VCF to the HiSAT2 format above. But when I run the VCF file through it I get nothing...
hisat2_extract_snps_haplotypes_VCF.py Grch38.fa ./MySNPs.vcf.gz IDXXX
-rw-r--r-- 1 user grpname 0 Nov 28 14:53 IDXXX.haplotype
-rw-r--r-- 1 user grpname 0 Nov 28 14:53 IDXXX.snp