Question

Passing VCF file directly into ```HiSAT2-build --snp --haplotype```

0

Entering edit mode

6.4 years ago

rightmirem ▴ 70

We've been working with HiSAT2and HiSAT2-build for mapping NGS data to a reference.

In the next step, we wish to create a new indexed reference from VCF data we've generated.

My PI says to use...

hisat2-build --snp ./MySNPs.vcf.gz  --haplotype <something> Grch38.fa OutfilesIndexed

... to generate ...

OutfilesIndexed.1.ht2
OutfilesIndexed.2.ht2
etc.

The question is, based on what I see in the manual, I'm not sure I can use VCF format directly ...

--snp
Provide a list of SNPs (in the HISAT2's own format) as follows (five columns).

SNP ID <tab> snp type (single, deletion, or insertion) <tab> chromosome name <tab> zero-offset based genomic position of a SNP <tab> alternative base (single), the length of SNP (deletion), or insertion sequence (insertion)

For example, rs58784443 single 13 18447947 T

Use hisat2_extract_snps_haplotypes_UCSC.py

...although it DOES create files when I pipe a VCF in directly...but I don't know if they are reliable.

hisat2-build --snp ./MySNPs.vcf.gz   Grch38.fa  IDXXX

-rw-r--r-- 1 user grpname      8192 Nov 28 11:23 IDXXX.1.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.2.ht2
-rw-r--r-- 1 user grpname     11294 Nov 28 11:23 IDXXX.3.ht2
-rw-r--r-- 1 user grpname 736462267 Nov 28 11:23 IDXXX.4.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.7.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.8.ht2

My understanding from here, is that I should use hisat2_extract_snps_haplotypes_VCF.py to convert the VCF to the HiSAT2 format above. But when I run the VCF file through it I get nothing...

hisat2_extract_snps_haplotypes_VCF.py Grch38.fa  ./MySNPs.vcf.gz  IDXXX

-rw-r--r-- 1 user grpname          0 Nov 28 14:53 IDXXX.haplotype
-rw-r--r-- 1 user grpname          0 Nov 28 14:53 IDXXX.snp

SNP snp next-gen alignment hisat2 • 3.3k views

ADD COMMENT • link 6.4 years ago by rightmirem ▴ 70

score 2 · Accepted Answer · 2017-11-30

So, found my own answer. I'll leave this here in case the question comes up again.

The --snpswitch must be used (only) with the hisat2 format (as described above). It WILL actually run and produce output if you feed in a VCF file directly - but my understanding is this is purely accidental :D

The hisat2_extract_snps_haplotypes_VCF.py is supposed to convert a VCF file to this format, but assumes you're using annotated data (known SNPs). To use unknown SNPs you add the switch --non-rs ... which gave me output.

I'm really not sure how to check that the output was not garbage however :D

I also don't have a confirmation about what the --snpand --haplotype switches are REALLLLY doing. I assume they are making a Hierarchical Graph FM index (HGFM) meaning the SNPs will be aligned against as well as the REF. But I haven't really been able to get confirmation of that.

I may post that as a second question.