Passing VCF file directly into ```HiSAT2-build --snp --haplotype```
1
0
Entering edit mode
6.4 years ago
rightmirem ▴ 70

We've been working with HiSAT2and HiSAT2-build for mapping NGS data to a reference.

In the next step, we wish to create a new indexed reference from VCF data we've generated.

My PI says to use...

hisat2-build --snp ./MySNPs.vcf.gz  --haplotype <something> Grch38.fa OutfilesIndexed

... to generate ...

OutfilesIndexed.1.ht2
OutfilesIndexed.2.ht2
etc.

The question is, based on what I see in the manual, I'm not sure I can use VCF format directly ...

--snp
Provide a list of SNPs (in the HISAT2's own format) as follows (five columns).

SNP ID <tab> snp type (single, deletion, or insertion) <tab> chromosome name <tab> zero-offset based genomic position of a SNP <tab> alternative base (single), the length of SNP (deletion), or insertion sequence (insertion)

For example, rs58784443 single 13 18447947 T

Use hisat2_extract_snps_haplotypes_UCSC.py

...although it DOES create files when I pipe a VCF in directly...but I don't know if they are reliable.

hisat2-build --snp ./MySNPs.vcf.gz   Grch38.fa  IDXXX

-rw-r--r-- 1 user grpname      8192 Nov 28 11:23 IDXXX.1.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.2.ht2
-rw-r--r-- 1 user grpname     11294 Nov 28 11:23 IDXXX.3.ht2
-rw-r--r-- 1 user grpname 736462267 Nov 28 11:23 IDXXX.4.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.7.ht2
-rw-r--r-- 1 user grpname         0 Nov 28 11:23 IDXXX.8.ht2

My understanding from here, is that I should use hisat2_extract_snps_haplotypes_VCF.py to convert the VCF to the HiSAT2 format above. But when I run the VCF file through it I get nothing...

hisat2_extract_snps_haplotypes_VCF.py Grch38.fa  ./MySNPs.vcf.gz  IDXXX

-rw-r--r-- 1 user grpname          0 Nov 28 14:53 IDXXX.haplotype
-rw-r--r-- 1 user grpname          0 Nov 28 14:53 IDXXX.snp
SNP snp next-gen alignment hisat2 • 3.3k views
ADD COMMENT
2
Entering edit mode
6.4 years ago
rightmirem ▴ 70

So, found my own answer. I'll leave this here in case the question comes up again.

The --snpswitch must be used (only) with the hisat2 format (as described above). It WILL actually run and produce output if you feed in a VCF file directly - but my understanding is this is purely accidental :D

The hisat2_extract_snps_haplotypes_VCF.py is supposed to convert a VCF file to this format, but assumes you're using annotated data (known SNPs). To use unknown SNPs you add the switch --non-rs ... which gave me output.

I'm really not sure how to check that the output was not garbage however :D

I also don't have a confirmation about what the --snpand --haplotype switches are REALLLLY doing. I assume they are making a Hierarchical Graph FM index (HGFM) meaning the SNPs will be aligned against as well as the REF. But I haven't really been able to get confirmation of that.

I may post that as a second question.

ADD COMMENT

Login before adding your answer.

Traffic: 2004 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6