Getting haplotype specific FASTA files from VCF files
2
1
Entering edit mode
4.4 years ago

I have been trying to get FASTA (with altered nucleotide sequences) sequences from VCF file.

I have tried using FastaAlternateReferenceMaker (from GATK) but it only gives me a single FASTA file.

Any help will be greatly appreciated. Thank you!

SNP snp genome • 4.0k views
0
Entering edit mode

This implies that you somehow phased your variants in two haplotypes?

0
Entering edit mode

The VCF file I have are phased (they have information about the presence/absence of the SNP in either of the chromosomes).

I may be wrong here or may have used incorrect terminology - I'm new to this sort of analysis and am hence asking for help.

Any help is greatly appreciated.

2
Entering edit mode

Phasing means that you determine which variants are on the same haplotype, hypothetical example with three heterozygous SNPs:

chr1: 12345 A  T 1/0 SNP1
chr1: 12358 T  G 1/0 SNP2
chr1: 13000 C  A 1/0 SNP3


It's probably quite straightforward to determine if SNP1 and SNP2 are on the same allele since some reads will span the short distance between those (assumption: technology is sequencing and no SNP array). So if a read contains both the A and G we could say that on haplotype A we have the A and G allele and on the other haplotype B we have the T and T allele of respectively SNP 1 and SNP 2. As such we could take the reference genome and create two new fasta files for this position with an A and G for haplotypeA.fasta and a T and T for haplotypeB.fasta.

However, the situation is problematic for SNP3 since no read (assuming Illumina sequencing) is going to span from SNP 1 and SNP2 to SNP3. There is no way based on this data to find out if the C allele of SNP3 is on haplotype A or haplotype B. So on which fasta do we put the C and on which fasta do we put the A allele?

Is that the same terminology for phasing you had in mind?

0
Entering edit mode

Yes, I had the same terminology for phasing in mind - but this makes it far clearer. Thanks!

As for my problem statement - I want to get altered FASTA files (like haplotypeA.fasta and haplotypeB.fasta from your example) and I have a VCF file and a reference sequence - however FastaAlternateReferenceMaker only gives me one output FASTA file.

Do you think splitting the VCF file based on the GT field will help?

Thanks again!

0
Entering edit mode

So that means you found a way to determine on which haplotype the alleles from SNP3 are?

I think FastaAlternateReferenceMaker would indeed work if you would have correctly-split vcf files, but I'm unsure if there is a "correct" way to do that.

0
Entering edit mode

I was thinking of an awk statement to split all the file with the same GT information (1|0, 0|1).

Like:

awk 'BEGIN{OFS="\t"} {if($1 ~ /^#/ ||$10 == "0\|1" || $10 == "1\|1"){print;}}' input_VCF_file > output_1.vcf awk 'BEGIN{OFS="\t"} {if($1 ~ /^#/ || $10 == "1\|0" ||$10 == "1\|1"){print;}}' input_VCF_file > output_2.vcf

1
Entering edit mode
4.4 years ago

A solution that worked for me:

I used vcf-consensus (from VCFtools) to generate the variant FASTA sequence. The Haplotype can be specified using the -H parameter (as 1 or 2). The resulting FASTA file can be then used to extract the gene of interest.

0
Entering edit mode
4.4 years ago
Fabio Marroni ★ 2.7k

It writes a fasta file in which reference SNP alleles are substituted with alternative SNP alleles. So, if your haplotypes do not coincide with the reference, it will not help you very much.

Luckily enough you can specify several intervals in which performing the task. So, you can first specify all the intervals (they can also be short, I think) in which haplotype A is coincident with the reference, and have the fasta of the haplotype B. Then, you can specify all the intervals in which the ahplotype B is equal to the reference and you will obtain the fasta of the haplotype A.

Man page is here, and the intervals can be specified with -L interval.file.name It's tricky, but should work!

0
Entering edit mode

I am looking for FASTA files for both chromosomes (listed as 0|1 etc. in the VCF file) - and the tool only gives me one.

The intervals tools is actually really helpful and I am using it to extract my gene of interest from the reference sequence.

Thanks again!

I would appreciate any advice with regrds to getting separate FASTA files.

0
Entering edit mode

Fabio gives a hint, that you could create two interval files, one corresponding to haplotypeA and the other to haplotypeB, as such running FastaAlternateReferenceMaker with each interval file will create two fasta files corresponding to both haplotypes.

0
Entering edit mode

Right! I'll try this out and post my findings here.

Thanks a bunch!