Question: Getting haplotype specific FASTA files from VCF files
1
gravatar for Nitro_Shade
3.0 years ago by
Nitro_Shade20
Nitro_Shade20 wrote:

I have been trying to get FASTA (with altered nucleotide sequences) sequences from VCF file.

I have tried using FastaAlternateReferenceMaker (from GATK) but it only gives me a single FASTA file.

Any help will be greatly appreciated. Thank you!

snp genome • 2.7k views
ADD COMMENTlink modified 3.0 years ago • written 3.0 years ago by Nitro_Shade20

This implies that you somehow phased your variants in two haplotypes?

ADD REPLYlink written 3.0 years ago by WouterDeCoster42k

The VCF file I have are phased (they have information about the presence/absence of the SNP in either of the chromosomes).

I may be wrong here or may have used incorrect terminology - I'm new to this sort of analysis and am hence asking for help.

Any help is greatly appreciated.

ADD REPLYlink written 3.0 years ago by Nitro_Shade20
2

Phasing means that you determine which variants are on the same haplotype, hypothetical example with three heterozygous SNPs:

chr1: 12345 A  T 1/0 SNP1
chr1: 12358 T  G 1/0 SNP2
chr1: 13000 C  A 1/0 SNP3

It's probably quite straightforward to determine if SNP1 and SNP2 are on the same allele since some reads will span the short distance between those (assumption: technology is sequencing and no SNP array). So if a read contains both the A and G we could say that on haplotype A we have the A and G allele and on the other haplotype B we have the T and T allele of respectively SNP 1 and SNP 2. As such we could take the reference genome and create two new fasta files for this position with an A and G for haplotypeA.fasta and a T and T for haplotypeB.fasta.

However, the situation is problematic for SNP3 since no read (assuming Illumina sequencing) is going to span from SNP 1 and SNP2 to SNP3. There is no way based on this data to find out if the C allele of SNP3 is on haplotype A or haplotype B. So on which fasta do we put the C and on which fasta do we put the A allele?

Is that the same terminology for phasing you had in mind?

ADD REPLYlink written 3.0 years ago by WouterDeCoster42k

Thank you for your detailed reply.

Yes, I had the same terminology for phasing in mind - but this makes it far clearer. Thanks!

As for my problem statement - I want to get altered FASTA files (like haplotypeA.fasta and haplotypeB.fasta from your example) and I have a VCF file and a reference sequence - however FastaAlternateReferenceMaker only gives me one output FASTA file.

Do you think splitting the VCF file based on the GT field will help?

Thanks again!

ADD REPLYlink written 3.0 years ago by Nitro_Shade20

So that means you found a way to determine on which haplotype the alleles from SNP3 are?

I think FastaAlternateReferenceMaker would indeed work if you would have correctly-split vcf files, but I'm unsure if there is a "correct" way to do that.

ADD REPLYlink written 3.0 years ago by WouterDeCoster42k

I was thinking of an awk statement to split all the file with the same GT information (1|0, 0|1).

Like:

awk 'BEGIN{OFS="\t"} {if($1 ~ /^#/ || $10 == "0\|1" || $10 == "1\|1"){print;}}' input_VCF_file > output_1.vcf
awk 'BEGIN{OFS="\t"} {if($1 ~ /^#/ || $10 == "1\|0" || $10 == "1\|1"){print;}}' input_VCF_file > output_2.vcf
ADD REPLYlink written 3.0 years ago by Nitro_Shade20
1
gravatar for Nitro_Shade
3.0 years ago by
Nitro_Shade20
Nitro_Shade20 wrote:

A solution that worked for me:

I used vcf-consensus (from VCFtools) to generate the variant FASTA sequence. The Haplotype can be specified using the -H parameter (as 1 or 2). The resulting FASTA file can be then used to extract the gene of interest.

ADD COMMENTlink written 3.0 years ago by Nitro_Shade20
0
gravatar for Fabio Marroni
3.0 years ago by
Fabio Marroni2.4k
Italy
Fabio Marroni2.4k wrote:

FastaAlternateReferenceMaker performs one simple task.

It writes a fasta file in which reference SNP alleles are substituted with alternative SNP alleles. So, if your haplotypes do not coincide with the reference, it will not help you very much.

Luckily enough you can specify several intervals in which performing the task. So, you can first specify all the intervals (they can also be short, I think) in which haplotype A is coincident with the reference, and have the fasta of the haplotype B. Then, you can specify all the intervals in which the ahplotype B is equal to the reference and you will obtain the fasta of the haplotype A.

Man page is here, and the intervals can be specified with -L interval.file.name It's tricky, but should work!

ADD COMMENTlink written 3.0 years ago by Fabio Marroni2.4k

Thank you for replying.

I am looking for FASTA files for both chromosomes (listed as 0|1 etc. in the VCF file) - and the tool only gives me one.

The intervals tools is actually really helpful and I am using it to extract my gene of interest from the reference sequence.

Thanks again!

I would appreciate any advice with regrds to getting separate FASTA files.

ADD REPLYlink written 3.0 years ago by Nitro_Shade20

Fabio gives a hint, that you could create two interval files, one corresponding to haplotypeA and the other to haplotypeB, as such running FastaAlternateReferenceMaker with each interval file will create two fasta files corresponding to both haplotypes.

ADD REPLYlink written 3.0 years ago by WouterDeCoster42k

Right! I'll try this out and post my findings here.

Thanks a bunch!

ADD REPLYlink written 3.0 years ago by Nitro_Shade20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1783 users visited in the last hour