Question: Changing the reference haplotype in VCF
0
gravatar for Michael Kosicki
6 months ago by
Wellcome Trust Sanger Institute
Michael Kosicki80 wrote:

A VCF file describes how to translate the reference genome into (one or more) alternative haplotypes. I need a VCF file that translates one of these haplotypes back into the reference. In other words, I need to "invert" the VCF file I have.

If only SNPs were present, one would simply swap reference and alternative haplotype columns. Indels make it necessary to update the position. Of course, larger structural variants would also change the coordinates (sometimes dramatically), but for now I only care about SNPs and indels.

It does not seem too difficult to code, but would be quite time consuming (for me) to do it right.

Is there a tool for it?

snp alignment indel vcf • 430 views
ADD COMMENTlink modified 6 months ago by Kevin Blighe28k • written 6 months ago by Michael Kosicki80

I guess you can use FastaAlternateReferenceMaker from GATK, but other way around.

ADD REPLYlink written 6 months ago by grant.hovhannisyan1.1k

Could you elaborate? This tool just "merges" reference fasta and a vcf file, and does not produce a new vcf file, as far as I can tell.

ADD REPLYlink written 6 months ago by Michael Kosicki80
1
gravatar for Michael Kosicki
6 months ago by
Wellcome Trust Sanger Institute
Michael Kosicki80 wrote:

Ok, I think I've found the solution myself:

## get mouse chr19
curl ftp://ftp.ensembl.org/pub/release-91/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.19.fa.gz > chr19.fa.gz
zcat chr19.fa.gz > chr19.fa

## get BALBc SNPs
curl ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/BALB_cJ.mgp.v5.indels.dbSNP142.normed.vcf.gz > dir.raw.vcf.gz
zcat dir.raw.vcf.gz > dir.raw.vcf

## save header
grep '^#' dir.raw.vcf > dir.vcf

## collapse to first haplotype (phased or not)
grep -v '^#' dir.raw.vcf | sed 's/,[GTCA]*//' > dir.hap.vcf

## clean up (ie turn CCCA>CA to CCA>A etc)
cat dir.hap.vcf | awk 'BEGIN{OFS="\t"} \
{ if(length($4)>length($5)) {min=length($5)} else { min=length($4) } } \
{ print $1,$2+min-1,$3,substr($4,min),substr($5,min),$6,$7,$8,$9,$10 }' \
> dir.hap.clean.vcf

## resort
sort -k1,1 -k2,2n dir.hap.clean.vcf > dir.hap.sort.vcf

## remove overlapping indels
grep -v '^#' dir.hap.sort.vcf | awk 'BEGIN{pos=0; chr=1} \
{ if($1!=chr) {chr=$1; pos=0} } \
{if($2>pos) { print $0; pos=$2-(length($5)-length($4)) } }' \
>> dir.vcf

## zip and index
bgzip -f dir.vcf && tabix -p vcf dir.vcf.gz

## convert ref to alt
bcftools consensus -f chr19.fa dir.vcf.gz > chr19.alt.fa

## invert vcf
zcat dir.vcf.gz | grep '^#' > alt.vcf &&     # get header
zcat dir.vcf.gz | grep -v '^#' | awk 'BEGIN{OFS="\t"; chr=1; i=0} \
{ if($1!=chr) {chr=$1; i=0} } \
{print $1,$2+i,$3,$5,$4,$6,$7,$8,$9,$10; i=i+length($5)-length($4)}' \
>> alt.vcf

## zip and index
bgzip -f alt.vcf && tabix -p vcf alt.vcf.gz

## convert alt to ref
bcftools consensus -f chr19.alt.fa alt.vcf.gz > chr19.dir2.fa

## check if orignal restored, no difference
cmp --silent chr19.dir2.fa chr19.fa && echo "Files are identical." || echo "Files are different."

This solutions deals with (i.e. removes) alternative haplotypes and overlapping indels (i.e. alternative haplotypes in practice). Of course, one might want to handle these differently, e.g. mask the offending (likely heterozygous) regions completely.

ADD COMMENTlink modified 6 months ago • written 6 months ago by Michael Kosicki80

Looks good - if you are certain that it does what you need, then I can move it to an Answer, and then you can Accept it. Let me know.

ADD REPLYlink written 6 months ago by Kevin Blighe28k

Sure, go ahead, thanks.

ADD REPLYlink written 6 months ago by Michael Kosicki80

Done - you should be able to Accept the answer now

ADD REPLYlink written 6 months ago by Kevin Blighe28k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1020 users visited in the last hour