Question: Changing the reference haplotype in VCF
0
gravatar for Michael Kosicki
3 months ago by
Wellcome Trust Sanger Institute
Michael Kosicki80 wrote:

A VCF file describes how to translate the reference genome into (one or more) alternative haplotypes. I need a VCF file that translates one of these haplotypes back into the reference. In other words, I need to "invert" the VCF file I have.

If only SNPs were present, one would simply swap reference and alternative haplotype columns. Indels make it necessary to update the position. Of course, larger structural variants would also change the coordinates (sometimes dramatically), but for now I only care about SNPs and indels.

It does not seem too difficult to code, but would be quite time consuming (for me) to do it right.

Is there a tool for it?

snp alignment indel vcf • 311 views
ADD COMMENTlink modified 3 months ago by Kevin Blighe21k • written 3 months ago by Michael Kosicki80
1
gravatar for Michael Kosicki
3 months ago by
Wellcome Trust Sanger Institute
Michael Kosicki80 wrote:

Ok, I think I've found the solution myself:

## get mouse chr19
curl ftp://ftp.ensembl.org/pub/release-91/fasta/mus_musculus/dna/Mus_musculus.GRCm38.dna.chromosome.19.fa.gz > chr19.fa.gz
zcat chr19.fa.gz > chr19.fa

## get BALBc SNPs
curl ftp://ftp-mouse.sanger.ac.uk/current_snps/strain_specific_vcfs/BALB_cJ.mgp.v5.indels.dbSNP142.normed.vcf.gz > dir.raw.vcf.gz
zcat dir.raw.vcf.gz > dir.raw.vcf

## save header
grep '^#' dir.raw.vcf > dir.vcf

## collapse to first haplotype (phased or not)
grep -v '^#' dir.raw.vcf | sed 's/,[GTCA]*//' > dir.hap.vcf

## clean up (ie turn CCCA>CA to CCA>A etc)
cat dir.hap.vcf | awk 'BEGIN{OFS="\t"} \
{ if(length($4)>length($5)) {min=length($5)} else { min=length($4) } } \
{ print $1,$2+min-1,$3,substr($4,min),substr($5,min),$6,$7,$8,$9,$10 }' \
> dir.hap.clean.vcf

## resort
sort -k1,1 -k2,2n dir.hap.clean.vcf > dir.hap.sort.vcf

## remove overlapping indels
grep -v '^#' dir.hap.sort.vcf | awk 'BEGIN{pos=0; chr=1} \
{ if($1!=chr) {chr=$1; pos=0} } \
{if($2>pos) { print $0; pos=$2-(length($5)-length($4)) } }' \
>> dir.vcf

## zip and index
bgzip -f dir.vcf && tabix -p vcf dir.vcf.gz

## convert ref to alt
bcftools consensus -f chr19.fa dir.vcf.gz > chr19.alt.fa

## invert vcf
zcat dir.vcf.gz | grep '^#' > alt.vcf &&     # get header
zcat dir.vcf.gz | grep -v '^#' | awk 'BEGIN{OFS="\t"; chr=1; i=0} \
{ if($1!=chr) {chr=$1; i=0} } \
{print $1,$2+i,$3,$5,$4,$6,$7,$8,$9,$10; i=i+length($5)-length($4)}' \
>> alt.vcf

## zip and index
bgzip -f alt.vcf && tabix -p vcf alt.vcf.gz

## convert alt to ref
bcftools consensus -f chr19.alt.fa alt.vcf.gz > chr19.dir2.fa

## check if orignal restored, no difference
cmp --silent chr19.dir2.fa chr19.fa && echo "Files are identical." || echo "Files are different."

This solutions deals with (i.e. removes) alternative haplotypes and overlapping indels (i.e. alternative haplotypes in practice). Of course, one might want to handle these differently, e.g. mask the offending (likely heterozygous) regions completely.

ADD COMMENTlink modified 3 months ago • written 3 months ago by Michael Kosicki80

Looks good - if you are certain that it does what you need, then I can move it to an Answer, and then you can Accept it. Let me know.

ADD REPLYlink written 3 months ago by Kevin Blighe21k

Sure, go ahead, thanks.

ADD REPLYlink written 3 months ago by Michael Kosicki80

Done - you should be able to Accept the answer now

ADD REPLYlink written 3 months ago by Kevin Blighe21k
0
gravatar for grant.hovhannisyan
3 months ago by
grant.hovhannisyan880 wrote:

I guess you can use FastaAlternateReferenceMaker from GATK, but other way around.

ADD COMMENTlink written 3 months ago by grant.hovhannisyan880

Could you elaborate? This tool just "merges" reference fasta and a vcf file, and does not produce a new vcf file, as far as I can tell.

ADD REPLYlink written 3 months ago by Michael Kosicki80
0
gravatar for Kevin Blighe
3 months ago by
Kevin Blighe21k
University College London Cancer Institute
Kevin Blighe21k wrote:

Here is a solution but using coding, and tested on a 1000 Genomes BCF file that i have. You can see that it functions correctly based on the fact that I had previously set the ID field to unique values based on CHROM:POS:REF:ALT:

bcftools view 1000Genomes.Norm.bcf | awk '/^#/ {print $0}; \
!/^#/ {{if (length($4)==1 && length($5)==1) {printf $1"\t"$2"\t"$3"\t"$5"\t"$4"\t"$6"\t"$7"\t"$8"\t"$9"\t"} \
else {printf $1"\t"$2"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"} \
for (i=10; i<=NF; i++) {printf $(i)"\t"; if (i==NF) {printf "\n"}}}}' - \
| head -85 | cut -f 1-9


##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##filedate=20141112
##source="simplfy-vcf (r1211)"
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##contig=<ID=1>
##bcftools_normVersion=1.3.1+htslib-1.3.1
##bcftools_normCommand=norm -Ou -m-any 1000Genomes/chr1.1kg.phase3.v5.vcf.gz
##bcftools_normCommand=norm -Ou -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta
##bcftools_annotateVersion=1.3.1+htslib-1.3.1
##bcftools_annotateCommand=annotate -Ob -I +%ID
##contig=<ID=2>
##contig=<ID=3>
##contig=<ID=4>
##contig=<ID=5>
##contig=<ID=6>
##contig=<ID=7>
##contig=<ID=8>
##contig=<ID=9>
##contig=<ID=10>
##contig=<ID=11>
##contig=<ID=12>
##contig=<ID=13>
##contig=<ID=14>
##contig=<ID=15>
##contig=<ID=16>
##contig=<ID=17>
##contig=<ID=18>
##contig=<ID=19>
##contig=<ID=20>
##contig=<ID=21>
##contig=<ID=22>
##contig=<ID=X>
##bcftools_concatVersion=1.3.1+htslib-1.3.1
##bcftools_concatCommand=concat -Ob chr1.1kg.phase3.v5.bcf chr2.1kg.phase3.v5.bcf chr3.1kg.phase3.v5.bcf chr4.1kg.phase3.v5.bcf chr5.1kg.phase3.v5.bcf chr6.1kg.phase3.v5.bcf chr7.1kg.phase3.v5.bcf chr8.1kg.phase3.v5.bcf chr9.1kg.phase3.v5.bcf chr10.1kg.phase3.v5.bcf chr11.1kg.phase3.v5.bcf chr12.1kg.phase3.v5.bcf chr13.1kg.phase3.v5.bcf chr14.1kg.phase3.v5.bcf chr15.1kg.phase3.v5.bcf chr16.1kg.phase3.v5.bcf chr17.1kg.phase3.v5.bcf chr18.1kg.phase3.v5.bcf chr19.1kg.phase3.v5.bcf chr20.1kg.phase3.v5.bcf chr21.1kg.phase3.v5.bcf chr22.1kg.phase3.v5.bcf chrX.1kg.phase3.v5.bcf
##bcftools_normCommand=norm -Ou -m-any 1000Genomes.bcf
##bcftools_annotateCommand=annotate -Ob -x ID -I +%CHROM:%POS:%POS:%REF:%ALT
##bcftools_viewVersion=1.2+htslib-1.2.1
##bcftools_viewCommand=view /Kev/CollegeWork/12TheClinicalBioinformaticsCompanyLtd2015-/Projects/AvellinoLabUSA/2/1000Genomes/1000Genomes.Norm.bcf
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT
1   10177   1:10177:10177:A:AC  A   AC  .   PASS    .   GT
1   10235   1:10235:10235:T:TA  T   TA  .   PASS    .   GT
1   10352   1:10352:10352:T:TA  T   TA  .   PASS    .   GT
1   10616   1:10616:10616:CCGCCGTTGCAAAGGCGCGCCG:C  CCGCCGTTGCAAAGGCGCGCCG  C   .   PASS    .   GT
1   10642   1:10642:10642:G:A   A   G   .   PASS    .   GT
1   11008   1:11008:11008:C:G   G   C   .   PASS    .   GT
1   11012   1:11012:11012:C:G   G   C   .   PASS    .   GT
1   11063   1:11063:11063:T:G   G   T   .   PASS    .   GT
1   13110   1:13110:13110:G:A   A   G   .   PASS    .   GT
1   13116   1:13116:13116:T:G   G   T   .   PASS    .   GT
1   13118   1:13118:13118:A:G   G   A   .   PASS    .   GT
1   13273   1:13273:13273:G:C   C   G   .   PASS    .   GT
1   13284   1:13284:13284:G:A   A   G   .   PASS    .   GT
1   13380   1:13380:13380:C:G   G   C   .   PASS    .   GT
1   13483   1:13483:13483:G:C   C   G   .   PASS    .   GT
1   13494   1:13494:13494:A:G   G   A   .   PASS    .   GT
1   13550   1:13550:13550:G:A   A   G   .   PASS    .   GT
1   14464   1:14464:14464:A:T   T   A   .   PASS    .   GT
1   14599   1:14599:14599:T:A   A   T   .   PASS    .   GT
1   14604   1:14604:14604:A:G   G   A   .   PASS    .   GT
1   14930   1:14930:14930:A:G   G   A   .   PASS    .   GT
1   14933   1:14933:14933:G:A   A   G   .   PASS    .   GT
1   15211   1:15211:15211:T:G   G   T   .   PASS    .   GT
1   15245   1:15245:15245:C:T   T   C   .   PASS    .   GT
1   15274   1:15274:15274:A:G   G   A   .   PASS    .   GT
1   15274   1:15274:15274:A:T   T   A   .   PASS    .   GT
1   15585   1:15585:15585:G:A   A   G   .   PASS    .   GT
1   15644   1:15644:15644:G:A   A   G   .   PASS    .   GT
1   15774   1:15774:15774:G:A   A   G   .   PASS    .   GT
1   15777   1:15777:15777:A:G   G   A   .   PASS    .   GT
1   15790   1:15790:15790:G:A   A   G   .   PASS    .   GT
1   15820   1:15820:15820:G:T   T   G   .   PASS    .   GT
1   15903   1:15903:15903:G:GC  G   GC  .   PASS    .   GT
1   16071   1:16071:16071:G:A   A   G   .   PASS    .   GT
1   16141   1:16141:16141:C:T   T   C   .   PASS    .   GT
1   16142   1:16142:16142:G:A   A   G   .   PASS    .   GT
1   16515   1:16515:16515:A:C   C   A   .   PASS    .   GT
1   16542   1:16542:16542:C:A   A   C   .   PASS    .   GT
1   16949   1:16949:16949:A:C   C   A   .   PASS    .   GT
1   18643   1:18643:18643:G:A   A   G   .   PASS    .   GT
1   18849   1:18849:18849:C:G   G   C   .   PASS    .   GT
1   19391   1:19391:19391:G:A   A   G   .   PASS    .   GT
1   28914   1:28914:28914:G:A   A   G   .   PASS    .   GT
1   30923   1:30923:30923:G:T   T   G   .   PASS    .   GT
1   46285   1:46285:46285:ATAT:A    ATAT    A   .   PASS    .   GT
ADD COMMENTlink modified 3 months ago • written 3 months ago by Kevin Blighe21k

I need to also "invert" the indels, this script seems to ignore them and only "invert" the SNPs.

ADD REPLYlink written 3 months ago by Michael Kosicki80

Yes, I wrote it to just flip the SNPs / SNVs. You had mentioned how, by flipping indels, the position would change. I'm actually looking at it right now and I believe that the position stays the same for indels. This code will do it:

bcftools view 1000Genomes.Norm.bcf | awk '/^#/ {print $0}; \
!/^#/ {{if (length($4)==1 && length($5)==1) {printf $1"\t"$2"\t"$3"\t"$5"\t"$4"\t"$6"\t"$7"\t"$8"\t"$9"\t"} \
else {printf $1"\t"$2"\t"$3"\t"$5"\t"$4"\t"$6"\t"$7"\t"$8"\t"$9"\t"} \
for (i=10; i<=NF; i++) {printf $(i)"\t"; if (i==NF) {printf "\n"}}}}' - \
| head -85 | cut -f 1-9

...or, even shorter:

bcftools view 1000Genomes.Norm.bcf | awk '/^#/ {print $0}; \
!/^#/ {{printf $1"\t"$2"\t"$3"\t"$5"\t"$4"\t"$6"\t"$7"\t"$8"\t"$9"\t"} \
for (i=10; i<=NF; i++) {printf $(i)"\t"; if (i==NF) {printf "\n"}}}' - \
| head -85 | cut -f 1-9
ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe21k

I don't get it. Consider ATC>ATGCT with indels 2 T TG and 3 C CT. "Inverse" indels are 2 TG T and 4 CT C - second one shifted due to the first one "adding" sequence.

ADD REPLYlink written 3 months ago by Michael Kosicki80

It is for those cases that you should always normalise your VCF variants by left-aligning indels and splitting multi-allelic calls into multiple lines. This helps to drastically reduce ambiguity. Even the publicly-available 1000 Genomes data is not normalised.

This can easily be done with:

bcftools norm -Ov -m-any MyVariants.vcf > MyVariants.Norm.vcf

When you do this and you flip the REF and ALT alleles, you do not need to subsequently change the POS field. Please test it by loading the different variant positions in the UCSC Genome Browser from my example above. If you additionally wanted to record the end-position of the indel, which is not recorded as standard in VCF format, then that would of course change.

Edit: if you look at the example that I posted above, I normalised that 1000 Genomes BCF file - see entries 8 and 9 in the header:

##bcftools_normCommand=norm -Ou -m-any 1000Genomes/chr1.1kg.phase3.v5.vcf.gz
##bcftools_normCommand=norm -Ou -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta
ADD REPLYlink modified 3 months ago • written 3 months ago by Kevin Blighe21k

Can you please confirm whether this has helped or not by up-voting and/or accepting (also for the post of grant.hovhannisyan)? We are volunteers here and give up free time to help others. We appreciate acknowledgements and, more importantly, it's good practice to show others with the same problem the solutions / tips that worked.

Thanks, Kevin

ADD REPLYlink written 3 months ago by Kevin Blighe21k

I think I might not have made myself clear.

Consider a fasta reference "ref1" and a vcf file "vcf1" such that replacing all the variants in "ref1" with ones in "vcf1" gives me an alternative reference "ref2", which may have more or less nucleotides than "ref1" depending on the types of indels in "vcf1". What I need is such a vcf file "vcf2", that when I replace all the variants in "ref2" with ones in "vcf2" I would get "ref1" again. In pseudo-mathematic form: ref1+vcf1=ref2, ref2+vcf2=ref1.

Another example (same as ATC>ATGCT, but more dramatic): let's say there's two variants in vcf1: a 1000nt insertion and a SNV on the next position. Ref2 would be 1000nt longer than ref1. If I want to convert ref2 to ref1 again, the SNV entry would have to have a position 1002, with respect to the insertion. No matter, which way one aligns the indels, that's what it's going to be.

And yes, I've checked, with normalization or without (bcftools norm does not modify my vcf file, because it's already normed) and it does not work. Applying vcf-convert to ref2 (obtained with vcf-convert on ref1 and vcf1) and vcf2 (obtained with your script, which does swap column 4 and 5 as you say) yields an error of "the ref does not agree at...", as expected.

ADD REPLYlink written 3 months ago by Michael Kosicki80
2

If you have only SNPs, then just swap the REF and ALT column in initial vcf file. The same with indels, but you have to recalculate coordinates in fasta and vcf (here you have to code, don't know any tools for that)

ADD REPLYlink written 3 months ago by grant.hovhannisyan880
1

Hi Michael, thanks for getting back. That does indeed seem like a complex and very specific problem! It's not impossible but I think that you may have to go the coding route.

A quick search through Google reveals possible leads:

*the final link is what Grant had suggested

Kevin

ADD REPLYlink written 3 months ago by Kevin Blighe21k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1360 users visited in the last hour