Question: Tool to fix reference allele nucleotide in vcf
2
gravatar for WouterDeCoster
19 months ago by
Belgium
WouterDeCoster43k wrote:

I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.

It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.

Cheers,
W

pony fasta vcf • 1.3k views
ADD COMMENTlink modified 18 months ago by finswimmer13k • written 19 months ago by WouterDeCoster43k

tagging: Pierre Lindenbaum

ADD REPLYlink written 19 months ago by genomax83k

My vcf files are from structural variants.

what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )

ADD REPLYlink written 19 months ago by Pierre Lindenbaum128k

My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.

Just some lines from the vcf:

chr1    258046  93      G       <INS>   161     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1    297535  96      A       <INS>   96      MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1    350805  98      A       <INS>   649     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1    368841  107     G       <INS>   35      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1    369462  133     C       <INS>   131     SVcluster;MapQual;CIPOS;CIEND   IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1    369520  135     A       <INS>   65      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=

Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.

I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p

ADD REPLYlink modified 19 months ago • written 19 months ago by WouterDeCoster43k

For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.

However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.

ADD REPLYlink written 19 months ago by chrchang5236.9k
5
gravatar for JC
19 months ago by
JC10k
Mexico
JC10k wrote:

Some code like this can help, but be aware that changing the reference column affects other fields in your VCF, such as the GT value.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = $var[3];
        $obs = substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $obs;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;
ADD COMMENTlink modified 19 months ago • written 19 months ago by JC10k
1

Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)

ADD REPLYlink modified 19 months ago • written 19 months ago by WouterDeCoster43k
1

glad to help, I edited the code to show the VCF coordinates and print the header.

ADD REPLYlink written 19 months ago by JC10k

This script is nice to identify the variants whose REF allele is not correct and should be revised. It should be not used to change the reference alllele forcely with this script.

ADD REPLYlink written 11 months ago by Shicheng Guo8.2k
1

Ehm yes it should change those. That is exactly what I asked for.

ADD REPLYlink written 11 months ago by WouterDeCoster43k

So next step is apply plink --a1-allele to change the reference allele?

ADD REPLYlink written 11 months ago by Shicheng Guo8.2k
1

I did not use or need plink. I needed to modify the reference allele in a vcf file.

ADD REPLYlink written 11 months ago by WouterDeCoster43k

So what's the approach for you to modify the reference allele in a vcf. I find there are ~12% SNPs are REF/ALT reversed.

ADD REPLYlink written 11 months ago by Shicheng Guo8.2k

THIS SCRIPT FIXES THE REFERENCE ALLELES in a vcf file which are not correct when compared with a fasta file.

ADD REPLYlink written 11 months ago by WouterDeCoster43k
4
gravatar for finswimmer
18 months ago by
finswimmer13k
Germany
finswimmer13k wrote:

Hello WouterDeCoster ,

there is another bcftools plugin that should do this:

$ bcftools +fill-from-fasta input.vcf -- -c REF -f genome.fa > output.vcf

fin swimmer

ADD COMMENTlink modified 18 months ago • written 18 months ago by finswimmer13k

How to deal with the problem:

No functional bcftools plugins were found. The environment variable BCFTOOLS_PLUGINS is not set.

ADD REPLYlink written 11 months ago by Shicheng Guo8.2k
  1. Are you asking a question or making a statement?
  2. If you're asking a question,
    • what have you tried by yourself to solve it?
    • why are you asking it here as a comment to someone's answer?
  3. If you're making a statement, what "problem" are you addressing?
ADD REPLYlink modified 11 months ago • written 11 months ago by RamRS27k
2
gravatar for WouterDeCoster
19 months ago by
Belgium
WouterDeCoster43k wrote:

This is the python solution I came up with, using cyvcf (minimally requiring v0.10.2) and pyfaidx. It can use compressed vcf and fasta files.

ADD COMMENTlink written 19 months ago by WouterDeCoster43k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1905 users visited in the last hour