Question: Tool to fix reference allele nucleotide in vcf
2
gravatar for WouterDeCoster
8 months ago by
Belgium
WouterDeCoster40k wrote:

I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.

It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.

Cheers,
W

pony fasta vcf • 673 views
ADD COMMENTlink modified 8 months ago by finswimmer11k • written 8 months ago by WouterDeCoster40k

tagging: Pierre Lindenbaum

ADD REPLYlink written 8 months ago by genomax69k

My vcf files are from structural variants.

what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )

ADD REPLYlink written 8 months ago by Pierre Lindenbaum121k

My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.

Just some lines from the vcf:

chr1    258046  93      G       <INS>   161     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1    297535  96      A       <INS>   96      MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1    350805  98      A       <INS>   649     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1    368841  107     G       <INS>   35      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1    369462  133     C       <INS>   131     SVcluster;MapQual;CIPOS;CIEND   IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1    369520  135     A       <INS>   65      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=

Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.

I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p

ADD REPLYlink modified 8 months ago • written 8 months ago by WouterDeCoster40k

For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.

However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.

ADD REPLYlink written 8 months ago by chrchang5235.4k
5
gravatar for JC
8 months ago by
JC8.0k
Mexico
JC8.0k wrote:

Some code like this can help, but be aware that changing the reference column affects other fields in your VCF, such as the GT value.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = $var[3];
        $obs = substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $obs;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;
ADD COMMENTlink modified 8 months ago • written 8 months ago by JC8.0k
1

Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)

ADD REPLYlink modified 8 months ago • written 8 months ago by WouterDeCoster40k
1

glad to help, I edited the code to show the VCF coordinates and print the header.

ADD REPLYlink written 8 months ago by JC8.0k

This script is nice to identify the variants whose REF allele is not correct and should be revised. It should be not used to change the reference alllele forcely with this script.

ADD REPLYlink written 28 days ago by Shicheng Guo7.6k
1

Ehm yes it should change those. That is exactly what I asked for.

ADD REPLYlink written 28 days ago by WouterDeCoster40k

So next step is apply plink --a1-allele to change the reference allele?

ADD REPLYlink written 28 days ago by Shicheng Guo7.6k
1

I did not use or need plink. I needed to modify the reference allele in a vcf file.

ADD REPLYlink written 28 days ago by WouterDeCoster40k

So what's the approach for you to modify the reference allele in a vcf. I find there are ~12% SNPs are REF/ALT reversed.

ADD REPLYlink written 28 days ago by Shicheng Guo7.6k

THIS SCRIPT FIXES THE REFERENCE ALLELES in a vcf file which are not correct when compared with a fasta file.

ADD REPLYlink written 28 days ago by WouterDeCoster40k
4
gravatar for finswimmer
8 months ago by
finswimmer11k
Germany
finswimmer11k wrote:

Hello WouterDeCoster ,

there is another bcftools plugin that should do this:

$ bcftools +fill-from-fasta input.vcf -- -c REF -f genome.fa > output.vcf

fin swimmer

ADD COMMENTlink modified 8 months ago • written 8 months ago by finswimmer11k

How to deal with the problem:

No functional bcftools plugins were found. The environment variable BCFTOOLS_PLUGINS is not set.

ADD REPLYlink written 28 days ago by Shicheng Guo7.6k
  1. Are you asking a question or making a statement?
  2. If you're asking a question,
    • what have you tried by yourself to solve it?
    • why are you asking it here as a comment to someone's answer?
  3. If you're making a statement, what "problem" are you addressing?
ADD REPLYlink modified 28 days ago • written 28 days ago by RamRS22k
2
gravatar for WouterDeCoster
8 months ago by
Belgium
WouterDeCoster40k wrote:

This is the python solution I came up with, using cyvcf (minimally requiring v0.10.2) and pyfaidx. It can use compressed vcf and fasta files.

ADD COMMENTlink written 8 months ago by WouterDeCoster40k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 857 users visited in the last hour