Question: Tool to fix reference allele nucleotide in vcf
2
gravatar for WouterDeCoster
5 weeks ago by
Belgium
WouterDeCoster35k wrote:

I am looking for a tool/script/pony to correct the REF column in a vcf file whenever that nucleotide doesn't match the reference genome, as supplied in fasta format.

It sounds like a common task but I could not find something. I did find bcftools +fixref but that only works for SNPs. My vcf files are from structural variants.

Cheers,
W

pony fasta vcf • 266 views
ADD COMMENTlink modified 27 days ago by finswimmer7.9k • written 5 weeks ago by WouterDeCoster35k

tagging: Pierre Lindenbaum

ADD REPLYlink written 5 weeks ago by genomax59k

My vcf files are from structural variants.

what does it currently look like in the REF/ALT column ? (why do you need to fill the REF column (thinking of a very large deletion...) in your downstream analysis ? )

ADD REPLYlink written 5 weeks ago by Pierre Lindenbaum115k

My main issue is that EVA complains that the nucleotides are incorrect. It seems their software only uses POS to check the REF nucleotide, and not END.

Just some lines from the vcf:

chr1    258046  93      G       <INS>   161     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=258047;CIPOS=-10,1;CIEND=-10,1;SVLEN=126;RT=0,8,0;GAP=126;MAPQ=60,60;PID=61.386,1.698;PLENGTH=0.036,1.185;R
chr1    297535  96      A       <INS>   96      MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=297536;CIPOS=-1,1;CIEND=-1,1;SVLEN=68;RT=0,2,0;GAP=68;MAPQ=60,60;PID=6.857,456.896;PLENGTH=0.409,0.005;RLEN
chr1    350805  98      A       <INS>   649     MapQual IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=350806;CIPOS=-1,3;CIEND=-1,3;SVLEN=40;RT=0,11,0;GAP=40;MAPQ=36,36;PID=33.595,1.201;PLENGTH=0.094,2.548;RLEN
chr1    368841  107     G       <INS>   35      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=368842;CIPOS=-2,1;CIEND=-2,1;SVLEN=43;RT=0,4,0;GAP=43;MAPQ=42,42;PID=1.298,27.038;PLENGTH=0
chr1    369462  133     C       <INS>   131     SVcluster;MapQual;CIPOS;CIEND   IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369463;CIPOS=-14,19;CIEND=-14,19;SVLEN=72;RT=0,6,0;GAP=72;MAPQ=56,56;PID=7.453,8.49
chr1    369520  135     A       <INS>   65      SVcluster;MapQual       IMPRECISE;SVTYPE=INS;SVMETHOD=NanoSV_v1.2.0;END=369521;CIPOS=-4,4;CIEND=-4,4;SVLEN=42;RT=0,5,0;GAP=42;MAPQ=35,35;PID=23.347,20.085;PLENGTH=

Essentially a single nucleotide would be sufficient - and in one file 98% is correct. The other file has N for every position so some more changes need to be made.

I'm already working on a python script (now posted as answer) but I thought someone else should have done it already ;p

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by WouterDeCoster35k

For your narrow purpose, where all your insertions are symbolic rather than explicitly spelled out, the existing answers should be a satisfactory workaround to "bcftools +fixref"'s insistence on SNP-only data.

However, it's worth noting that there is a very good reason that bcftools insists on only SNPs. It's actually impossible to fix ordinary short indels: the only difference between a 1-base deletion and an insertion of the same base at the same position is the REF/ALT order.

ADD REPLYlink written 5 weeks ago by chrchang5234.3k
4
gravatar for JC
5 weeks ago by
JC7.0k
Mexico
JC7.0k wrote:

Some code like this can help, but be aware that changing the reference column affects other fields in your VCF, such as the GT value.

#!/usr/bin/perl

use strict;
use warnings;

$ARGV[2] or die "use fixRefVCF.pl <GENOME_FASTA> <VCF> <FIX_VCF>\n";

my $genome = shift @ARGV;
my $orig_vcf = shift @ARGV;
my $new_vcf = shift @ARGV;

my %seq = ();
my ($chr, $pos, $ref, $obs);

warn "loading genome sequence from $genome\n";
open (my $fh, "<", $genome) or die "cannot read $genome\n";
while (<$fh>) {
    chomp;
    if (/>(\w+)/) {
        $chr = $1;
    } else {
       $seq{$chr} .= $_;
    }
}
close $fh;

open (my $vh, "<", $orig_vcf) or die "cannot read $orig_vcf\n";
open (my $oh, ">", $new_vcf) or die "cannot write $new_vcf\n";
while (<$vh>) {
    if (/^#/) {
         print $oh $_;
    } else {
        chomp;
        my @var = split (/\t/, $_);
        $chr = $var[0];
        $pos = $var[1]; # VCF is 1-based, Perl string is 0-based
        $ref = $var[3];
        $obs = substr($seq{$chr}, $pos - 1, length $ref);
        if ($ref ne $obs) {
            warn "editing $chr:$pos $ref->$obs\n";
            $var[3] = $obs;
        }
        print $oh join "\t", @var;
        print $oh "\n";
    }
}
close $vh;
close $oh;
ADD COMMENTlink modified 5 weeks ago • written 5 weeks ago by JC7.0k
1

Thanks! That seems to work beautifully well (and fast too).
Two comments: the warn "editing $chr:$pos $ref->$obs\n"; is off-by-one and the header is not written to the new vcf file ;-)

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by WouterDeCoster35k
1

glad to help, I edited the code to show the VCF coordinates and print the header.

ADD REPLYlink written 5 weeks ago by JC7.0k
3
gravatar for finswimmer
27 days ago by
finswimmer7.9k
Germany
finswimmer7.9k wrote:

Hello WouterDeCoster ,

there is another bcftools plugin that should do this:

$ bcftools +fill-from-fasta input.vcf -- -c REF -f genome.fa > output.vcf

fin swimmer

ADD COMMENTlink modified 27 days ago • written 27 days ago by finswimmer7.9k
2
gravatar for WouterDeCoster
5 weeks ago by
Belgium
WouterDeCoster35k wrote:

This is the python solution I came up with, using cyvcf (minimally requiring v0.10.2) and pyfaidx. It can use compressed vcf and fasta files.

ADD COMMENTlink written 5 weeks ago by WouterDeCoster35k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1253 users visited in the last hour