Changing Strand Orientation In A Vcf File
1
2
Entering edit mode
9.6 years ago
sagi.polani ▴ 100

Hi everyone,

I'm looking for a way to change the strand orientation in a VCF file. Specifically, I'm looking to change the strand orientation of a dbSNP VCF file so that all variants will be in the + strand orientation (genome orientation) and NOT according to refSNP (transcript orientation).

I'd appreciate any help on this matter.

Thanks!

Sagi

vcf strand • 5.3k views
1
Entering edit mode

I may be off here, but I thought the dbSNP VCF file is in genome orientation. Take rs63750315 as an example:

1    6529504    rs63750315    A    G    .    .    RS=63750315;RSPOS=6529504;RV;....


The A here is the reference sequence on chr1 at position 6529504. I think the RV tag simply means the original submission was in reverse orientation to the genome.

Perhaps I am missing something obvious ...

0
Entering edit mode

Strange. Can you show us a sample of your vcf ?

1
Entering edit mode

In the VCF file you won't find anything that flags the variant as being on the (+) or (-) strand. The variants are all enlisted in VCF format according to the transcript orientation, i.e. refSNP. In GVF format however, you will se a flag:

22 dbSNP SNV 24057817 24057817 . + . ID=119882;Variant_seq=A;Dbxref=dbSNP_130:rs69257152;Reference_seq=T 22 dbSNP SNV 24057823 24057823 . + . ID=119883;Variant_seq=A;Dbxref=dbSNP_130:rs69257153;Reference_seq=T 22 dbSNP SNV 24057971 24057971 . + . ID=119884;Variant_seq=G;Dbxref=dbSNP_130:rs69257159;Reference_seq=A 22 dbSNP SNV 24058008 24058008 . + . ID=119885;Variant_seq=G;Dbxref=dbSNP_130:rs69257160;Reference_seq=A 22 dbSNP SNV 24058561 24058561 . - . ID=119886;Variant_seq=A;Dbxref=dbSNP_130:rs69259883;Reference_seq=T 22 dbSNP SNV 24058588 24058588 . - . ID=119887;Variant_seq=C;Dbxref=dbSNP_130:rs69259882;Reference_seq=A 22 dbSNP SNV 24058615 24058615 . - . ID=119888;Variant_seq=A;Dbxref=dbSNP_130:rs69259881;Reference_seq=G

Thanks!

Sagi

0
Entering edit mode

are we talking about the gzipped VCF file that dbSNP releases, or a custom VCF result you can retrieve from dbSNP? in case it's the former, there's a VCF header that states that the "RV" code is used to flag those SNPs indeed. check my answer below for further details.

1
Entering edit mode
9.6 years ago

now that dbSNP is consistently releasing a VCF file with all their variants, using it as a reference is very useful, and the idea of "making all alleles be referred to forward strand" is not at all a bad idea at all. in fact I'm going to study this for a little while, because I'll may be able to use it for a project of my own. thank you for pointing it out.

this is a perl script I've worked out so far that it would look for "RV" codes through the gzipped dbSNP VCF file, which is how these "reversed SNPs" are encoded according to the VCF headers, and translating their alleles to the opposite strand:

#!/usr/bin/perl -w

# perl packages needed
use strict;
use Compress::Zlib;

# usage
sub usage()
{
print STDERR << "EOF";

usage: perl $0 -r -i infile > outfile processes a gzipped dbSNP file looking for "RV" code lines and reversing their bases download it from ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606/VCF/00-All.vcf.gz by Jorge Amigo Lechuga OPTIONS -i gzipped input file -r remove "RV" codes -h this (help) message EOF exit; } # get options use Getopt::Std; my %opt; my$opt_string = 'i:rh';
getopts( "$opt_string", \%opt ) or usage(); usage() if$opt{h};
my $inputfile =$opt{i} ? $opt{i}: usage(); my$rvremoval = $opt{r} ? 1 : 0; # define the base translations my %trans = qw(A T T A C G G C); my$check = join '|', keys %trans;

# process the gzipped dbSNP file
my $gz = gzopen($inputfile, "rb");
while ( $gz->gzreadline(my$line) > 0 ) {
if ($line =~ /RV;/) { my @cols = split /\t/,$line;
# reverse allele columns
$cols[3] =~ s/($check)/$trans{$1}/g;
$cols[4] =~ s/($check)/$trans{$1}/g;
# remove "RV" code
$cols[7] =~ s/RV;//g if$rvremoval;
print join "\t", @cols;
} else {
print $line; } }$gz->gzclose();

0
Entering edit mode

Hi Jorge,

Thanks for this!

Bevertheless, this script will work for human dbSNP files, but I fear it won't work for non-human dbSNP files in VCF format, since they don't include an "RV" flag or any other in that matter to designate the (-) strand. Do you have a similar script for processing GVF files? These files simply mark the reverse strand by (-)

Thank you very much!

Sagi Thanks

0
Entering edit mode

even easier. you can change the whole script for this one:

#!/usr/bin/perl -w
use strict;
my %trans = qw(A T T A C G G C);
my $check = join '|', keys %trans; while (<>) { if (/\.\s-\s\./) { s/seq=($check)/seq=$trans{$1}/g;
s/\.(\s)-(\s)\./\.$1+$2\./g;
}
print;
}


and call it simply like this:

perl script.pl <infile >outfile


where "infile" is your input file name and "outfile" is the desired output file name. you could change

s/\.(\s)-(\s)\./\.$1+$2\./g;


to this

s/\.\t-\t\./\.\t+\t2\./g;


if the delimiters are tabs, and it will perform a little bit faster, but this I can't tell by just reading your input file example.