Question: Changing the order of ref/alt alleles in a phased vcf file
3.7 years ago by
yorgos.athanasiadis40 wrote:


I have a vcf file with phased data ("dataset1"), which I want to analyse together with some other genotype data ("dataset2").

For some loci in dataset1, the ref/alt alleles are opposite to those in dataset2, i.e. for a given SNP I get A/G and G/A respectively.

I have two questions:

Is there any way I can either

(i) quickly check the ref/alt consistency across all my loci in the two datasets and ideally remove all inconsistent positions? Would the --diff-site-discordance flag from vcftools do something like that?


(ii) swap the ref/alt information for the SNPs of my choice directly in the vcf files? I want to avoid converting to plink because I don't want to lose the phase information.

Any ideas will be very much appreciated.


3.7 years ago by
Trento, IT
Nicola Casiraghi440 wrote:


(i) Hi, yes, the --diff-site-discordance option calculates discordance on a site-by-site basis.
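
For the "remove all inconsistent positions" part of (i), something along these lines might also work directly in R. This is only a sketch: `classify_loci` is a hypothetical helper (not a vcftools feature), the toy records are made up, and the V1/V2/V4/V5 column names assume the VCF records were read with read.table without a header.

```r
# Hypothetical helper, not part of vcftools: classify the loci shared by two
# VCF record tables (columns V1 = CHROM, V2 = POS, V4 = REF, V5 = ALT).
classify_loci <- function(d1, d2) {
  shared <- merge(d1[, c("V1", "V2", "V4", "V5")],
                  d2[, c("V1", "V2", "V4", "V5")],
                  by = c("V1", "V2"), suffixes = c("_1", "_2"))
  same <- shared$V4_1 == shared$V4_2 & shared$V5_1 == shared$V5_2
  swapped <- shared$V4_1 == shared$V5_2 & shared$V5_1 == shared$V4_2
  shared$status <- ifelse(same, "consistent", ifelse(swapped, "swapped", "other"))
  shared
}

# Toy records standing in for the two datasets (made-up positions and alleles)
d1 <- data.frame(V1 = c("1", "1", "1"), V2 = c("100", "200", "300"),
                 V3 = ".", V4 = c("A", "C", "G"), V5 = c("G", "T", "A"),
                 stringsAsFactors = FALSE)
d2 <- data.frame(V1 = c("1", "1", "1"), V2 = c("100", "200", "300"),
                 V3 = ".", V4 = c("A", "T", "G"), V5 = c("G", "C", "C"),
                 stringsAsFactors = FALSE)

status <- classify_loci(d1, d2)
# keep only the dataset2 records whose alleles agree with dataset1
keep <- status[status$status == "consistent", c("V1", "V2")]
d2_consistent <- merge(d2, keep, by = c("V1", "V2"))
```

Note that the final merge is an inner join, so dataset2 positions that are absent from dataset1 are dropped as well; whether that is what you want depends on the downstream analysis.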

(ii) Test this piece of R code. For each locus shared between datasets 1 and 2 (same chromosome and same position), it checks whether the reference and alternative alleles are consistent. If they are reversed, it swaps ref and alt in dataset 2 for that locus.


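# import datasets
vcf1 <- "path/to/dataset1.vcf" # this will be used as reference
vcf2 <- "path/to/dataset2.vcf" # this will be swopped
# read the VCF records; read.table skips the "#" header lines by default,
# and colClasses keeps a column of "T" alleles from being parsed as TRUE
dataset1 <- read.table(vcf1, stringsAsFactors = FALSE, colClasses = "character")
dataset2 <- read.table(vcf2, stringsAsFactors = FALSE, colClasses = "character")
# check ref/alt for common loci (same chromosome V1 and same position V2)
common_loci <- merge(dataset1, dataset2, by = c("V1", "V2"), all = FALSE, suffixes = c("_dataset1", "_dataset2"))
dataset2_swop <- dataset2
for (id in seq_len(nrow(common_loci))) {
  chr <- common_loci$V1[id]
  pos <- common_loci$V2[id]
  ref <- common_loci$V4_dataset1[id]
  alt <- common_loci$V5_dataset1[id]
  locus_index <- which(dataset2$V1 == chr & dataset2$V2 == pos)
  # keep only the records whose ref/alt are reversed relative to dataset1
  swopped <- locus_index[dataset2$V4[locus_index] == alt & dataset2$V5[locus_index] == ref]
  if (length(swopped) > 0) {
    # print loci with ref/alt swopped between datasets
    cat("inconsistent locus:", chr, pos, "\n")
    # only the REF/ALT columns change here; the GT fields (0 = REF, 1 = ALT) are not recoded
    dataset2_swop[swopped, 4] <- ref
    dataset2_swop[swopped, 5] <- alt
  }
}
# print tab delimited output (the "#" header lines are not carried over)
write.table(dataset2_swop, file = "dataset2.swop.vcf", quote = FALSE, col.names = FALSE, row.names = FALSE, sep = "\t")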
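
One thing to keep in mind: in a VCF the GT values index the alleles (0 = REF, 1 = ALT), so exchanging REF and ALT for a record probably also calls for recoding its genotypes, otherwise 0|1 would silently change meaning. A minimal sketch of that recoding follows, assuming biallelic sites and GT as the first FORMAT subfield. `flip_gt` is a hypothetical helper and the toy record is made up; allele-specific INFO fields such as AC or AF are not handled.

```r
# Hypothetical helper: recode the GT subfield of biallelic sample fields after
# REF and ALT have been exchanged, so 0 becomes 1 and 1 becomes 0.
flip_gt <- function(sample_field) {
  parts <- strsplit(sample_field, ":", fixed = TRUE)
  vapply(parts, function(p) {
    p[1] <- chartr("01", "10", p[1])  # "0|1" -> "1|0"; "." and "|" are untouched
    paste(p, collapse = ":")
  }, character(1))
}

flipped <- flip_gt(c("0|1", "1|1:35", "0|0:12:99", ".|."))

# Toy VCF record with two phased samples (columns 10 onward hold the samples)
row <- data.frame(V1 = "1", V2 = "200", V3 = ".", V4 = "T", V5 = "C",
                  V6 = ".", V7 = "PASS", V8 = ".", V9 = "GT",
                  V10 = "0|1", V11 = "1|1", stringsAsFactors = FALSE)
# exchange REF and ALT, then recode every sample column
tmp <- row$V4
row$V4 <- row$V5
row$V5 <- tmp
row[10:ncol(row)] <- lapply(row[10:ncol(row)], flip_gt)
```

Because the separator is left alone, the phase ("|") is preserved: the haplotype order stays the same and only the allele labels change.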

