Question

Converting VCF file to table format

0

Entering edit mode

21 months ago

anikcropscience ▴ 230

Hello,

I am trying to perform a selection scan analysis following the scripts from https://github.com/ngarud/SelectionHapStats

According to the guidelines, I have converted my VCF files for each chromosome into a text file separately. I used the following R-codes for file conversion:

library(pegas)
library(ape)
vcf <- read.vcf("file.vcf", from=1, to=3000000)
vcf.t <- t(as.matrix(vcf))
# transform homozygous sites
vcf.t <- sub("A/A", "A", vcf.t, perl=TRUE)
vcf.t <- sub("T/T", "T", vcf.t, perl=TRUE)
vcf.t <- sub("C/C", "C", vcf.t, perl=TRUE)
vcf.t <- sub("G/G", "G", vcf.t, perl=TRUE)
# transform missing sites
vcf.t <- sub("\\./\\.", "N", vcf.t, perl=TRUE)
# transform heterozygous sites
vcf.t <- sub("[A/T/G/C]/[A/T/G/C]", ".", vcf.t, perl=TRUE)
# get positions
pos <- system(paste("grep -v '#'", "file.vcf", "| awk '{print$2}'", sep=" "),
              intern=T)
rownames(vcf.t) <- pos
write.table(vcf.t, "chr12", quote=F, sep=",", col.names=F)

After this conversion, I ran the script for the selection analysis and it threw me an error. Later, I checked the converted file and it looks like this:

8304,C|T,N,C,N,T|T,T|T,C,C|T,C,C,C|T,C,C,C,C,C,C,N,C,N,N,C,C,C,C,N,C,N,C,N,C,C,N,N,N,C,C,N,C,C,N,C,C,N,C,N,N,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,C,N,C,C,C,C,C,C,C,C,C,C,C,C,C,N,C,N,N,C,N,N,N,N,N,C,N,C,C,C,C,C,C,C,N,C,C
8325,C,N,C,C,C,C,C,C,T|T,N,C,N,N,N,N,C,C,N,C,N,N,C,C,C,C,N,C,C,C,C,C,.,C,N,N,C,C|T,N,N,N,N,C,C,N,C,N,N,N,N,T,N,N,C,N,C,C,C,N,C,C,C,C,C,N,C,C,C,.,.,.,C,C,C,N,C,C,C,N,C,C,N,C,N,N,N,N,N,C,N,C,C,N,C,C,C,C,N,C,C

The problem here is the | symbol for the heterozygous SNPs. I have transformed those in my R-script but still, some persist. The selection scan script does not recognize this | symbol and throws an error.

Can you please suggest to me how to correct these | symbols and obtain a clean output? My VCF file is unphased.

Thank you.

SNP bash VCF R genome • 1.4k views

ADD COMMENT • link 20 months ago by anikcropscience ▴ 230

1

Entering edit mode

Switch to gsub from sub where appropriate. Also, what is this regex supposed to do - it seems unnecessarily convoluted: "[A/T/G/C]/[A/T/G/C]"

Following up, I really think you should be doing this using bcftools and other shell utilities, not R. Even with R, your code is quite inefficient. If you're using regex, do it efficiently: sub("([ATGC])[/]\\1)", "\\1", vcf.t, perl=TRUE) or sub("A/A", "A", vcf.t, fixed=TRUE); sub("G/G", "G", vcf.t, fixed=TRUE);.... Don't use perl regex unless you need it, especially when you're replacing fixed strings.

On second thought, stop using R and use grep -E/grep -F on the shell.

ADD REPLY • link 21 months ago by Ram 43k

0

Entering edit mode

Ok, thank you very much for your comments and suggestions. I will do it accordingly and see if it works.

ADD REPLY • link 21 months ago by anikcropscience ▴ 230

0

Entering edit mode

Hi Ram, I ran this code as you mentioned. But when I use the sub("([ATGC])[/]\\1)", "\\1", vcf.t, perl=TRUE) I get an error mentioning incorrect use of Regular Expression. Moreover, the same problem still persists where I have quite a few | in the output file. Although I have found another solution using a bash script to convert the data file, it would be nice to get a solution for this code.

ADD REPLY • link 20 months ago by anikcropscience ▴ 230

1

Entering edit mode

Oops, I made a mistake with the regex (easy enough to correct if the error message was read properly - there's an extra ) at the end of the regex. See corrected and tested expression below:

> gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl=TRUE)
Error in gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl = TRUE) : 
  invalid regular expression '([ATGC])[/]\1)'
In addition: Warning message:
In gsub("([ATGC])[/]\\1)", "\\1", "A/A", perl = TRUE) :
  PCRE pattern compilation error
    'unmatched closing parenthesis'
    at ')'

> sub("([ATGC])[/]\\1", "\\1", "A/A", perl=TRUE)
[1] "A"
> sub("([ATGC])[/]\\1", "\\1", c("A/A", "G/G", "A/G"), perl=TRUE)
[1] "A"   "G"   "A/G"

ADD REPLY • link 20 months ago by Ram 43k

0

Entering edit mode

Oh, I am embarrassed right now. Sorry for overlooking this tiny issue. However, now the original problem escalated. Previously, I just had the pipe | symbol in my output file. But after using vcf.t <- gsub("([ATGC])[/]\\1", "\\1", vcf.t, perl=TRUE) I have both the pipe | and / in the output file.

ADD REPLY • link 20 months ago by anikcropscience ▴ 230

0

Entering edit mode

I think there's a mix up between your requirements and your expectations of the code. The regex we worked on will only reduce the 4 lines in your "transform homozygous sites" part of the code to a single line. It does not address missing GTs or heterozygous sites, let alone multi-allelic sites.

The | character is quite odd. Are you sure your VCF is unphased? Can you check if there are any phased GT calls and also if you have any multi-allelic sites?

Also, don't use gsub in the regex above, use sub. You know you only want to replace the first [ATGC]/[ATGC] with [ATGC], using gsub might result in bugs where the data is unpredictable (multi-allelic sites, for example).

ADD REPLY • link 20 months ago by Ram 43k

0

Entering edit mode

Hi Ram, finally it worked. Indeed, there were some phased genotypes in my VCF. So, I converted all the | into /. Then the code worked fine. Thanks a lot for your solutions.

ADD REPLY • link 20 months ago by anikcropscience ▴ 230