Question: SnpMatrix from VCF file
0
gravatar for Dhana
21 months ago by
Dhana70
Helsinki, Finland
Dhana70 wrote:

Hi,

I am planning on doing eQTL analysis using the MatrixEQTL R (MatrixEQTL link) package and I need to extract genotype information and create a SnpMatrix file for that like the one given here SNP.txt file link. I have tried vcftools to create a similar looking file but to no avail, I have tried

  • vcf-to-tab < genotypes.vcf > requiredFormat.txt

  • vcftools --vcf genotypes.vcf --extract-FORMAT-info GT --out requiredFormat.txt

Is there any other way to extract genotype information from the VCF file and convert it to that format?

Thanks

Description of the SnpMatrix file - In VCF files, 0 represents the reference allele and integers greater than 0 represent the alternate alleles (i.e., 2, 3, 4 would indicate the 2nd, 3rd or 4th allele in the ALT field for a particular variant). This function only supports variants with a single alternate allele and therefore the alternate values will always be 1. Genotypes are stored in the SnpMatrix as 0, 1, 2 or 3 where 0 = missing, 1 = "0/0", 2 = "0/1" or "1/0" and 3 = "1/1". In SnpMatrix terminology, "A" is the reference allele and "B" is the risk allele. Equivalent statements to those made with 0 and 1 allele values would be 0 = missing, 1 = "A/A", 2 = "A/B" or "B/A" and 3 = "B/B".

ADD COMMENTlink modified 9 months ago by Leandro Lima900 • written 21 months ago by Dhana70

Hi, I have the same problem - I am attempting to use a vcf file to create a snp matrix formatted for analysis with MatrixEQTL.

I was not able to follow the solution below and I am not familiar with javascript. Could further details be provided for this solution?

Are there complimentary approaches using plink, vcftools, python, R or some combination thereof? Thanks

ADD REPLYlink written 14 months ago by wa3j0

I was not able to follow the solution below

what's wrong ? https://meta.stackexchange.com/questions/147616/

ADD REPLYlink written 14 months ago by Pierre Lindenbaum115k

I am not familiar with using javascript and am unclear as to how to implement the script posted. Here is a solution in R that I believe may be simpler:

library(dplyr)
library(VariantAnnotation)
library(snpStats)
row_na_cut = 0.2
fname = "sub_maf_filtered.vcf.gz"
tab <- TabixFile(fname, yieldSize=500000)
param <- ScanVcfParam(fixed="ALT", geno=c("GT", "GL"))
open(tab)
geno_list = c()
while (nrow(vcf_yield <- readVcf(tab, "hg19", param=param)))
  mat <- genotypeToSnpMatrix(vcf_yield)
  nums0 = t(as(mat$genotype, "numeric")) %>% as.data.frame
  row_rem_NA = nums0[rowSumsis.na(nums0)) < row_na_cut*ncol(nums0),]
  if(nrow(row_rem_NA)*ncol(row_rem_NA) > 0){geno_list = rbind( geno_list, row_rem_NA )
}
close(tab)
ADD REPLYlink modified 14 months ago • written 14 months ago by wa3j0

the interpreter is a java program named bioalcidae. And it is invoked using

java -jar dist/bioalcidae.jar -f your_script.js input.vcf
ADD REPLYlink written 14 months ago by Pierre Lindenbaum115k
4
gravatar for Pierre Lindenbaum
21 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum115k wrote:

using bioalcidae and the following script: http://lindenb.github.io/jvarkit/BioAlcidae.html

var snp_id=0;
var samples = header.getSampleNamesInOrder();
out.print("id");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t"+samples.get(i));
    }
out.println();

while(iter.hasNext()) {
 var ctx= iter.next();
 ++snp_id;
 out.print(new java.lang.Integer(snp_id));
 for(var i=0;i< samples.size();++i)
    {
    out.print("\t");
    var g= ctx.getGenotype(samples.get(i));
    if(g.isHomRef())
        {
        out.print("1");
        }
    else if(g.isHet())
        {
        out.print("2");
        }
    else if(g.isHomVar())
        {
        out.print("3");
        }
    else 
        {
        out.print("0");
        }
    }

 out.println();
 }
ADD COMMENTlink modified 14 months ago • written 21 months ago by Pierre Lindenbaum115k

Thank you very much Pierre. This was the solution I was looking for.

ADD REPLYlink modified 21 months ago • written 21 months ago by Dhana70

Hi Pierre, I used the Bioalcidae tool and the provided script for generating the matrix, I wanted to ask is it possible to replace the first field (ID's) of the resulting matrix to the respective variant ID's from the vcf file? or is using the tools such as cut or awk to replace the ids is the only option.

ADD REPLYlink written 21 months ago by Dhana70
 out.print( ctx.hasID()?ctx.getID():".");
ADD REPLYlink written 21 months ago by Pierre Lindenbaum115k
var snp_id=0;
var samples = header.getSampleNamesInOrder();
out.print("id");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t"+samples.get(i));
    }
out.println();

while(iter.hasNext()) {
var ctx= iter.next();
++snp_id;
out.print( ctx.hasID()?ctx.getID():".");
for(var i=0;i< samples.size();++i)
    {
    out.print("\t");
    var g= ctx.getGenotype(samples.get(i));
    if(g.isHomRef())
        {
        out.print("1");
        }
    else if(g.isHet())
        {
        out.print("2");
        }
    else if(g.isHomVar())
        {
        out.print("3");
        }
    else 
        {
        out.print("0");
        }
    }

  out.println();
 }
  

For others who might need it. Thanks Pierre.

ADD REPLYlink modified 21 months ago • written 21 months ago by Dhana70

Hi@Pierre Lindenbaum. I am also facing with the same problem as @Dhana. I have a filtered vcf file and I want to create something exactly similar to SNP.txt file that @Dhana showed above. Does this script has been created for this purpose? I could not get very well what is the input format of @Dhana's file!

ADD REPLYlink written 15 months ago by Ana120
1
gravatar for Leandro Lima
9 months ago by
Leandro Lima900
San Francisco, CA
Leandro Lima900 wrote:

This should also work (I'm assuming that everything that is not 0/0, 0/1 or 1/1 should be considered as 0):

vcf=genotypes.vcf

vcftools --vcf $vcf --extract-FORMAT-info GT --out requiredFormat

# Creating SNP.txt
awk '{$1=$1":"$2; $2=""; print $0}' requiredFormat.GT.FORMAT | perl -pe 's!0/0!1!g; s!0/1!2!g; s!1/1!2!g; s!./.!0!g; s/CHROM:POS/id/g' > SNP.txt

# Creating snpsloc.txt
echo -e "snp\tchr\tpos" > snpsloc.txt
awk 'NR > 1 {id=$1":"$2; print id"\tchr"$1"\t"$2}' requiredFormat.GT.FORMAT  >> snpsloc.txt
ADD COMMENTlink written 9 months ago by Leandro Lima900
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1026 users visited in the last hour