Question: How to get a VCF file into data matrix form for machine learning? (new to VCF files)
jespinoz20 wrote:

Right now I am running HISAT2 against the Homo sapiens hg38 SNP database from ftp://ftp.ccb.jhu.edu/pub/infphilo/hisat2/data/grch38_snp.tar.gz. This will produce 88 individual *.sam files (one for each of my 88 samples), which I will then use to create VCF files.

I want to get these VCF files into a form that I can use in some of my downstream pipelines. My question is: how can I get these VCF files into an (n = samples, m = SNPs) data matrix, preferably in Python or with vcftools, though I am open to other tools or to writing my own method? I have seen the term "genotyping matrix" in my Google searches; is this what I am trying to create? Apologies if this question is naive. I planned to build my own using pandas in Python but did not want to reinvent the wheel if a solution already exists.

I'm using Python 3.6.1 on OSX.

Written 2.1 years ago by jespinoz

See the post "Extracting Genotype Information From Vcf".
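For a sense of what the target matrix looks like, here is a minimal sketch in plain Python (standard library only). It rests on several assumptions that are not in the question: the 88 per-sample files have already been combined into one multi-sample VCF (for example with a tool such as bcftools merge), each genotype is encoded as the number of non-reference alleles (0, 1 or 2 for diploid calls), and missing calls become None. The names `parse_gt` and `vcf_to_matrix` are made up for illustration and do not come from any library.

```python
import io


def parse_gt(sample_field, gt_index=0):
    """Return the non-reference allele count for one sample field, or None if missing."""
    gt = sample_field.split(":")[gt_index]
    # Treat phased (|) and unphased (/) genotypes the same way.
    alleles = gt.replace("|", "/").split("/")
    if any(a in (".", "") for a in alleles):
        return None
    return sum(1 for a in alleles if a != "0")


def vcf_to_matrix(handle):
    """Read a multi-sample VCF and return (samples, snp_ids, matrix).

    matrix[i][j] is the encoded genotype of sample i at SNP j,
    so its shape is (n samples, m SNPs).
    """
    samples, snp_ids, columns = [], [], []
    for line in handle:
        line = line.rstrip("\n")
        if not line or line.startswith("##"):
            continue  # skip blank lines and meta-information lines
        fields = line.split("\t")
        if line.startswith("#CHROM"):
            samples = fields[9:]  # sample names follow the FORMAT column
            continue
        chrom, pos, _id, ref, alt = fields[:5]
        fmt = fields[8].split(":")
        if "GT" not in fmt:
            continue  # no genotype information on this record
        gt_index = fmt.index("GT")
        snp_ids.append(f"{chrom}:{pos}:{ref}>{alt}")
        columns.append([parse_gt(s, gt_index) for s in fields[9:]])
    # Records are read SNP by SNP; transpose so that rows are samples.
    matrix = [list(row) for row in zip(*columns)]
    return samples, snp_ids, matrix


# Tiny made-up VCF (three samples, two SNPs), purely for demonstration.
DEMO_VCF = "\n".join([
    "##fileformat=VCFv4.2",
    "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER",
               "INFO", "FORMAT", "S1", "S2", "S3"]),
    "\t".join(["chr1", "100", ".", "A", "G", "50", "PASS", ".",
               "GT:DP", "0/0:10", "0/1:12", "1/1:9"]),
    "\t".join(["chr1", "200", "rs1", "C", "T", "60", "PASS", ".",
               "GT", "0|1", "./.", "0/0"]),
])

samples, snp_ids, matrix = vcf_to_matrix(io.StringIO(DEMO_VCF))
for name, row in zip(samples, matrix):
    print(name, row)
```

The resulting list of lists can be handed to pandas (samples as the index, SNP ids as the columns) if that is more convenient downstream. Missing calls would still need imputing or filtering before most machine-learning methods will accept the matrix.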

Reply written 2.1 years ago by Jeremy Leipzig