Question

How to convert vcf file into data matrix for machine-learning?

0

Entering edit mode

7.0 years ago

jespinoz ▴ 20

I have a 1.1 TB uncompress VCF file generated from Ensembl hg38 build 84 ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz with HISAT2 and samtools.

Originally I had 88 samples (individual vcf files) that I merged and filtered with the following command to remove indels and only to get SNPs with both alleles:

bcftools merge -l list_of_vcfs.txt | bcftools view -I -m 2 > snponly.out.vcf

I realized that my current contains scaffolds that aren't a part of the chromosome that I'm thinking about removing but that is besides the point (just wanted to mention that in case anyone is wondering about the size of the VCF file).

How can I convert the VCF to a datamatrix with samples as rows and columns as attributes?

I tried using plink

plink --recode --vcf snponly.out.vcf --out snponly.out

to get a matrix in the form of: ID snp1_a1 snp2_a2 snp2_a1 snp2_a2 ... snpN_a1 snpN_a2

but it didn't like the "_" in my IDs which looks like the following: ./sorted_bam_files/1054.1_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted and I believe it also said the file was too big

$ cat plink_vcf.e8514098

Warning: 2939537867 variant records had no GT field. Error: PLINK does not support more than 2^31 - 3 variants. We recommend other software, such as PLINK/SEQ, for very deep studies of small numbers of genomes.

Is plinkseq the way to go? If so, what specific command do I run to convert the VCF into a sample/attribute data matrix?

Is it possible that I'm lacking important fields? If so, how do I get them onto my VCF file?
enter image description here

vcf bcf vcftools genotype machine-learning • 5.2k views

ADD COMMENT • link 6.9 years ago by jespinoz ▴ 20

score 2 · Answer 1 · 2017-05-23

2

Entering edit mode

7.0 years ago

John Ma ▴ 310

Use GATK's VariantsToTable, which can just output a single field per sample.

ADD COMMENT • link 7.0 years ago by John Ma ▴ 310

0

Entering edit mode

Which parameters do I use to get something analagous to ID snp1_a1 snp2_a2 snp2_a1 snp2_a2 ... snpN_a1 snpN_a2? Also, I couldn't get this tool to work $ java -jar GenomeAnalysisTK.jar --help Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 52.0

ADD REPLY • link 7.0 years ago by jespinoz ▴ 20

0

Entering edit mode

Do you just want to output the ref and the first-alt alleles, or something else? As for your error, check the JDK version and the GATK versions.

ADD REPLY • link 7.0 years ago by John Ma ▴ 310

0

Entering edit mode

I got GATK to run. I am trying to convert it into a genotype matrix.

ADD REPLY • link 6.9 years ago by jespinoz ▴ 20

0

Entering edit mode

java -jar GenomeAnalysisTK.jar -T VariantsToTable -V [VCF file] -R [reference genome with fai and dict indices] -SMA -F ID -F REF -F ALT -o [output file]

The -SMA here stands for "split Multi-allelic" to ensure every alt field only has 1 allele. If you're sure every variation in your file only has one alt, you can skip it.

Quoting GATK's tooldocs:

By default, records with multiple ALT alleles will comprise just one line of output; note that in general this can make your resulting file unreadable/malformed for certain tools like R, as the representation of multi-allelic INFO field values are often comma-separated lists of values. Using the flag will cause multi-allelic records to be split into multiple lines of output (one for each allele in the ALT field); INFO field values that are not lists are copied for each of the output records while only the appropriate entry is used for lists.

ADD REPLY • link 6.9 years ago by John Ma ▴ 310