Question: How to convert vcf file into data matrix for machine-learning?
0
gravatar for jespinoz
23 months ago by
jespinoz20
jespinoz20 wrote:

I have a 1.1 TB uncompress VCF file generated from Ensembl hg38 build 84 ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz with HISAT2 and samtools.

Originally I had 88 samples (individual vcf files) that I merged and filtered with the following command to remove indels and only to get SNPs with both alleles:

bcftools merge -l list_of_vcfs.txt | bcftools view -I -m 2 > snponly.out.vcf

I realized that my current contains scaffolds that aren't a part of the chromosome that I'm thinking about removing but that is besides the point (just wanted to mention that in case anyone is wondering about the size of the VCF file).

How can I convert the VCF to a datamatrix with samples as rows and columns as attributes?

I tried using plink

plink --recode --vcf snponly.out.vcf --out snponly.out

to get a matrix in the form of: ID snp1_a1 snp2_a2 snp2_a1 snp2_a2 ... snpN_a1 snpN_a2

but it didn't like the "_" in my IDs which looks like the following: ./sorted_bam_files/1054.1_RD1.kneaddata.paired.human.bowtie2.R1-R2.sam.bam.sorted and I believe it also said the file was too big

$ cat plink_vcf.e8514098

Warning: 2939537867 variant records had no GT field. Error: PLINK does not support more than 2^31 - 3 variants. We recommend other software, such as PLINK/SEQ, for very deep studies of small numbers of genomes.

Is plinkseq the way to go? If so, what specific command do I run to convert the VCF into a sample/attribute data matrix?

Is it possible that I'm lacking important fields? If so, how do I get them onto my VCF file?
enter image description here

ADD COMMENTlink modified 22 months ago • written 23 months ago by jespinoz20
2
gravatar for MMa
23 months ago by
MMa260
MMa260 wrote:

Use GATK's VariantsToTable, which can just output a single field per sample.

ADD COMMENTlink written 23 months ago by MMa260

Which parameters do I use to get something analagous to ID snp1_a1 snp2_a2 snp2_a1 snp2_a2 ... snpN_a1 snpN_a2? Also, I couldn't get this tool to work $ java -jar GenomeAnalysisTK.jar --help Exception in thread "main" java.lang.UnsupportedClassVersionError: org/broadinstitute/gatk/engine/CommandLineGATK : Unsupported major.minor version 52.0

ADD REPLYlink modified 23 months ago • written 23 months ago by jespinoz20

Do you just want to output the ref and the first-alt alleles, or something else? As for your error, check the JDK version and the GATK versions.

ADD REPLYlink written 23 months ago by MMa260

I got GATK to run. I am trying to convert it into a genotype matrix.

ADD REPLYlink written 23 months ago by jespinoz20

java -jar GenomeAnalysisTK.jar -T VariantsToTable -V [VCF file] -R [reference genome with fai and dict indices] -SMA -F ID -F REF -F ALT -o [output file]

The -SMA here stands for "split Multi-allelic" to ensure every alt field only has 1 allele. If you're sure every variation in your file only has one alt, you can skip it.

Quoting GATK's tooldocs:

By default, records with multiple ALT alleles will comprise just one line of output; note that in general this can make your resulting file unreadable/malformed for certain tools like R, as the representation of multi-allelic INFO field values are often comma-separated lists of values. Using the flag will cause multi-allelic records to be split into multiple lines of output (one for each allele in the ALT field); INFO field values that are not lists are copied for each of the output records while only the appropriate entry is used for lists.

ADD REPLYlink written 22 months ago by MMa260
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1501 users visited in the last hour