I have a 1.1 TB uncompress VCF file generated from Ensembl hg38 build 84 ftp://ftp.ensembl.org/pub/release-84/fasta/homo_sapiens/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa.gz with
Originally I had 88 samples (individual vcf files) that I merged and filtered with the following command to remove indels and only to get SNPs with both alleles:
bcftools merge -l list_of_vcfs.txt | bcftools view -I -m 2 > snponly.out.vcf
I realized that my current contains scaffolds that aren't a part of the chromosome that I'm thinking about removing but that is besides the point (just wanted to mention that in case anyone is wondering about the size of the VCF file).
How can I convert the VCF to a datamatrix with samples as rows and columns as attributes?
I tried using plink
plink --recode --vcf snponly.out.vcf --out snponly.out
to get a matrix in the form of:
ID snp1_a1 snp2_a2 snp2_a1 snp2_a2 ... snpN_a1 snpN_a2
but it didn't like the "_" in my IDs which looks like the following:
and I believe it also said the file was too big
$ cat plink_vcf.e8514098
Warning: 2939537867 variant records had no GT field. Error: PLINK does not support more than 2^31 - 3 variants. We recommend other software, such as PLINK/SEQ, for very deep studies of small numbers of genomes.
Is plinkseq the way to go? If so, what specific command do I run to convert the VCF into a sample/attribute data matrix?
Is it possible that I'm lacking important fields? If so, how do I get them onto my VCF file?