Question: How to convert VCF to CSV?
b.ambrozio wrote (11 weeks ago):

How can I convert VCF to CSV, so that I can use it in a classification model?

I'm trying to convert the 1000 Genomes phase 3 data to CSV using plink, but without success: I get the error "Error: --export AD header line too long (>2GiB)." Here are the details:

$ du -cah
 14G    ./ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

$ ./plink2 --recode AD include-alt --vcf ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz --out 1kGp3
PLINK v2.00a2.3 64-bit (24 Jan 2020)
Options in effect:
  --export AD include-alt
  --out 1kGp3
  --vcf ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Hostname: Brunos-MacBook-Pro.local
Working directory: /Users/bambrozi/Documents/1kGp3
Start time: Sun Mar 15 18:08:45 2020

Random number seed: 1584295725
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 81271745 variants scanned.
--vcf: 1kGp3-temporary.pgen + 1kGp3-temporary.pvar + 1kGp3-temporary.psam
written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
1kGp3-temporary.psam.
81271745 variants loaded from 1kGp3-temporary.pvar.
Note: No phenotype data present.
Error: --export AD header line too long (>2GiB).

End time: Sun Mar 15 18:28:36 2020

The ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz contains the 22 chromosomes concatenated.

I need it to train a classification model...

Thanks for the help!


This tool should help. A simple Google search reveals lots of answers, including some on Biostars.

written 11 weeks ago by Mensur Dlakic
chrchang523 wrote (11 weeks ago):

Is the “AD” format (>162 million columns) really what you want here?! “A-transpose” is a much saner choice in this context.
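
For scale: AD writes two columns per variant (additive and dominance), so 81271745 variants give roughly 162.5 million columns on a single header line. The A-transpose layout instead writes one row per variant and one column per sample. Below is a minimal sketch of reading such a file with only the Python standard library. It assumes the .traw layout described in the plink documentation (six leading columns CHR, SNP, (C)M, POS, COUNTED, ALT, then one column per sample, with "NA" for missing calls) and integer hardcalls rather than dosages. The function name `load_traw`, the sample IDs, and the toy data are made up for illustration.

```python
import csv
import io
from array import array

META_COLS = 6  # CHR, SNP, (C)M, POS, COUNTED, ALT precede the sample columns


def load_traw(lines, missing=-1):
    """Read a .traw stream into per-sample genotype arrays.

    Returns (sample_ids, variant_ids, genotypes), where genotypes[i] is an
    array of signed bytes holding the allele counts of sample i, one entry
    per variant. Missing calls ("NA") are stored as `missing`.
    """
    reader = csv.reader(lines, delimiter="\t")
    header = next(reader)
    sample_ids = header[META_COLS:]
    genotypes = [array("b") for _ in sample_ids]  # 1 byte per genotype
    variant_ids = []
    for row in reader:
        if not row:
            continue  # skip blank lines
        variant_ids.append(row[1])
        for column, value in zip(genotypes, row[META_COLS:]):
            column.append(missing if value == "NA" else int(value))
    return sample_ids, variant_ids, genotypes


# Toy input standing in for a real .traw file (hypothetical IDs and values).
demo = io.StringIO(
    "CHR\tSNP\t(C)M\tPOS\tCOUNTED\tALT\tS1_S1\tS2_S2\n"
    "1\trs1\t0\t100\tA\tG\t0\t2\n"
    "1\trs2\t0\t200\tC\tT\tNA\t1\n"
)
samples, variants, X = load_traw(demo)
```

Each entry of `X` is one sample's feature vector, which is the samples-by-variants orientation most classifiers expect. For a real file, pass an open file handle (for example `open(path, newline="")`) instead of the `StringIO` object.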


Well, I guess not. I've actually reduced the scope to 3 chromosomes and removed the "D", so I managed to generate the file. But now I'm having trouble loading it and working with it in pandas, PySpark, etc. I think I have to change strategy and somehow run my classification models straight from the VCFs instead.
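
One way to work straight from a VCF is to stream it line by line and turn each GT field into a count of non-reference alleles, so the full matrix never has to sit in memory at once. The following is only a sketch under stated assumptions: a tab-delimited VCF with a FORMAT column containing GT, missing alleles written as ".", and all ALT alleles at a multiallelic site lumped together (which differs from plink's per-allele counting). The function name `alt_allele_counts` and the toy records are hypothetical.

```python
import io


def alt_allele_counts(vcf_lines):
    """Yield (chrom, pos, variant_id, counts) for each VCF data line.

    counts holds one integer per sample: the number of non-reference
    alleles in the GT field, or -1 when any allele is missing (".").
    """
    for line in vcf_lines:
        if line.startswith("#"):
            continue  # skip meta-information and the column header
        fields = line.rstrip("\n").split("\t")
        chrom, pos, variant_id = fields[0], int(fields[1]), fields[2]
        gt_index = fields[8].split(":").index("GT")
        counts = []
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            # Treat phased ("|") and unphased ("/") separators alike.
            alleles = genotype.replace("|", "/").split("/")
            if "." in alleles:
                counts.append(-1)
            else:
                counts.append(sum(allele != "0" for allele in alleles))
        yield chrom, pos, variant_id, counts


# Toy records standing in for a real VCF (hypothetical IDs and genotypes).
demo = io.StringIO(
    "##fileformat=VCFv4.1\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n"
    "1\t100\trs1\tA\tG\t100\tPASS\t.\tGT\t1|0\t0|1\n"
    "1\t200\trs2\tC\tT\t100\tPASS\t.\tGT\t0|0\t.|.\n"
)
rows = list(alt_allele_counts(demo))
```

For a compressed file, the same generator can be fed from `gzip.open(path, "rt")`. Because it yields one variant at a time, the rows can be filtered or written out in batches before any model sees them.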

written 11 weeks ago by b.ambrozio