Question

How to convert vcf to 23andme format

1

6.6 years ago

alec_djinn ▴ 380

I have a vcf file (format VCFv4.0), generated by GATK pipeline starting from Illumina reads.

I need to convert it to 23andme file format. Example of the 23andme format:

# rsid  chromosome  position    genotype
rs4477212   1   82154   TT
rs3094315   1   752566  TC
rs3131972   1   752721  AA
rs12124819  1   776546  AC

I am having problems with plink2: it reports that --recode 23 cannot be used with multi-char alleles. Plink was recommended earlier in this post: "C: Conerting vcf to 23andMe format".

I then tried to modify the vcf to remove multi-char alleles using VcfMultiToOneAllele, which did a great job, but plink2 did not recognise the output as a vcf even though it looks like one; it reported "no genotype data in .vcf file". Is any other tool up to the task?

Thanks for any help.

genome sequencing SNP • 15k views

ADD COMMENT • link updated 21 months ago by GenoMax 141k • written 6.6 years ago by alec_djinn ▴ 380

2

6.6 years ago

Philipp Bayer 8.3k

That doesn't look like a vcf file to me - 'multi-char alleles' appear when you have more than one alternative allele, which should be impossible if it's for a single human like 23andMe files are. Are you sure your example is your vcf file?

It should look like:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=q10,Description="Quality below 10">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
#CHROM POS     ID        REF ALT    QUAL FILTER INFO                              FORMAT      NA00001        NA00002        NA00003
20     14370   rs6054257 G      A       29   PASS   NS=3;DP=14;AF=0.5;DB;H2           GT:GQ:DP:HQ 0|0:48:1:51,51 1|0:48:8:51,51 1/1:43:5:.,.
20     17330   .         T      A       3    q10    NS=3;DP=11;AF=0.017               GT:GQ:DP:HQ 0|0:49:3:58,50 0|1:3:5:65,3   0/0:41:3
20     1110696 rs6040355 A      G,T     67   PASS   NS=2;DP=10;AF=0.333,0.667;AA=T;DB GT:GQ:DP:HQ 1|2:21:6:23,27 2|1:2:0:18,2   2/2:35:4
20     1230237 .         T      .       47   PASS   NS=3;DP=13;AA=T                   GT:GQ:DP:HQ 0|0:54:7:56,60 0|0:48:4:51,51 0/0:61:2
20     1234567 microsat1 GTCT   G,GTACT 50   PASS   NS=3;DP=9;AA=G                    GT:GQ:DP    0/1:35:4       0/2:17:2       1/1:40:3

ADD COMMENT • link 6.6 years ago by Philipp Bayer 8.3k

0

The example that I posted was an example of the 23andMe format, to which I want to convert my vcf file.

This is a part of my vcf file that was recognised by plink2 as containing multi-char allele:

  #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  AM1
    1       814264  .       C       A       1121    PASS    BRF=0.27;FR=1.0000;HP=1;HapScore=1;MGOF=58;MMLQ=29;MQ=42.01;NF=19;NR=10;PP=1066
    1       814297  .       TGCT    ACTA    256     alleleBias      BRF=0.23;FR=0.5000;HP=1;HapScore=1;MGOF=71;MMLQ=25;MQ=45.1;NF=3;NR=0;P
    1       814371  .       GTGTT   C       1093    PASS    BRF=0.2;FR=0.4000;HP=2;HapScore=2;MGOF=47;MMLQ=28;MQ=41.43;NF=17;NR=7;PP=993;Q

ADD REPLY • link 6.6 years ago by alec_djinn ▴ 380

1

Aah I see - have you tried removing all lines where there are indels, i.e., where the ALT field has more than one letter (here: ACTA)? I don't think 23andMe has those
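
A minimal Python sketch of that filtering step, assuming a tab-delimited VCF read line by line. The function names are hypothetical, and records whose ALT is "." (no variant) are dropped as well, which may or may not be what you want:

```python
def is_snv_record(vcf_line):
    """Return True if REF and every ALT allele of a VCF data line is a single base."""
    fields = vcf_line.rstrip("\n").split("\t")
    ref, alt = fields[3].upper(), fields[4].upper()
    bases = {"A", "C", "G", "T"}
    # ALT may list several alleles separated by commas; all must be single bases
    return ref in bases and all(a in bases for a in alt.split(","))


def drop_multichar_records(lines):
    """Yield header lines unchanged and data lines only when they are SNVs."""
    for line in lines:
        if line.startswith("#") or is_snv_record(line):
            yield line
```

This can be applied to an open file handle and the surviving lines written to a new VCF.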

ADD REPLY • link 6.6 years ago by Philipp Bayer 8.3k

0

I actually want that info as well.

ADD REPLY • link 6.6 years ago by alec_djinn ▴ 380

2

6.6 years ago

Shab86 ▴ 310

There are a couple of ways to convert a 23andMe dataset to VCF:

Download the 23andMe dataset as a tab-delimited file with just these columns: marker ID, chromosome name, position, and genotype. Then use bcftools to convert that TSV file to VCF:

bcftools convert --tsv2vcf input.gz -f ref.fa -s SampleName -Ov -o out.vcf

Another option is to use a script, for example this one on GitHub: https://github.com/arrogantrobot/23andme2vcf

Also, it seems you have multiallelic sites in the 23andMe dataset. Many tools don't handle these well; one convention is to discard them or to split them into biallelic records. bcftools can do the splitting:

bcftools norm -m - input.vcf -o out.vcf

Finally, are you analyzing population level data? If not, why do you have multi-character alleles?

ADD COMMENT • link 6.6 years ago by Shab86 ▴ 310

0

Thank you for your comment, but I actually need to convert it the other way around. I have a vcf file generated by a GATK pipeline starting from Illumina reads, and I need to convert it to the 23andMe file format shown above.

ADD REPLY • link 6.6 years ago by alec_djinn ▴ 380

0

I have also tried the script 23andme2vcf; it generates the 23andMe file format, but the last column (genotype) is empty :(

ADD REPLY • link 6.6 years ago by alec_djinn ▴ 380

0

Ahh, my mistake in interpreting it the other way around. Have you tried this: https://github.com/2sh/vcf-to-23andme

ADD REPLY • link 6.5 years ago by Shab86 ▴ 310

1

6.6 years ago

maria.vazquez ▴ 30

Hi, at Gencove we just launched an open and free API with tools that allow users to upload almost any type of DNA file (23andMe, Ancestry, FTDNA, etc). Feel free to test it as a user too. We give back a vcf as well.

www.gencove.com/researchers

Let me know if you have any questions.

ADD COMMENT • link 6.6 years ago by maria.vazquez ▴ 30

1

OP is asking for a conversion from VCF to 23andMe format. Can your tool do this?

ADD REPLY • link 6.6 years ago by GenoMax 141k

1

6.6 years ago

chrchang523 10k

There are two problems here.

1. The 23andMe format does not support multi-character alleles; you must reorganize your data so that none of these remain. Split length-preserving multi-nucleotide variants into a bunch of single-nucleotide variants. (As for length-changing variants, 23andMe has historically represented some common insertions with "I", some common deletions with "D", and thrown out everything else. This requires you to write a script to postprocess the VCF file, and is unlikely to be worth the trouble.)
2. The example data you posted is missing the rightmost two columns ("FORMAT" and the actual sample data). Assuming they exist and just failed to be copy/pasted, the errors reported by plink and other programs imply that there is no "GT" field at the beginning of the "FORMAT" column; that's the standard way of representing the actual data you want to convert to 23andMe format. You need to figure out how to add a sufficiently accurate GT field to your VCF.
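
To make the two points concrete, here is a rough Python sketch, not a tested converter. It assumes a tab-delimited VCF data line with a GT entry in FORMAT, reads the genotype of one sample, and splits length-preserving multi-nucleotide variants into one row per base. The function name, the `sample_index` parameter, and the choice to write "." as the ID of split positions are all my own assumptions:

```python
def vcf_line_to_23andme(vcf_line, sample_index=0):
    """Turn one VCF data line into a list of (rsid, chrom, pos, genotype) rows."""
    fields = vcf_line.rstrip("\n").split("\t")
    chrom, pos, rsid, ref, alt = fields[:5]
    fmt = fields[8].split(":")
    if "GT" not in fmt:
        raise ValueError("FORMAT column has no GT field")
    sample = fields[9 + sample_index].split(":")
    # Treat phased (|) and unphased (/) genotypes the same way
    calls = sample[fmt.index("GT")].replace("|", "/").split("/")
    if "." in calls:
        return []  # missing genotype call
    alleles = [ref] + alt.split(",")
    called = [alleles[int(c)].upper() for c in calls]
    # Keep only length-preserving alleles made of plain bases (SNVs and MNVs)
    if any(len(a) != len(ref) or set(a) - set("ACGT") for a in called):
        return []
    rows = []
    for offset in range(len(ref)):
        genotype = "".join(a[offset] for a in called)
        # Assumption: the original ID only describes an unsplit single-base record
        name = rsid if len(ref) == 1 else "."
        rows.append((name, chrom, int(pos) + offset, genotype))
    return rows
```

Length-changing variants simply return an empty list here, so they are dropped rather than encoded as "I" or "D".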

ADD COMMENT • link 6.6 years ago by chrchang523 10k

0

Yes, it was a copy-paste error, now fixed. Thank you for your suggestion. I would have preferred a ready-to-use tool. If there isn't one, I am going to code it myself.

ADD REPLY • link 6.6 years ago by alec_djinn ▴ 380

score 3 · Accepted Answer · 2017-09-28

3

6.6 years ago

alec_djinn ▴ 380

OK, it seems I have solved it using:

plink2 --vcf [vcf file] --snps-only --recode 23

Now I am thinking about how to include single-base deletions and insertions in the output file, because those are missing.
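
One possible way to add those, sketched in Python under a loud assumption: the longer allele of a single-base indel is written as "I" and the shorter one as "D", following the I/D letters mentioned in the answer above. Real 23andMe files only use these letters at specific known sites, so this mapping and the function name are purely illustrative:

```python
def indel_genotype(ref, alt, gt):
    """Encode a biallelic single-base insertion or deletion as I/D letters.

    Returns None for anything else (SNVs, MNVs, longer indels, missing calls).
    """
    if "," in alt:
        return None  # multiallelic site, not handled in this sketch
    if abs(len(ref) - len(alt)) != 1 or min(len(ref), len(alt)) != 1:
        return None  # not a one-base length change next to a single anchor base
    if ref[0] != alt[0]:
        return None  # expect the usual VCF form with a shared leading base
    # Assumption: longer allele -> "I", shorter allele -> "D"
    ref_code, alt_code = ("D", "I") if len(alt) > len(ref) else ("I", "D")
    codes = [ref_code, alt_code]
    calls = gt.replace("|", "/").split("/")
    if any(c not in ("0", "1") for c in calls):
        return None  # missing or unexpected allele index
    return "".join(codes[int(c)] for c in calls)
```

The resulting two-letter code would go in the genotype column of a row for that position, alongside the SNP rows written by plink.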

ADD COMMENT • link 6.6 years ago by alec_djinn ▴ 380

0

This command returns the error:

Error: Only VCF, BCF, oxford, bgen-1.x, haps, hapslegend, A, AD, Av, ped, tped, compound-genotypes, and ind-major-bed output have been implemented so far.
End time: Thu Jul 21 22:02:20 2022

It seems this may not have been implemented yet.

ADD REPLY • link 21 months ago by Giovanni M Dall'Olio 28k

0

Still present in plink v.1.9: https://www.cog-genomics.org/plink/1.9/data#recode

Perhaps it was removed from plink2 at some point.
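
If you want to script this, here is a small Python sketch that builds the same command for a plink 1.9 binary. I am assuming the binary is on your PATH under the name `plink`; the function name and the file names are made up for illustration:

```python
import shlex


def plink19_recode23_command(vcf_path, out_prefix, plink_binary="plink"):
    """Build the argument list for a plink 1.9 VCF -> 23andMe conversion."""
    # Same options as the accepted answer, plus --out to name the output files
    return [
        plink_binary,
        "--vcf", vcf_path,
        "--snps-only",
        "--recode", "23",
        "--out", out_prefix,
    ]


if __name__ == "__main__":
    # "sample.vcf" and "sample_23andme" are placeholder names
    cmd = plink19_recode23_command("sample.vcf", "sample_23andme")
    print(shlex.join(cmd))
```

The printed line can be pasted into a shell, or the list can be passed to subprocess.run(cmd, check=True) once you have checked which plink version the binary actually is.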

ADD REPLY • link 21 months ago by GenoMax 141k