Question: convert ancestry.com text file to vcf
6 months ago by
LTDavid20
LTDavid20 wrote:

How do you convert a text file from Ancestry.com to vcf format?

I understand that I could convert from 23andMe to vcf with something like:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile -f reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa --tsv2vcf Data/SampleFile/AncestryDNA.txt -Oz -o Data/SampleFile.vcf.gz

However, Ancestry.com's files are slightly different from 23andMe files: they have five tab-delimited columns instead of 23andMe's four.

    rsid    chromosome  position    allele1 allele2
rs3131972   1   752721  A   G
rs114525117 1   759036  G   G
rs12124819  1   776546  A   A

I also tried a direct conversion, but something is wrong because it's not working:

cat SampleFile.zip|grep -v '#'|grep -v 'rsid'|awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }'|sed s/\\t23\\t/\\tX\\t\/g |sed s/\\t24\\t/\\tY\\t\/g| grep -P -v '\t25\t' >> SampleFile.txt

With Ancestry.com, the text file inside every zip has the same generic name (AncestryDNA.txt), so I need to name the converted file after the basename of the zip file I saved. For example:

SampleFile1.zip/AncestryDNA.txt > SampleFile1.txt
SampleFile2.zip/AncestryDNA.txt > SampleFile2.txt
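
The naming described above can be sketched with shell parameter expansion. This is only a minimal sketch: `txt_name_for` is a hypothetical helper name, and the `unzip -p` usage in the closing comment assumes that `unzip` is installed and that the archive member is always called AncestryDNA.txt, as in the examples above.

```shell
# Hypothetical helper: derive the per-sample text file name from a zip path.
txt_name_for() {
    zip_base=${1##*/}                     # drop any leading directory
    printf '%s\n' "${zip_base%.zip}.txt"  # swap the .zip suffix for .txt
}

# Possible use (assumes unzip is available):
#   unzip -p SampleFile1.zip AncestryDNA.txt > "$(txt_name_for SampleFile1.zip)"
```

Streaming the member to standard output avoids extracting several files that would all be named AncestryDNA.txt into the same directory.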

I'm using these files with Beagle 5.1, which makes an exception to the VCF format for male X-chromosome genotypes:

Beagle uses Variant Call Format (VCF) 4.3 for input and output genotype data, except that Beagle requires male non-pseudoautosomal X-chromosome genotypes to be coded as homozygous diploid genotypes.
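
To illustrate what that requirement means for the file contents, here is a rough awk sketch that rewrites a haploid X call such as `1` as the homozygous diploid call `1/1`. It is not taken from Beagle or bcftools: `haploid_x_to_diploid` is a hypothetical name, and the sketch assumes a single-sample, uncompressed VCF whose FORMAT column is just GT and whose X chromosome is named `X` or `chrX`.

```shell
# Hypothetical filter: recode haploid X genotypes as homozygous diploid.
# Assumes one sample column (field 10) and a FORMAT column equal to "GT".
haploid_x_to_diploid() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { print; next }            # pass header lines through unchanged
        ($1 == "X" || $1 == "chrX") && $9 == "GT" && $10 !~ "[/|]" {
            $10 = $10 "/" $10           # e.g. 0 -> 0/0, 1 -> 1/1
        }
        { print }
    '
}

# Possible use:
#   haploid_x_to_diploid < sample.vcf > sample.diploidX.vcf
```

Genotypes that already contain a `/` or `|` separator are left alone, so diploid calls pass through untouched.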

I'm using Ubuntu 18.04.3 LTS.

LTDavid20 wrote:

I decided to try converting the Ancestry.com txt file to a 23andMe-formatted txt file, which could then be used in the existing bcftools convert command. I got the format conversion from Ancestry.com to 23andMe working with this:

7z x SampleFile.zip ; mv AncestryDNA.txt SampleFile.txt
gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' SampleFile.txt
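
A slightly fuller version of the same column merge could look like the sketch below. It reads from standard input, so the original file is left untouched. `ancestry_to_23andme` is a hypothetical name. Skipping the `#` comment lines and the `rsid` header, mapping chromosome 23 to X and 24 to Y, and dropping 25 all mirror the attempt in the question, so whether those codes are right for a given file is an assumption to check. Other chromosome codes pass through unchanged.

```shell
# Hypothetical filter: five-column Ancestry.com rows -> four-column 23andMe-style rows.
ancestry_to_23andme() {
    awk -F'\t' -v OFS='\t' '
        /^#/ || $1 == "rsid" { next }   # skip comment lines and the column header
        $2 == "25" { next }             # drop chromosome code 25, as in the question
        $2 == "23" { $2 = "X" }         # assumed mapping taken from the question
        $2 == "24" { $2 = "Y" }
        { print $1, $2, $3, $4 $5 }     # join allele1 and allele2 into one genotype
    '
}

# Possible use:
#   ancestry_to_23andme < AncestryDNA.txt > SampleFile.txt
```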

To download the Homo_sapiens.GRCh37.75.dna.primary_assembly.fa reference for use in the bcftools convert command:

wget http://ftp.ensembl.org/pub/release-75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

To convert the new Ancestry.com file (now formatted like 23andMe) to VCF format:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile23 --haploid2diploid -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --tsv2vcf SampleFile23.txt -Oz -o SampleFile23.vcf.gz

This seems to work, producing records for chromosomes 1-22 (though I haven't compared the output against the original Ancestry.com file).

...
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleFile23
1       752721  rs3131972       A       G       .       .       .       GT      0/1
1       759036  rs114525117     G       A       .       .       .       GT      1/0
...
22      51064818        rs762672        T       C       .       .       .       GT      1/1
22      51064898        rs1106788       G       A       .       .       .       GT      1/0
...
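
One rough way to do the comparison mentioned above is to count genotype rows in the text file and records in the VCF. The helper names below are hypothetical, and this is only a sanity check: the two counts need not match exactly, since rows that cannot be converted may be left out of the VCF.

```shell
# Hypothetical helpers; both read from standard input.
count_tsv_rows() {
    # Count data rows, ignoring blank lines, '#' comments and the "rsid" header
    awk -F'\t' 'NF && !/^#/ && $1 != "rsid" { n++ } END { print n + 0 }'
}

count_vcf_records() {
    # Count VCF records, i.e. non-blank lines that do not start with '#'
    awk 'NF && !/^#/ { n++ } END { print n + 0 }'
}

# Possible use:
#   count_tsv_rows < SampleFile23.txt
#   gzip -dc SampleFile23.vcf.gz | count_vcf_records
```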

Okay, it looks like I've provided a rudimentary answer to my own question.

This also seems to work as a rudimentary script. The last line is still running, but the MergedSamples file already contains enough output to show that the merge is working.

#!/bin/bash

for file in inputs/*.zip; do
        echo "converting to vcf.gz: $file"
        # Every Ancestry.com zip holds a generically named AncestryDNA.txt
        7z x "$file"
        mv AncestryDNA.txt "${file%.zip}.txt"
        # Join allele1 and allele2 into one genotype column (23andMe layout)
        gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' "${file%.zip}.txt"
        bcftools convert -c ID,CHROM,POS,AA -s "${file%.zip}" \
                --haploid2diploid \
                -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
                --tsv2vcf "${file%.zip}.txt" \
                -Oz -o "${file%.zip}.vcf.gz"
done

for file in inputs/*.vcf.gz; do
        echo "indexing sample vcf file $file"
        tabix "$file"
done

cd inputs
# bcftools merge takes all input files in one call, so no loop is needed;
# the output directory has to exist before the merge starts
mkdir -p Results
bcftools merge -o Results/MergedSamples *.vcf.gz

The run time for this script was 32 minutes and 26.08 seconds on 4 vCPUs with 3.6 GB of memory, for 32 samples with about 700,000 SNPs each.
