Question: convert text file to vcf
6 months ago
LTDavid20 wrote:

How do you convert an AncestryDNA text file to VCF format?

I understand that I could convert from 23andMe to vcf with something like:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile -f reference/Homo_sapiens.GRCh37.dna.primary_assembly.fa --tsv2vcf Data/SampleFile/AncestryDNA.txt -Oz -o Data/SampleFile.vcf.gz

However, AncestryDNA files are slightly different from 23andMe files. They have five tab-delimited columns instead of the four that 23andMe uses.

rsid         chromosome  position  allele1  allele2
rs3131972    1           752721    A        G
rs114525117  1           759036    G        G
rs12124819   1           776546    A        A

I also tried a direct conversion, but I have something wrong because it's not working:

cat|grep -v '#'|grep -v 'rsid'|awk -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }'|sed s/\\t23\\t/\\tX\\t\/g |sed s/\\t24\\t/\\tY\\t\/g| grep -P -v '\t25\t' >> SampleFile.txt
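For comparison, the same filtering could be done in one awk call. This is only a sketch: it assumes the input is tab-delimited with `#` comment lines and an `rsid` header row, and it follows the same mapping as the pipeline above (23 to X, 24 to Y, chromosome 25 dropped). The function name `ancestry_to_23andme` is made up for illustration.

```shell
# Hypothetical helper: turn a five-column raw-data file (rsid, chromosome,
# position, allele1, allele2) into the four-column 23andMe-style layout.
ancestry_to_23andme() {
    awk -F'\t' -v OFS='\t' '
        { sub(/\r$/, "") }              # strip a trailing carriage return, if any
        /^#/ || $1 == "rsid" { next }   # skip comment lines and the header row
        $2 == "25"           { next }   # drop chromosome 25, as in the pipeline above
        $2 == "23"           { $2 = "X" }
        $2 == "24"           { $2 = "Y" }
        { print $1, $2, $3, $4 $5 }     # join the two alleles into one genotype field
    ' "$1"
}
```

It would be called as `ancestry_to_23andme AncestryDNA.txt > SampleFile.txt`, writing with `>` rather than `>>` so that a rerun does not append to an older result.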

With AncestryDNA, the text file inside the zip has a generic name, so I would need to use the basename that I saved the zip as for the converted file name. For example: > SampleFile1.txt > SampleFile2.txt
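One way to get that name is to strip the directory and the `.zip` suffix with `basename`. A minimal sketch, where the helper names `sample_name` and `extract_as` and the `inputs/` path are assumptions for illustration:

```shell
# Hypothetical helper: inputs/SampleFile1.zip -> SampleFile1
sample_name() {
    basename "$1" .zip
}

# Hypothetical helper: extract the generically named text file from the zip
# given as $1 and rename it after the zip (requires 7z to be installed).
extract_as() {
    7z x -y "$1" AncestryDNA.txt &&
        mv AncestryDNA.txt "$(sample_name "$1").txt"
}
```

For example, `extract_as inputs/SampleFile1.zip` would leave `SampleFile1.txt` in the current directory.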

I'm using these files for Beagle 5.1, which has an exception to the VCF format for male chromosomes:

Beagle uses Variant Call Format (VCF) 4.3 for input and output genotype data, except that Beagle requires male non-pseudoautosomal X-chromosome genotypes to be coded as homozygous diploid genotypes.
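To illustrate what that coding means, here is a rough awk sketch that rewrites haploid X genotypes as homozygous diploid. It assumes a single-sample, uncompressed VCF in which GT is the only FORMAT field, the chromosome is named `X`, and every haploid X call is non-pseudoautosomal; it does not check coordinates. The function name `haploid_x_to_diploid` is hypothetical.

```shell
# Hypothetical helper: recode haploid X calls (e.g. "1") as homozygous
# diploid calls (e.g. "1/1") in a single-sample VCF given as $1.
haploid_x_to_diploid() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { print; next }               # pass header lines through unchanged
        $1 == "X" && $10 ~ /^[0-9.]+$/ {   # genotype with a single allele
            $10 = $10 "/" $10              # repeat the allele on both sides
        }
        { print }
    ' "$1"
}
```

Genotypes that are already diploid, and all rows on other chromosomes, are printed unchanged.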

I'm using Ubuntu 18.04.3 LTS.

6 months ago
LTDavid20 wrote:

I decided to try converting the AncestryDNA txt file to a 23andMe-formatted txt file, which could then be used with the existing bcftools convert command. I got as far as converting the format from AncestryDNA to 23andMe using this:

7z x ; mv AncestryDNA.txt SampleFile.txt
gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' ${}.txt

To download Homo_sapiens.GRCh37.75.dna.primary_assembly.fa for use as the reference in the bcftools convert command:

wget 75/fasta/homo_sapiens/dna/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz
gunzip Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz

To convert the new file (formatted as 23andMe) to VCF format:

bcftools convert -c ID,CHROM,POS,AA -s SampleFile23 --haploid2diploid -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --tsv2vcf SampleFile23.txt -Oz -o SampleFile23.vcf.gz

This seems to work, listing variants for chromosomes 1 - 22 (though I haven't compared it to the original zip file).

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  SampleFile23
1       752721  rs3131972       A       G       .       .       .       GT      0/1
1       759036  rs114525117     G       A       .       .       .       GT      1/0
22      51064818        rs762672        T       C       .       .       .       GT      1/1
22      51064898        rs1106788       G       A       .       .       .       GT      1/0
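A quick way to compare the result against the input would be to count rows per chromosome on both sides. This is a sketch only; the function name `count_by_chrom` is made up, and it assumes tab-delimited text on standard input with the chromosome in the column number passed as the first argument.

```shell
# Hypothetical helper: count non-comment rows per chromosome.
# $1 is the 1-based column that holds the chromosome; data is read from stdin.
count_by_chrom() {
    awk -F'\t' -v col="$1" '
        /^#/ { next }                            # ignore header and comment lines
        { n[$col]++ }                            # tally rows per chromosome value
        END { for (c in n) print c "\t" n[c] }
    ' | LC_ALL=C sort
}
```

For the VCF this could be run as `bcftools view -H SampleFile23.vcf.gz | count_by_chrom 1`, and for the 23andMe-style text file as `count_by_chrom 2 < SampleFile23.txt`. The two sets of counts may differ wherever bcftools skipped rows during conversion.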

Okay, it looks like I provided a rudimentary answer to my own question.

And this seems to be working as a rudimentary script. The last line is still running but the MergedSamples file has been created with enough to see that it's merging.

#!/bin/bash

for file in inputs/*.zip; do
        echo "converting to vcf.gz: " "$file"
        7z x "$file"
        mv AncestryDNA.txt ${}.txt
        # join allele1 and allele2 into one genotype column (23andMe layout)
        gawk -i inplace -F'\t' '{ print $1"\t"$2"\t"$3"\t"$4$5; }' ${}.txt
        bcftools convert -c ID,CHROM,POS,AA -s ${} \
                --haploid2diploid \
                -f /home/reference/references/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa \
                --tsv2vcf ${}.txt \
                -Oz -o ${}.vcf.gz
done

for file in inputs/*.vcf.gz; do
        echo "indexing sample vcf file" "$file"
        tabix "$file"
done

cd inputs
# one call merges every sample; looping over the files would only repeat the same merge
bcftools merge -o Results/MergedSamples *.vcf.gz

The run time for this script was 32 minutes and 26.08 seconds. 4 vCPUs, 3.6 GB memory. 32 samples with about 700,000 SNPs each.

