I downloaded these csv files from the following link:
https://human.genome.dating/download/index
These csv files are comma delimited but have white spaces as well. That is each comma is followed by a white space. So I read each file using pandas and remove the white spaces using the following:
df = pd.read_csv('/Atlas/atlas.chr1.csv',skiprows = 3, sep=',',skipinitialspace=True)
And it does remove the white spaces and then I save these csv files again which I then convert to vcf using the following command:
#awk -F, '/^#/ {next;} /^VariantID/ {printf("##fileformat=VCFv4.2\n");split($0,header);for(i=5;i<=NF;i++) printf("##INFO=<ID=%s,Number=1,Type=String,Description=\"%s\">\n",$i,$i);printf("#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n");next ;} {printf("%s\t%s\t%s\t%s\t%s\t.\t.\t",$2,$3,$1,$5,$6);for(i=5;i<=NF;i++) printf("%s=%s;",header[i],$i);printf("\n");}' /Atlas/cleanAtlas/atlas.chr1.csv > /Atlas/cleanAtlas/atlas_chr1.vcf
But when I use GATK Picard tool to liftover to hg38 coordinates it gives me an error:
Exception in thread "main" java.lang.IllegalStateException: Key found in VariantContext field INFO at chr1:10642 but this key isn't defined in the VCFHeader. We require all VCFs to have complete VCF headers by default.
What it means is that it finds a space in INFO column which I don't specify in the header. But when I look into my file I don't see any space because I set skipinitialspace=True. And when I specify space in my header as:
##INFO=<ID= ,Number=1,Type=String,Description="empty">
Then liftover command from GATK Picard works. But then I worry that these might be messing up things and it might not be the correct way to do it. So can someone give an alternative solution or insights into what I might be doing wrong.