I'm having some trouble separating the gnomAD annotations from the VCF INFO field in my ANNOVAR multianno.txt file. I used bcftools to merge the database annotations into the ANNOVAR input VCF, because ANNOVAR on its own only outputs frequency data.
Here are some examples of the entries in the column I'm having trouble with. Columns are tab separated, so I am trying to essentially insert a tab at specific points in these entries.
CONTQ=93;DP=555;ECNT=4;MBQ=30,20,30;MFRL=181,172,212;MMQ=60,60,60;MPOS=6,21;OCM=0;POPAF=2.4,2.4;SEQQ=93;STRANDQ=93;TLOD=19.94,1805.91;qual=-10;filters=artifact_prone_site;*(etc etc etc etc)*
CONTQ=93;DP=801;ECNT=5;MBQ=30,10;MFRL=190,230;MMQ=60,60;MPOS=18;OCM=0;POPAF=2.4;SEQQ=2;STRANDQ=1;TLOD=3.34;qual=-10;filters=npg;*(etc etc etc)*
CONTQ=93;DP=812;ECNT=5;MBQ=30,20;MFRL=191,310;MMQ=60,60;MPOS=13;OCM=0;POPAF=2.4;SEQQ=1;STRANDQ=1;TLOD=0.024
Everything to the right of the TLOD= entry is gnomAD data. As you can see, some records have no gnomAD entry and TLOD= sometimes holds multiple comma-separated values, so I'm struggling to craft an effective regex in sed/awk.
Is there a simple programmatic way to do this? Or better yet, is there a way to get bcftools to put the gnomAD data in its own INFO column before it goes through ANNOVAR?
This is my bcftools command:
bcftools annotate --force -a ./db.vcf.gz -c INFO ./input.vcf.gz > ./output.vcf
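You could try standardizing the INFO fields in your VCF before annotating it with ANNOVAR, so that every record lists the same tags in the same order; bcftools query prints a "." for any tag a record lacks. Note that the result is headerless tab-delimited text rather than a valid VCF. Maybe something like: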
bcftools query -f '%CHROM\t%POS\t%ID\t%REF\t%ALT\t%QUAL\t%FILTER\tCONTQ=%CONTQ;DP=%DP;ECNT=%ECNT;MBQ=%MBQ;MFRL=%MFRL;MMQ=%MMQ;MPOS=%MPOS;OCM=%OCM;POPAF=%POPAF;SEQQ=%SEQQ;STRANDQ=%STRANDQ;TLOD=%TLOD;qual=%qual;filters=%filters;\n' input.vcf >> output.vcf
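For the sed route asked about in the question, here is a minimal sketch. It assumes, as the question states, that TLOD is always the last non-gnomAD tag and that "TLOD=" appears only once per line. The function name split_at_tlod is made up for illustration. It replaces the ";" after the TLOD value with a tab, and appends a tab when nothing follows, so rows without gnomAD data get an empty column and the column count stays consistent.

```shell
# Hypothetical helper: reads lines on stdin, writes them to stdout with a
# tab inserted right after the TLOD entry.
split_at_tlod() {
  # Literal tab character, so the expression does not rely on sed
  # understanding "\t" (GNU sed does, BSD sed does not).
  tab=$(printf '\t')
  # TLOD's value runs up to the next ";" or tab, so comma-separated
  # multi-values such as 19.94,1805.91 are kept whole.
  # An optional trailing ";" is swallowed and replaced by the tab.
  sed -E "s/(TLOD=[^;${tab}]*);?/\1${tab}/"
}
```

It could be run as "split_at_tlod < in.multianno.txt > out.txt" (both file names are placeholders). The header line contains no "TLOD=" and is left unchanged, so it would need an extra column name added by hand.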