Question

No gene name after annovar vcf file processing

0

Entering edit mode

4.5 years ago

valerie ▴ 100

Hi everyone, I have a vcf file for whole exome sequence. The file looks like that:

chr6    3118947 rs2231355   A   G   1071.77 PASS    AC=2;AF=1.00;AN=2;DB;DP=35;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.62;SOR=2.115;set=variant    GT:AD:DP:GQ:PL  1|1:0,35:35:90:1100,90,0

I wanted to add gene names to the file and read the annovar can be helpful for that (clinvar data is also a good thing for me). I used the following code from the manual:

./table_annovar.pl patient.final.vcf humandb/ -buildver hg19 -out patient_anno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt

But there are no gene names and no clinvar data, the output file patient_anno.hg19_multianno.csv looks like that:

chr1,860799,.,G,C,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.

What am I doing wrong? Thanks!

vcf annovar • 1.6k views

ADD COMMENT • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Are you sure there is a gene at chr1:860799?

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

There is nothing but '.' in Gene.refGene column

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Do you mean there is no value across all sites? Can you check the refGene annotation file under humandb/?

It should have gene symbols in the 13th column, I think.

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Yes, no values. hg19_refGene.txt looks pretty normal, there are gene symbols in the 13th column. I think that the issue is not with names only, but something goes wrong in general. There are no values in any columns, but the first 5 (chr, position, snp, alt, ref). But I cannot understand what's wrong.

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Can you try replacing the gx with just g and removing the -xref example?gene_xref.txt part?

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Didn't help :( I also noticed that patient.refGene.invalid_input file is twice larger then the patient.hg19_multianno.csv file

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Can you look into the log files? There should be something that points to why you're seeing invalid input.

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

The first step looks ok to me: NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variation.pl -geneanno="" -buildver="" hg19="" -dbtype="" refgene="" -outfile="" patient.refgene="" -exonsort="" patient.final.vcf="" humandb=""/> NOTICE: Output files were written to patient.refGene.variant_function, patient.refGene.exonic_variant_function NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 72212 transcripts (including 17527 without coding sequence annotation) for 28250 unique genes NOTICE: Variants with invalid input format were written to patient.refGene.invalid_input

But the second one is strange, it didn't find anything:

NOTICE: Processing operation=r protocol=cytoBand

NOTICE: Running with system command <annotate_variation.pl -regionanno="" -dbtype="" cytoband="" -buildver="" hg19="" -outfile="" patient_anno_v2="" patient.final.vcf="" humandb=""/> NOTICE: Output file is written to patient.hg19_cytoBand NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions NOTICE: Finished region-based annotation on 0 genetic variants NOTICE: Variants with invalid input format were written to patient.invalid_input

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Please copy paste the command again, this time use the code formatting button after selecting the pasted content. Right now, the website has parsed out the content you pasted as though it were HTML, which it is not.

code_formatting

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Thank you very much for your help! This is the command:

./table_annovar.pl patient.final.vcf humandb/ -buildver hg19 -out patient_anno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -csvout -polish

This is the output:

NOTICE: Processing operation=g protocol=refGene
NOTICE: Running with system command <annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -outfile patient.refGene -exonsort patient.final.vcf humandb/>
NOTICE: Output files were written to patient.refGene.variant_function, patient_anno.refGene.exonic_variant_function
NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 72212 transcripts (including 17527 without coding sequence annotation) for 28250 unique genes
NOTICE: Variants with invalid input format were written to patient_anno.refGene.invalid_input

NOTICE: Processing operation=r protocol=cytoBand

NOTICE: Running with system command <annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 -outfile patient_anno patient.final.vcf humandb/>
NOTICE: Output file is written to patient_anno.hg19_cytoBand
NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions
NOTICE: Finished region-based annotation on 0 genetic variants
NOTICE: Variants with invalid input format were written to patient_anno.invalid_input

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Can you also paste the output of

head patient_anno.invalid_input

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Head:

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=StandardFilter,Description="QD<2.0 || ReadPosRankSum<-20.0 || InbreedingCoeff<-0.8 || FS>200.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=HP,Number=.,Type=String,Description="Read-backed phasing haplotype identifiers">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PQ,Number=1,Type=Float,Description="Read-backed phasing quality">

Tail:

chrM_NC_012920.1    4769    .   A   G   6897.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=244;Dels=0.00;FS=0.000;HaplotypeScore=1.9164;MLEAC=2;MLEAF=1.00;MQ=33.75;MQ0=30;QD=28.27;SOR=3.751;set=FilteredInAll   GT:AD:DP:GQ:PL:NB   1/1:0,244:244:99:6926,560,0:N2500002
chrM_NC_012920.1    7028    .   C   T   7105.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=5.1179;MLEAC=2;MLEAF=1.00;MQ=33.33;MQ0=30;QD=28.42;SOR=1.336;set=FilteredInAll   GT:AD:DP:GQ:PL  1/1:0,250:250:99:7134,626,0
chrM_NC_012920.1    8860    .   A   G   5083.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=9.8492;MLEAC=2;MLEAF=1.00;MQ=28.02;MQ0=63;QD=20.34;SOR=0.771;set=FilteredInAll   GT:AD:DP:GQ:PL  1/1:0,249:250:99:5112,469,0
chrM_NC_012920.1    13188   .   C   T   8517.77 StandardFilter  AC=2;AF=1.00;AN=2;BaseQRankSum=-2.972;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=19.0163;MLEAC=2;MLEAF=1.00;MQ=59.85;MQ0=0;MQRankSum=1.539;QD=34.07;ReadPosRankSum=-0.915;SOR=0.595;set=FilteredInAll GT:AD:DP:GQ:PL  1/1:3,247:250:99:8546,657,0
chrM_NC_012920.1    14766   .   C   T   8097.77 PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=1.718;DP=249;Dels=0.00;FS=0.000;HaplotypeScore=10.8582;MLEAC=2;MLEAF=1.00;MQ=59.97;MQ0=0;MQRankSum=-0.768;QD=32.52;ReadPosRankSum=-0.405;SOR=0.994;set=variant   GT:AD:DP:GQ:PL  1|1:1,248:249:99:8126,695,0
chrM_NC_012920.1    15326   .   A   G   9171.77 PASS    AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=4.3890;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.71;SOR=1.552;set=variant  GT:AD:DP:GQ:PL  1|1:1,249:250:99:9200,728,0
chrM_NC_012920.1    15674   .   T   C   8449.77 PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=0.077;DP=249;Dels=0.00;FS=3.396;HaplotypeScore=5.9530;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.288;QD=33.93;ReadPosRankSum=-1.087;SOR=0.517;set=variant    GT:AD:DP:GQ:PL  1|1:1,247:249:99:8478,676,0
chrM_NC_012920.1    16126   .   T   C   2601.77 PASS    AC=2;AF=1.00;AN=2;DP=72;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.30;SOR=3.521;set=variant   GT:AD:DP:GQ:PL  1|1:0,72:72:99:2630,205,0
chrM_NC_012920.1    16362   .   T   C   516.77  PASS    AC=2;AF=1.00;AN=2;DP=16;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.30;SOR=0.941;set=variant   GT:AD:DP:GQ:PL  1|1:0,16:16:42:545,42,0
chrM_NC_012920.1    16519   .   T   C   880.77  PASS    AC=2;AF=1.00;AN=2;DP=27;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=57.16;MQ0=0;QD=32.62;SOR=0.765;set=variant   GT:AD:DP:GQ:PL  1|1:0,27:27:72:909,72,0

ADD REPLY • link updated 4.5 years ago by Ram 43k • written 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Can you check the annotation file and the VCF file to see if the contig names match? That could be a problem.

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Most of the snps are located on assembled chromosomes, and they are named like 'chr1' in both files. But it looks like there are no contiguous in the annotation file. Do you think I should try to remove them from vcf?

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Didn't help :( Now tail for the patient_anno.invalid_input file looks like this:

chrY    28583314    .   A   G   126.77  StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=1.558;DP=191;Dels=0.00;FS=16.763;HaplotypeScore=8.3809;MLEAC=1;MLEAF=0.500;MQ=43.39;MQ0=32;MQRankSum=-10.599;QD=0.66;ReadPosRankSum=-3.741;SOR=4.734;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:96,94:191:99:155,0,3007:N2400007
chrY    28583335    .   T   C   104.77  StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=4.225;DP=155;Dels=0.00;FS=1.633;HaplotypeScore=8.2860;MLEAC=1;MLEAF=0.500;MQ=46.89;MQ0=21;MQRankSum=-9.098;QD=0.68;ReadPosRankSum=4.395;SOR=2.760;set=FilteredInAll GT:AD:DP:GQ:PL  0/1:91,61:154:99:133,0,2923
chrY    28583382    .   G   C   282.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=4.635;DP=120;Dels=0.00;FS=0.000;HaplotypeScore=2.1301;MLEAC=1;MLEAF=0.500;MQ=40.95;MQ0=21;MQRankSum=-8.566;QD=2.36;ReadPosRankSum=-1.573;SOR=0.597;set=variant  GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:52,68:120:99:28582863-1,28582863-2:311,0,1736:2056.19:P2400006
chrY    28583394    .   T   C   216.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.109;DP=104;Dels=0.00;FS=4.102;HaplotypeScore=1.5741;MLEAC=1;MLEAF=0.500;MQ=44.87;MQ0=13;MQRankSum=-7.970;QD=2.08;ReadPosRankSum=0.182;SOR=0.093;set=variant   GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:55,49:104:99:28582863-1,28582863-2:245,0,1806:2193.86:P2400006
chrY    28583410    .   C   G   57.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=-2.891;DP=75;Dels=0.00;FS=2.733;HaplotypeScore=6.3727;MLEAC=1;MLEAF=0.500;MQ=48.89;MQ0=5;MQRankSum=-6.686;QD=0.77;ReadPosRankSum=2.068;SOR=0.191;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:48,27:75:86:86,0,1626:N2400008
chrY    28583415    .   T   C   52.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=-4.835;DP=70;Dels=0.00;FS=2.861;HaplotypeScore=4.7985;MLEAC=1;MLEAF=0.500;MQ=49.54;MQ0=4;MQRankSum=-6.425;QD=0.75;ReadPosRankSum=1.709;SOR=0.173;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:46,24:70:81:81,0,1597:N2400008
chrY    28583431    .   A   T   64.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=3.238;DP=61;Dels=0.00;FS=0.000;HaplotypeScore=1.9913;MLEAC=1;MLEAF=0.500;MQ=47.63;MQ0=5;MQRankSum=-5.982;QD=1.06;ReadPosRankSum=-1.031;SOR=0.720;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:38,23:61:93:93,0,1187:N2400009
chrY    28583432    rs77932961  A   G   64.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=4.539;DB;DP=65;Dels=0.00;FS=0.000;HaplotypeScore=2.5696;MLEAC=1;MLEAF=0.500;MQ=48.48;MQ0=5;MQRankSum=-6.147;QD=1.00;ReadPosRankSum=-0.862;SOR=0.719;set=FilteredInAll   GT:AD:DP:GQ:PL:NB   0/1:40,24:65:93:93,0,1080:N2400009
chrY    28583487    rs78125809  C   T   119.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-0.759;DB;DP=51;Dels=0.00;FS=2.660;HaplotypeScore=2.5672;MLEAC=1;MLEAF=0.500;MQ=51.24;MQ0=1;MQRankSum=-5.667;QD=2.35;ReadPosRankSum=-1.009;SOR=0.200;set=variant    GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:34,17:51:99:28582863-1,28582863-2:148,0,1087:957.62:P2400006
chrY    28583500    rs76728162  C   T   169.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-2.233;DB;DP=35;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=49.99;MQ0=1;MQRankSum=-4.855;QD=4.85;ReadPosRankSum=-1.807;SOR=0.324;set=variant    GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:21,14:35:99:28582863-1,28582863-2:198,0,704:1023.79:P2400006

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

Are you sure the reference sequence used in the VCF was hg19 and not hg38 or something else?

ADD REPLY • link 4.5 years ago by Ram 43k

0

Entering edit mode

Yes. Moreover, I tried using another example from the manual:

./table_annovar.pl patient.final.vcf  humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -vcfinput -polish

And I see that in the temporary file myanno.refGene.exonic_variant_function there is gene annotation! It is in awful format, but it exists. And 'myanno.invalid_input' is an empty file. This is crazy :)

ADD REPLY • link 4.5 years ago by valerie ▴ 100

0

Entering edit mode

That's really odd. I'd recommend picking the first 1000 lines of your VCF and trying the annovar command you used earlier, tweaking it until it works. The -protocol and -operation formats are pretty stupid, if you ask me.

In the meantime, you can always use VEP to get the annotations you need.

ADD REPLY • link 4.5 years ago by Ram 43k