Question: No gene name after annovar vcf file processing
0
gravatar for valerie
4 weeks ago by
valerie60
valerie60 wrote:

Hi everyone, I have a vcf file for whole exome sequence. The file looks like that:

chr6    3118947 rs2231355   A   G   1071.77 PASS    AC=2;AF=1.00;AN=2;DB;DP=35;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=30.62;SOR=2.115;set=variant    GT:AD:DP:GQ:PL  1|1:0,35:35:90:1100,90,0

I wanted to add gene names to the file and read the annovar can be helpful for that (clinvar data is also a good thing for me). I used the following code from the manual:

./table_annovar.pl patient.final.vcf humandb/ -buildver hg19 -out patient_anno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation gx,r,f,f,f -nastring . -csvout -polish -xref example/gene_xref.txt

But there are no gene names and no clinvar data, the output file patient_anno.hg19_multianno.csv looks like that:

chr1,860799,.,G,C,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.,.

What am I doing wrong? Thanks!

annovar vcf • 112 views
ADD COMMENTlink written 4 weeks ago by valerie60

Are you sure there is a gene at chr1:860799?

ADD REPLYlink written 4 weeks ago by RamRS24k

There is nothing but '.' in Gene.refGene column

ADD REPLYlink written 4 weeks ago by valerie60

Do you mean there is no value across all sites? Can you check the refGene annotation file under humandb/?

It should have gene symbols in the 13th column, I think.

ADD REPLYlink written 4 weeks ago by RamRS24k

Yes, no values. hg19_refGene.txt looks pretty normal, there are gene symbols in the 13th column. I think that the issue is not with names only, but something goes wrong in general. There are no values in any columns, but the first 5 (chr, position, snp, alt, ref). But I cannot understand what's wrong.

ADD REPLYlink written 4 weeks ago by valerie60

Can you try replacing the gx with just g and removing the -xref example?gene_xref.txt part?

ADD REPLYlink written 4 weeks ago by RamRS24k

Didn't help :( I also noticed that patient.refGene.invalid_input file is twice larger then the patient.hg19_multianno.csv file

ADD REPLYlink written 4 weeks ago by valerie60

Can you look into the log files? There should be something that points to why you're seeing invalid input.

ADD REPLYlink written 4 weeks ago by RamRS24k

The first step looks ok to me: NOTICE: Processing operation=g protocol=refGene

NOTICE: Running with system command <annotate_variation.pl -geneanno="" -buildver="" hg19="" -dbtype="" refgene="" -outfile="" patient.refgene="" -exonsort="" patient.final.vcf="" humandb=""/> NOTICE: Output files were written to patient.refGene.variant_function, patient.refGene.exonic_variant_function NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 72212 transcripts (including 17527 without coding sequence annotation) for 28250 unique genes NOTICE: Variants with invalid input format were written to patient.refGene.invalid_input

But the second one is strange, it didn't find anything:

NOTICE: Processing operation=r protocol=cytoBand

NOTICE: Running with system command <annotate_variation.pl -regionanno="" -dbtype="" cytoband="" -buildver="" hg19="" -outfile="" patient_anno_v2="" patient.final.vcf="" humandb=""/> NOTICE: Output file is written to patient.hg19_cytoBand NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions NOTICE: Finished region-based annotation on 0 genetic variants NOTICE: Variants with invalid input format were written to patient.invalid_input

ADD REPLYlink written 4 weeks ago by valerie60

Please copy paste the command again, this time use the code formatting button after selecting the pasted content. Right now, the website has parsed out the content you pasted as though it were HTML, which it is not.

code_formatting

ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by RamRS24k

Thank you very much for your help! This is the command:

./table_annovar.pl patient.final.vcf humandb/ -buildver hg19 -out patient_anno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -csvout -polish

This is the output:

NOTICE: Processing operation=g protocol=refGene
NOTICE: Running with system command <annotate_variation.pl -geneanno -buildver hg19 -dbtype refGene -outfile patient.refGene -exonsort patient.final.vcf humandb/>
NOTICE: Output files were written to patient.refGene.variant_function, patient_anno.refGene.exonic_variant_function
NOTICE: Reading gene annotation from humandb/hg19_refGene.txt ... Done with 72212 transcripts (including 17527 without coding sequence annotation) for 28250 unique genes
NOTICE: Variants with invalid input format were written to patient_anno.refGene.invalid_input

NOTICE: Processing operation=r protocol=cytoBand

NOTICE: Running with system command <annotate_variation.pl -regionanno -dbtype cytoBand -buildver hg19 -outfile patient_anno patient.final.vcf humandb/>
NOTICE: Output file is written to patient_anno.hg19_cytoBand
NOTICE: Reading annotation database humandb/hg19_cytoBand.txt ... Done with 862 regions
NOTICE: Finished region-based annotation on 0 genetic variants
NOTICE: Variants with invalid input format were written to patient_anno.invalid_input
ADD REPLYlink modified 4 weeks ago by RamRS24k • written 4 weeks ago by valerie60

Can you also paste the output of

head patient_anno.invalid_input
ADD REPLYlink written 4 weeks ago by RamRS24k

Head:

##fileformat=VCFv4.1
##FILTER=<ID=LowQual,Description="Low quality">
##FILTER=<ID=StandardFilter,Description="QD<2.0 || ReadPosRankSum<-20.0 || InbreedingCoeff<-0.8 || FS>200.0">
##FORMAT=<ID=AD,Number=.,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=HP,Number=.,Type=String,Description="Read-backed phasing haplotype identifiers">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PQ,Number=1,Type=Float,Description="Read-backed phasing quality">

Tail:

chrM_NC_012920.1    4769    .   A   G   6897.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=244;Dels=0.00;FS=0.000;HaplotypeScore=1.9164;MLEAC=2;MLEAF=1.00;MQ=33.75;MQ0=30;QD=28.27;SOR=3.751;set=FilteredInAll   GT:AD:DP:GQ:PL:NB   1/1:0,244:244:99:6926,560,0:N2500002
chrM_NC_012920.1    7028    .   C   T   7105.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=5.1179;MLEAC=2;MLEAF=1.00;MQ=33.33;MQ0=30;QD=28.42;SOR=1.336;set=FilteredInAll   GT:AD:DP:GQ:PL  1/1:0,250:250:99:7134,626,0
chrM_NC_012920.1    8860    .   A   G   5083.77 StandardFilter  AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=9.8492;MLEAC=2;MLEAF=1.00;MQ=28.02;MQ0=63;QD=20.34;SOR=0.771;set=FilteredInAll   GT:AD:DP:GQ:PL  1/1:0,249:250:99:5112,469,0
chrM_NC_012920.1    13188   .   C   T   8517.77 StandardFilter  AC=2;AF=1.00;AN=2;BaseQRankSum=-2.972;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=19.0163;MLEAC=2;MLEAF=1.00;MQ=59.85;MQ0=0;MQRankSum=1.539;QD=34.07;ReadPosRankSum=-0.915;SOR=0.595;set=FilteredInAll GT:AD:DP:GQ:PL  1/1:3,247:250:99:8546,657,0
chrM_NC_012920.1    14766   .   C   T   8097.77 PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=1.718;DP=249;Dels=0.00;FS=0.000;HaplotypeScore=10.8582;MLEAC=2;MLEAF=1.00;MQ=59.97;MQ0=0;MQRankSum=-0.768;QD=32.52;ReadPosRankSum=-0.405;SOR=0.994;set=variant   GT:AD:DP:GQ:PL  1|1:1,248:249:99:8126,695,0
chrM_NC_012920.1    15326   .   A   G   9171.77 PASS    AC=2;AF=1.00;AN=2;DP=250;Dels=0.00;FS=0.000;HaplotypeScore=4.3890;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.71;SOR=1.552;set=variant  GT:AD:DP:GQ:PL  1|1:1,249:250:99:9200,728,0
chrM_NC_012920.1    15674   .   T   C   8449.77 PASS    AC=2;AF=1.00;AN=2;BaseQRankSum=0.077;DP=249;Dels=0.00;FS=3.396;HaplotypeScore=5.9530;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;MQRankSum=-0.288;QD=33.93;ReadPosRankSum=-1.087;SOR=0.517;set=variant    GT:AD:DP:GQ:PL  1|1:1,247:249:99:8478,676,0
chrM_NC_012920.1    16126   .   T   C   2601.77 PASS    AC=2;AF=1.00;AN=2;DP=72;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=29.30;SOR=3.521;set=variant   GT:AD:DP:GQ:PL  1|1:0,72:72:99:2630,205,0
chrM_NC_012920.1    16362   .   T   C   516.77  PASS    AC=2;AF=1.00;AN=2;DP=16;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=60.00;MQ0=0;QD=32.30;SOR=0.941;set=variant   GT:AD:DP:GQ:PL  1|1:0,16:16:42:545,42,0
chrM_NC_012920.1    16519   .   T   C   880.77  PASS    AC=2;AF=1.00;AN=2;DP=27;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=2;MLEAF=1.00;MQ=57.16;MQ0=0;QD=32.62;SOR=0.765;set=variant   GT:AD:DP:GQ:PL  1|1:0,27:27:72:909,72,0
ADD REPLYlink modified 4 weeks ago by RamRS24k • written 4 weeks ago by valerie60

Can you check the annotation file and the VCF file to see if the contig names match? That could be a problem.

ADD REPLYlink modified 29 days ago • written 4 weeks ago by RamRS24k

Most of the snps are located on assembled chromosomes, and they are named like 'chr1' in both files. But it looks like there are no contiguous in the annotation file. Do you think I should try to remove them from vcf?

ADD REPLYlink written 4 weeks ago by valerie60

Didn't help :( Now tail for the patient_anno.invalid_input file looks like this:

chrY    28583314    .   A   G   126.77  StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=1.558;DP=191;Dels=0.00;FS=16.763;HaplotypeScore=8.3809;MLEAC=1;MLEAF=0.500;MQ=43.39;MQ0=32;MQRankSum=-10.599;QD=0.66;ReadPosRankSum=-3.741;SOR=4.734;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:96,94:191:99:155,0,3007:N2400007
chrY    28583335    .   T   C   104.77  StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=4.225;DP=155;Dels=0.00;FS=1.633;HaplotypeScore=8.2860;MLEAC=1;MLEAF=0.500;MQ=46.89;MQ0=21;MQRankSum=-9.098;QD=0.68;ReadPosRankSum=4.395;SOR=2.760;set=FilteredInAll GT:AD:DP:GQ:PL  0/1:91,61:154:99:133,0,2923
chrY    28583382    .   G   C   282.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=4.635;DP=120;Dels=0.00;FS=0.000;HaplotypeScore=2.1301;MLEAC=1;MLEAF=0.500;MQ=40.95;MQ0=21;MQRankSum=-8.566;QD=2.36;ReadPosRankSum=-1.573;SOR=0.597;set=variant  GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:52,68:120:99:28582863-1,28582863-2:311,0,1736:2056.19:P2400006
chrY    28583394    .   T   C   216.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=1.109;DP=104;Dels=0.00;FS=4.102;HaplotypeScore=1.5741;MLEAC=1;MLEAF=0.500;MQ=44.87;MQ0=13;MQRankSum=-7.970;QD=2.08;ReadPosRankSum=0.182;SOR=0.093;set=variant   GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:55,49:104:99:28582863-1,28582863-2:245,0,1806:2193.86:P2400006
chrY    28583410    .   C   G   57.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=-2.891;DP=75;Dels=0.00;FS=2.733;HaplotypeScore=6.3727;MLEAC=1;MLEAF=0.500;MQ=48.89;MQ0=5;MQRankSum=-6.686;QD=0.77;ReadPosRankSum=2.068;SOR=0.191;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:48,27:75:86:86,0,1626:N2400008
chrY    28583415    .   T   C   52.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=-4.835;DP=70;Dels=0.00;FS=2.861;HaplotypeScore=4.7985;MLEAC=1;MLEAF=0.500;MQ=49.54;MQ0=4;MQRankSum=-6.425;QD=0.75;ReadPosRankSum=1.709;SOR=0.173;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:46,24:70:81:81,0,1597:N2400008
chrY    28583431    .   A   T   64.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=3.238;DP=61;Dels=0.00;FS=0.000;HaplotypeScore=1.9913;MLEAC=1;MLEAF=0.500;MQ=47.63;MQ0=5;MQRankSum=-5.982;QD=1.06;ReadPosRankSum=-1.031;SOR=0.720;set=FilteredInAll  GT:AD:DP:GQ:PL:NB   0/1:38,23:61:93:93,0,1187:N2400009
chrY    28583432    rs77932961  A   G   64.77   StandardFilter  AC=1;AF=0.500;AN=2;BaseQRankSum=4.539;DB;DP=65;Dels=0.00;FS=0.000;HaplotypeScore=2.5696;MLEAC=1;MLEAF=0.500;MQ=48.48;MQ0=5;MQRankSum=-6.147;QD=1.00;ReadPosRankSum=-0.862;SOR=0.719;set=FilteredInAll   GT:AD:DP:GQ:PL:NB   0/1:40,24:65:93:93,0,1080:N2400009
chrY    28583487    rs78125809  C   T   119.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-0.759;DB;DP=51;Dels=0.00;FS=2.660;HaplotypeScore=2.5672;MLEAC=1;MLEAF=0.500;MQ=51.24;MQ0=1;MQRankSum=-5.667;QD=2.35;ReadPosRankSum=-1.009;SOR=0.200;set=variant    GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:34,17:51:99:28582863-1,28582863-2:148,0,1087:957.62:P2400006
chrY    28583500    rs76728162  C   T   169.77  PASS    AC=1;AF=0.500;AN=2;BaseQRankSum=-2.233;DB;DP=35;Dels=0.00;FS=0.000;HaplotypeScore=0.0000;MLEAC=1;MLEAF=0.500;MQ=49.99;MQ0=1;MQRankSum=-4.855;QD=4.85;ReadPosRankSum=-1.807;SOR=0.324;set=variant    GT:AD:DP:GQ:HP:PL:PQ:PB 0|1:21,14:35:99:28582863-1,28582863-2:198,0,704:1023.79:P2400006
ADD REPLYlink modified 4 weeks ago • written 4 weeks ago by valerie60

Are you sure the reference sequence used in the VCF was hg19 and not hg38 or something else?

ADD REPLYlink written 4 weeks ago by RamRS24k

Yes. Moreover, I tried using another example from the manual:

./table_annovar.pl patient.final.vcf  humandb/ -buildver hg19 -out myanno -remove -protocol refGene,cytoBand,exac03,avsnp147,dbnsfp30a -operation g,r,f,f,f -nastring . -vcfinput -polish

And I see that in the temporary file myanno.refGene.exonic_variant_function there is gene annotation! It is in awful format, but it exists. And 'myanno.invalid_input' is an empty file. This is crazy :)

ADD REPLYlink written 4 weeks ago by valerie60

That's really odd. I'd recommend picking the first 1000 lines of your VCF and trying the annovar command you used earlier, tweaking it until it works. The -protocol and -operation formats are pretty stupid, if you ask me.

In the meantime, you can always use VEP to get the annotations you need.

ADD REPLYlink written 29 days ago by RamRS24k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 969 users visited in the last hour