1, remove whitespace from column 8 ('INFO'), then zip via bgzip

Question

inquiry related to snpsift

0

Entering edit mode

2.7 years ago

rheab1230 ▴ 140

Hello everyone, I am trying to use snpsift to annotate my .vcf files using rsid in dbSNP. I am able to do so. But the output file is coming like this: 2 19477 snp_2_19477;rs138566047 Its not updating the id coloumn in .vcf files with rsid. But adding one more coloumn to the file with rsid value. Does anyone has any idea how to modify this?

rsid snp vcf • 5.0k views

ADD COMMENT • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hi, please show:

the command(s) that you are using
a sample of your input data
a sample of your output data
any status, warning, and/or error messages

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

the command:

java -jar SnpSift.jar annotate dbSnp144.vcf variants.vcf > variants_annotated.vcf

I am getting no error. the input file use is .vcf file for chr2.

2       10133   snp_2_10133     C       A       100.00  PASS    AA=.;AC=60;AF=0.03;AFR_AF=0.03;AMR_AF=0.01;AN=2184;ASN_AF=0.03;AVGPOST=0.8314;DAF_GLOBAL=.;ERATE=0.0124;EUR_AF=0.03;GERP=.;LDAF=0.1128;RSQ=0.2698;SNPSOURCE=LOWCOV;THETA=0.0141;VT=SNP;ANNOTATION_CLASS=ACTIVE_CHROM;CELL=GM12878;CHROM_STATE=14        GT:GL:DS:PP:BD  .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.00,-2.30,-5.00:0.000:.:. 0|0:-0.01,-1.87,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.22,-0.42,-1.66:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.800:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.00,-3.28,-5.00:0.000:.:. 0|0:-0.00,-2.26,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.500:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.00,-3.13,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.15,-0.54,-5.00:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.00,-3.11,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.82,-0.40,-0.35:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.00,-2.92,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.00,-2.45,-5.00:0.000:.:. 0|0:-0.04,-1.02,-5.00:0.050:.:. 0|0:-0.01,-1.87,-5.00:0.000:.:. 0|0:-0.36,-0.41,-0.75:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.00,-2.85,-5.00:0.000:.:. 0|0:-0.00,-4.70,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:0.00,-5.00,-5.00:0.000:.:.  0|0:-0.00,-3.55,

The output file is:

2       10133   snp_2_10133;rs187078949 C       A       100.0   PASS    AA=.;AC=60;AF=0.03;AFR_AF=0.03;AMR_AF=0.01;AN=2184;ASN_AF=0.03;AVGPOST=0.8314;DAF_GLOBAL=.;ERATE=0.0124;EUR_AF=0.03;GERP=.;LDAF=0.1128;RSQ=0.2698;SNPSOURCE=LOWCOV;THETA=0.0141;VT=SNP;ANNOTATION_CLASS=ACTIVE_CHROM;CELL=GM12878;CHROM_STATE=14;ASP;KGPhase1;RS=187078949;RSPOS=10133;SAO=0;SSR=0;VC=SNV;VP=0x050000000005000014000100;WGT=1;dbSNPBuildID=135  GT:GL:DS:PP:BD  .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       .:.:.:.:.       0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.00,-2.30,-5.00:0.000:.:. 0|0:-0.01,-1.87,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.22,-0.42,-1.66:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.800:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.00,-3.28,-5.00:0.000:.:. 0|0:-0.00,-2.26,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.500:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.00,-3.13,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.15,-0.54,-5.00:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.00,-3.11,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.82,-0.40,-0.35:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.00,-2.92,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.00,-2.45,

As you can see the in output file its adding rsid. but its still has previous ID from .vcf file snp_2_10133. What I want is to remove the id from .vcf file and update it to just rsid

ADD REPLY • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

I see - thanks. I am not sure that SnpSift annotate can do this, but you can first remove the original ID via bcftools, as follows:

bcftools annotate -x ID

There is no variant at that position, by the way - all entries are just 0|0 (?)

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

I actually just added one part of the input file. 8,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.36,-0.41,-0.76:0.250:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.19,-0.46,-2.06:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.450:.:. 0|0:-0.48,-0.48,-0.48:0.450:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.250:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.100:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.35,-0.41,-0.78:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.48,-0.48,-0.48:0.450:.:. 0|1:-2.82,-0.45,-0.19:1.000:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.150:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 1|0:-2.38,-0.43,-0.20:1.100:.:. 0|0:-0.48,-0.48,-0.48:0.400:.:. 0|0:-0.02,-1.46,-5.00:0.000:.:. 0|0:-0.00,-4.10,-5.00:0.000:.:. 0|0:-0.00,-2.30,-5.00:0.000:.:. 0|1:-2.69,-0.44,-0.20:1.000:.:. 0|0:-0.00,-3.02,-5.00:0.000:.:. 0|0:-0.00,-3.32,-5.00:0.000:.:. 0|0:-0.05,-0.97,-4.22:0.000:.:. 0|0:-0.00,-2.98,-5.00:0.000:.:. 0|0:-0.00,-2.15,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:-0.477139,-0.477113,-0.477113:0.150:.:. 0|0:-0.00,-2.68,-5.00:0.000:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.050:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.300:.:. 0|0:-0.48,-0.48,-0.48:0.350:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.48:0.200:.:. 0|0:-0.48,-0.48,-0.4: It has some variants.

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

I use the above command as mentioned by you. the commands:

bcftools annotate -x ID file_chr2.vcf 

the error: [W::bcf_hdr_check_sanity] GL should be declared as Number=G [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.) [W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String Encountered error, cannot proceed. Please check the error output above.

ADD REPLY • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hi again, have you tried to index the VCF as suggested?

Actually, you may need this sequence of commands:

bcftools annotate --remove FORMAT/GL -Oz file_chr2.vcf > file_chr2_fixGL.vcf.gz  ;
tabix -p vcf file_chr2_fixGL.vcf.gz ;
bcftools annotate -x ID -Ov file_chr2_fixGL.vcf.gz > file_chr2_fixGL_noID.vcf ;
java -jar SnpSift.jar annotate dbSnp144.vcf file_chr2_fixGL_noID.vcf

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi, I use the above command But I am getting error the error:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String
Encountered error, cannot proceed. Please check the error output above.
[tabix] was bgzip used to compress this file? file_chr2_fixGL.vcf.gz
Failed to open file_chr2_fixGL.vcf.gz: unknown file type

ADD REPLY • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Please paste the entire header of the file file_chr2.vcf

hint: bcftools view -h file_chr2.vcf

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

i use the command;

bcftools view -h file_chr2.vcf

This is the output:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
##fileformat=VCFv4.1
##FILTER=<ID=PASS,Description="All filters passed">
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">
##ALT=<ID=DEL,Description="Deletion">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">
##INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">
##INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AMR based on AC/AN">
##INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Allele Frequency for samples from ASN based on AC/AN">
##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">
##INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from EUR based on AC/AN">
##INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
##reference=GRCh37
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|PolyPhen|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
##INFO=<ID=A_A_CHANGE,Number=.,Type=String,Description="Geuvadis custom annotation array, amino acid change">
##INFO=<ID=A_A_LENGTH,Number=.,Type=Integer,Description="Geuvadis custom annotation array, number of amino acids in the peptide">
##INFO=<ID=A_A_POS,Number=.,Type=Integer,Description="Geuvadis custom annotation array, amino acid position in peptide">
##INFO=<ID=ANNOTATION_CLASS,Number=.,Type=String,Description="Geuvadis custom annotation array, annotation class">
##INFO=<ID=CELL,Number=.,Type=String,Description="Geuvadis custom annotation array, regulatory annotation cell type">
##INFO=<ID=CHROM_STATE,Number=.,Type=String,Description="Geuvadis custom annotation array, Encode ChromHMM">
##INFO=<ID=EXON_NUMBER,Number=.,Type=String,Description="Geuvadis custom annotation array, exon number/total exons">
##INFO=<ID=GENE_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, Ensembl gene ID">
##INFO=<ID=GENE_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, gene name">
##INFO=<ID=HGVS,Number=.,Type=String,Description="Geuvadis custom annotation array, HGVS identifier for the variant">
##INFO=<ID=INTRON_NUMBER,Number=.,Type=String,Description="Geuvadis custom annotation array, intron number/total introns">
##INFO=<ID=MIRNA_MATURE_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase mature miRNA ID">
##INFO=<ID=MIRNA_MATURE_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA mature name">
##INFO=<ID=MIRNA_PRECURSOR_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA precursor ID">
##INFO=<ID=MIRNA_PRECURSOR_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA precursor name">
##INFO=<ID=MIRNA_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA strand">
##INFO=<ID=MIRNA_TARGET,Number=.,Type=String,Description="Geuvadis custom annotation array, miRNA ID of miRNA target sites">
##INFO=<ID=MIRNA_TARGET_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, strand of miRNA target site">
##INFO=<ID=POLYPHEN,Number=.,Type=String,Description="Geuvadis custom annotation array, polyphen category and score for the amino acid change">
##INFO=<ID=REG_ANNOTATION,Number=.,Type=String,Description="Geuvadis custom annotation array, Ensembl Regulatory Build AnnotatedFeature">
##INFO=<ID=SIFT,Number=.,Type=String,Description="Geuvadis custom annotation array, sift category and score for the amino acid change">
##INFO=<ID=TF_MAT,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor matrix">
##INFO=<ID=TF_PWM_DELTA,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm change">
##INFO=<ID=TF_PWM_INFORM,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm information content">
##INFO=<ID=TF_PWM_POS,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm position">
##INFO=<ID=TRANSFAC,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor name">
##INFO=<ID=TR_BIOTYPE,Number=.,Type=String,Description="Geuvadis custom annotation array, biotype of the transcript">
##INFO=<ID=TR_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript ID">
##INFO=<ID=TR_LENGTH,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript length">
##INFO=<ID=TR_POS,Number=.,Type=String,Description="Geuvadis custom annotation array, position of variant in transcript">
##INFO=<ID=TR_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript strand">
##INFO=<ID=BP_TO_EXON,Number=.,Type=String,Description="Geuvadis custom annotation array, distance to exon boundary">
##INFO=<ID=EXON_NUMBER_NEAREST,Number=.,Type=String,Description="Geuvadis custom annotation array, Number of the nearest exon">
##INFO=<ID=A_A_TO_STOP,Number=.,Type=Integer,Description="Geuvadis custom annotation array, Number of peptides until next downstream stop">
##INFO=<ID=A_A_TO_START,Number=.,Type=Integer,Description="Geuvadis custom annotation array, next start codon">
##INFO=<ID=NMD,Number=.,Type=String,Description="Geuvadis custom annotation array, If the transcript is predicted to undergo NMD">
##INFO=<ID=MAINTAIN_FRAME,Number=.,Type=String,Description="Geuvadis custom annotation array, If exon skipping maintains frame">
##INFO=<ID=A_A_LENGTH_OLD_NEW,Number=.,Type=String,Description="Geuvadis custom annotation array, Original peptide length _ new peptide length">
##INFO=<ID=ANNOTATION_SUBCLASS,Number=.,Type=String,Description="Geuvadis custom annotation array, Annotation subtype">
##INFO=<ID=SEVERE_IMPACT,Number=.,Type=String,Description="Geuvadis custom annotation, the most severe annotation class">
##INFO=<ID=SEVERE_GENE,Number=.,Type=String,Description="Geuvadis custom annotation, gene of the most severe annotation">
##INFO=<ID=GENE_TRCOUNT_TOTAL,Number=.,Type=String,Description="Geuvadis custom annotation, number of transcripts in gene of the most severe annotation class">
##INFO=<ID=GENE_TRCOUNT_AFFECTED,Number=.,Type=String,Description="Geuvadis custom annotation, number of transcripts in gene that are affected by the most severe annotation class">
##INFO=<ID=LOF,Number=.,Type=String,Description="Geuvadis custom annotation, HC|LC for low/high confidence of the LOF call (high confidence if there are no lof_flags">
##INFO=<ID=LOF_FLAG,Number=.,Type=String,Description="Geuvadis custom annotation, LOF warning flags">
##INFO=<ID=TR_AFFECTED,Number=.,Type=String,Description="Geuvadis custom annotation, FULL if GENE_TRCOUNT_AFFECTED = GENE_TRCOUNT_TOTAL">
##INFO=<ID=ALLELE,Number=.,Type=String,Description="Geuvadis custom annotation, the annotated allele">
##INFO=<ID=DAF_GLOBAL,Number=.,Type=String,Description="Geuvadis custom annotation, global derived allele frequency in 1000g Phase1 data">
##INFO=<ID=GERP,Number=.,Type=String,Description="Geuvadis custom annotation, mammalian GERP score">
##bcftools_viewVersion=1.9+htslib-1.9
##bcftools_viewCommand=view -h GEUVADIS.chr2.genotype.vcf; Date=Sun Jul 25 13:55:19 2021
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00105 HG00107 HG00115 HG00132 HG00145 HG00157 HG00181 HG00308 HG00365 HG00371 HG00379 HG00380HG01789  HG01790 HG01791 HG02215 NA06985 NA07346 NA11832 NA11840 NA11881 NA11918 NA12005 NA12156 NA12234 NA12760 NA12762 NA12776 NA12813

ADD REPLY • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

the command:

bcftools annotate --remove FORMAT/GL file_chr2.vcf | head

the output: [W::bcf_hdr_check_sanity] GL should be declared as Number=G [W::vcf_parse] Contig '2' is not defined in the header. (Quick workaround: index the file with tabix.)

fileformat=VCFv4.1

FILTER=<ID=PASS,Description="All filters passed">

INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">

INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">

INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">

INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

[W::vcf_parse_format] FORMAT 'PP' is not defined in the header, assuming Type=String [W::vcf_parse_format] FORMAT 'BD' is not defined in the header, assuming Type=String Encountered error, cannot proceed. Please check the error output above.

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Can you please make your file available somewhere so that I can download it and try it myself?

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

the command: bgzip file_chr2.vcf ; tabix -p vcf file_chr2.vcf.gz ;

the output: unrecognized preset file_chr2.vcf

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Sure. this is the link to download the .vcf file for chr2. curl -o $SCRATCH/GEUVADIS/vcf_files/orig_vcfs/GEUVADIS.chr2.genotype.vcf.gz https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr2.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

Kevin Blighe · Answer 1 · 2021-07-25

1

Entering edit mode

2.7 years ago

Kevin Blighe 87k

Thanks very much for providing the VCF that you're using. For others, it's

https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr2.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

This VCF is corrupt and does not conform to the VCF specification. It has the following issues:

whitespace in 'INFO' column
contig '2' not defined in header
'FORMAT/GL' should be declared as Number
'FORMAT/PP' not defined in header
'FORMAT/BD' not defined in header

I was able to fix the VCF with these commands (below). Unfortunately, the 'FORMAT' field is a complete mess, so, I made an 'executive' decision to remove it, leaving just 'FORMAT/GT'. This loses some info, but leaves you with a validated VCF for anything else that you may want to do.

1, remove whitespace from column 8 ('INFO'), then zip via bgzip

zcat GEUVADIS.chr2.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz |\
  sed 's/ damaging/_damaging/g' |\
  bgzip > test.vcf.gz ;

2, tab-index (fixes contig '2' issue)

tabix -p vcf test.vcf.gz ;

3, BCFtools to remove all 'FORMAT' tags except `GT`

bcftools annotate -x 'FORMAT' --force test.vcf.gz -Oz > test.fixed.vcf.gz ;

This will initially show the warnings relating to 'FORMAT', but the use of --force allows us to skip these warnings. Also, by removing the problematic 'FORMAT' tags, we avoid the subsequent segmentation fault that occurs.

4, BCFtools to remove 'ID' field

bcftools annotate -x ID test.fixed.vcf.gz -Oz > test.fixed.noID.vcf.gz ;

5, SnpSift

java -jar SnpSift.jar annotate dbSnp144.vcf test.fixed.noID.vcf.gz

Kevin

ADD COMMENT • link 2.7 years ago by Kevin Blighe 87k

1

Entering edit mode

I have just properly tested these on the complete VCF, and they work. Can you verify in your own time? - thanks.

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello kevin, thank you so much for such a great detail reply. I am doing this step but its showing error. the command: bcftools annotate -x 'FORMAT' -force test.vcf.gz -Oz > test.fixed.vcf.g the error: annotate: unrecognized option '--force'

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hi, there should be 2 hyphens there: --force (- - force)

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello kevin, I am writing this command:

bcftools annotate -x 'FORMAT' - - force test.vcf.gz -Oz > test.fixed.vcf.g

but its taking a lot of time. Its been almost 18-20 hours and still the output hasn't been made.

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hey, oh, no, that should be:

bcftools annotate -x 'FORMAT' --force test.vcf.gz -Oz > test.fixed.vcf.gz

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

hey, when I am using the above command the command:

bcftools annotate -x 'FORMAT' --force test.vcf.gz -Oz > test.fixed.vcf.gz

I am getting an error: annotate: unrecognized option '--force

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

That is strange. Is --force listed when you just run this:

bcftools annotate

?

Which version are you using? - type bcftools --version

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

I am using version bcftools 1.9 and htslib 1.9 And also when i use bcftools annotate --format function is not lister

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

I am using the same version, and --force is listed:

bcftools annotate 

About:   Annotate and edit VCF/BCF files.
Usage:   bcftools annotate [options] <in.vcf.gz>

Options:
   -a, --annotations <file>       VCF file or tabix-indexed file with annotations: CHR\tPOS[\tVALUE]+
       --collapse <string>        matching records by <snps|indels|both|all|some|none>, see man page for details [some]
   -c, --columns <list>           list of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. See man page for details
   -e, --exclude <expr>           exclude sites for which the expression is true (see man page for details)
       --force                    continue despite parsing error (at your own risk!)
   -h, --header-lines <file>      lines which should be appended to the VCF header
   -I, --set-id [+]<format>       set ID column, see man page for details
   -i, --include <expr>           select sites for which the expression is true (see man page for details)
   -k, --keep-sites               leave -i/-e sites unchanged instead of discarding them
   -l, --merge-logic <tag:type>   merge logic for multiple overlapping regions (see man page for details), EXPERIMENTAL
   -m, --mark-sites [+-]<tag>     add INFO/tag flag to sites which are ("+") or are not ("-") listed in the -a file
       --no-version               do not append version and command line to the header
   -o, --output <file>            write output to a file [standard output]
   -O, --output-type <b|u|z|v>    b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
   -r, --regions <region>         restrict to comma-separated list of regions
   -R, --regions-file <file>      restrict to regions listed in a file
       --rename-chrs <file>       rename sequences according to map file: from\tto
   -s, --samples [^]<list>        comma separated list of samples to annotate (or exclude with "^" prefix)
   -S, --samples-file [^]<file>   file of samples to annotate (or exclude with "^" prefix)
   -x, --remove <list>            list of annotations (e.g. ID,INFO/DP,FORMAT/DP,FILTER) to remove (or keep with "^" prefix). See man page for details
       --threads <int>            number of extra output compression threads [0]


bcftools --version
bcftools 1.9-152-g27033c6
Using htslib 1.9-149-gf5b75ff
Copyright (C) 2018 Genome Research Ltd.
License Expat: The MIT/Expat license
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

In my case its not mentioned.

 -a, --annotations <file>       VCF file or tabix-indexed file with annotations: CHR\tPOS[\tVALUE]+
       --collapse <string>        matching records by <snps|indels|both|all|some|none>, see man page for details [some]
   -c, --columns <list>           list of columns in the annotation file, e.g. CHROM,POS,REF,ALT,-,INFO/TAG. See man page for details
   -e, --exclude <expr>           exclude sites for which the expression is true (see man page for details)
   -h, --header-lines <file>      lines which should be appended to the VCF header
   -I, --set-id [+]<format>       set ID column, see man page for details
   -i, --include <expr>           select sites for which the expression is true (see man page for details)
   -k, --keep-sites               leave -i/-e sites unchanged instead of discarding them
   -m, --mark-sites [+-]<tag>     add INFO/tag flag to sites which are ("+") or are not ("-") listed in the -a file
       --no-version               do not append version and command line to the header
   -o, --output <file>            write output to a file [standard output]
   -O, --output-type <b|u|z|v>    b: compressed BCF, u: uncompressed BCF, z: compressed VCF, v: uncompressed VCF [v]
   -r, --regions <region>         restrict to comma-separated list of regions
   -R, --regions-file <file>      restrict to regions listed in a file
       --rename-chrs <file>       rename sequences according to map file: from\tto
   -s, --samples [^]<list>        comma separated list of samples to annotate (or exclude with "^" prefix)
   -S, --samples-file [^]<file>   file of samples to annotate (or exclude with "^" prefix)
   -x, --remove <list>            list of annotations (e.g. ID,INFO/DP,FORMAT/DP,FILTER) to remove (or keep with "^" prefix). See man page for details
       --threads <int>

ADD REPLY • link updated 2.7 years ago by Kevin Blighe 87k • written 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Interesting, can you confirm the exact version 1.9 build, basically, please show the output of bcftools --version? Also, which operating system are you using?

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Hey Kevin, its working now. Earlier I use conda to install the software. But that is not working. So I went to bcftools site and downloaded the software. and now when I do bcftools annotate --force option is coming as a parameter. Thank you so much for your help.

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

I have one question related to earlier step where you are formating the .vcf file. You are removing white space and removing format tags except GT. Is there any specific format for header files for vcf? Also how do we know when is the header file wrong? and how do we modify that so that we do not remove necessary information from .vcf files?

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

As with many things in bioinformatics, the VCF specification has been misused over the years. Take into account, also, that different programs check for different things when parsing / reading a VCF. For example, we see this in your case, whereby BCFtools encountered warnings and errors when reading your VCF, while SnpSift tolerated these.

You can regard this as the 'ground truth' of what should be in a 'minimal' VCF: https://samtools.github.io/hts-specs/VCFv4.2.pdf

By minimal, we mean the minimum amount of information [and in which format] such that it meets the VCF specification.

In this case, the FORMAT field in your VCF was a complete mess. Not your fault, of course. The information that we eliminated was unimportant, being just the GL, DS, PP, and BD tags. GT is most important.

Oh, regarding the whitespace, we are actually just replacing all instances of ' damaging' with '_damaging'. This was what caused that particular issue.

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

1

Entering edit mode

ohh okay. now i get it. Thank you so much Kevin for helping me out.

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hi kevin, I have one more doubt. When I am annotating vcf file with rsid. some of them are not getting rsid. like this: 4 53498 . A C 100.0 PASS In case of rsid its showing one dot. Should I use some other dbSNP file probably to annotate?

ADD REPLY • link 2.7 years ago by rheab1230 ▴ 140

0

Entering edit mode

Hi rheab, this is hg19 / GRCh37, right? There is no known SNP at this position; so, you can consider this variant as 'novel' (new). You can check for updated information at UCSC Genome Browser: https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg19&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr4%3A53498%2D53498&hgsid=1135020491_Hv9A4wACeqSQbQINUU3ItFLnaCpA

ADD REPLY • link 2.7 years ago by Kevin Blighe 87k

0

Entering edit mode

Yes, Its hg19. Ohh okay. So that means if I don't get any rsid in that position I should check genome browser to see if there is any known SNP present at that position or not.

fileformat=VCFv4.1

FILTER=<ID=PASS,Description="All filters passed">

INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">

INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">

INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">

INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">

INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">

INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">

INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">

INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">

1, remove whitespace from column 8 ('INFO'), then zip via bgzip

2, tab-index (fixes contig '2' issue)

3, BCFtools to remove all 'FORMAT' tags except GT

4, BCFtools to remove 'ID' field

5, SnpSift

3, BCFtools to remove all 'FORMAT' tags except `GT`