Question

Inquiry related to vcf file and formatting


2.7 years ago

rheab1230 ▴ 140

Hello everyone,

I am trying to run the PrediXcan software, but it fails with a segmentation fault, which suggests that something is wrong with my VCF files.

Here is the header of the VCF file:

##fileformat=VCFv4.1
##INFO=<ID=LDAF,Number=1,Type=Float,Description="MLE Allele Frequency Accounting for LD">
##INFO=<ID=AVGPOST,Number=1,Type=Float,Description="Average posterior probability from MaCH/Thunder">
##INFO=<ID=RSQ,Number=1,Type=Float,Description="Genotype imputation quality from MaCH/Thunder">
##INFO=<ID=ERATE,Number=1,Type=Float,Description="Per-marker Mutation rate from MaCH/Thunder">
##INFO=<ID=THETA,Number=1,Type=Float,Description="Per-marker Transition rate from MaCH/Thunder">
##INFO=<ID=CIEND,Number=2,Type=Integer,Description="Confidence interval around END for imprecise variants">
##INFO=<ID=CIPOS,Number=2,Type=Integer,Description="Confidence interval around POS for imprecise variants">
##INFO=<ID=END,Number=1,Type=Integer,Description="End position of the variant described in this record">
##INFO=<ID=HOMLEN,Number=.,Type=Integer,Description="Length of base pair identical micro-homology at event breakpoints">
##INFO=<ID=HOMSEQ,Number=.,Type=String,Description="Sequence of base pair identical micro-homology at event breakpoints">
##INFO=<ID=SVLEN,Number=1,Type=Integer,Description="Difference in length between REF and ALT alleles">
##INFO=<ID=SVTYPE,Number=1,Type=String,Description="Type of structural variant">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Alternate Allele Count">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total Allele Count">
##ALT=<ID=DEL,Description="Deletion">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=DS,Number=1,Type=Float,Description="Genotype dosage from MaCH/Thunder">
##FORMAT=<ID=GL,Number=.,Type=Float,Description="Genotype Likelihoods">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele, ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/pilot_data/technical/reference/ancestral_alignments/README">
##INFO=<ID=AF,Number=1,Type=Float,Description="Global Allele Frequency based on AC/AN">
##INFO=<ID=AMR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AMR based on AC/AN">
##INFO=<ID=ASN_AF,Number=1,Type=Float,Description="Allele Frequency for samples from ASN based on AC/AN">
##INFO=<ID=AFR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from AFR based on AC/AN">
##INFO=<ID=EUR_AF,Number=1,Type=Float,Description="Allele Frequency for samples from EUR based on AC/AN">
##INFO=<ID=VT,Number=1,Type=String,Description="indicates what type of variant the line represents">
##INFO=<ID=SNPSOURCE,Number=.,Type=String,Description="indicates if a snp was called when analysing the low coverage or exome alignment data">
##reference=GRCh37
##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as predicted by VEP. Format: Allele|Gene|Feature|Feature_type|Consequence|cDNA_position|CDS_position|Protein_position|Amino_acids|Codons|Existing_variation|PolyPhen|SIFT|MOTIF_NAME|MOTIF_POS|HIGH_INF_POS|MOTIF_SCORE_CHANGE">
##INFO=<ID=A_A_CHANGE,Number=.,Type=String,Description="Geuvadis custom annotation array, amino acid change">
##INFO=<ID=A_A_LENGTH,Number=.,Type=Integer,Description="Geuvadis custom annotation array, number of amino acids in the peptide">
##INFO=<ID=A_A_POS,Number=.,Type=Integer,Description="Geuvadis custom annotation array, amino acid position in peptide">
##INFO=<ID=ANNOTATION_CLASS,Number=.,Type=String,Description="Geuvadis custom annotation array, annotation class">
##INFO=<ID=CELL,Number=.,Type=String,Description="Geuvadis custom annotation array, regulatory annotation cell type">
##INFO=<ID=CHROM_STATE,Number=.,Type=String,Description="Geuvadis custom annotation array, Encode ChromHMM">
##INFO=<ID=EXON_NUMBER,Number=.,Type=String,Description="Geuvadis custom annotation array, exon number/total exons">
##INFO=<ID=GENE_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, Ensembl gene ID">
##INFO=<ID=GENE_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, gene name">
##INFO=<ID=HGVS,Number=.,Type=String,Description="Geuvadis custom annotation array, HGVS identifier for the variant">
##INFO=<ID=INTRON_NUMBER,Number=.,Type=String,Description="Geuvadis custom annotation array, intron number/total introns">
##INFO=<ID=MIRNA_MATURE_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase mature miRNA ID">
##INFO=<ID=MIRNA_MATURE_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA mature name">
##INFO=<ID=MIRNA_PRECURSOR_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA precursor ID">
##INFO=<ID=MIRNA_PRECURSOR_NAME,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA precursor name">
##INFO=<ID=MIRNA_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, miRBase miRNA strand">
##INFO=<ID=MIRNA_TARGET,Number=.,Type=String,Description="Geuvadis custom annotation array, miRNA ID of miRNA target sites">
##INFO=<ID=MIRNA_TARGET_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, strand of miRNA target site">
##INFO=<ID=POLYPHEN,Number=.,Type=String,Description="Geuvadis custom annotation array, polyphen category and score for the amino acid change">
##INFO=<ID=REG_ANNOTATION,Number=.,Type=String,Description="Geuvadis custom annotation array, Ensembl Regulatory Build AnnotatedFeature">
##INFO=<ID=SIFT,Number=.,Type=String,Description="Geuvadis custom annotation array, sift category and score for the amino acid change">
##INFO=<ID=TF_MAT,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor matrix">
##INFO=<ID=TF_PWM_DELTA,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm change">
##INFO=<ID=TF_PWM_INFORM,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm information content">
##INFO=<ID=TF_PWM_POS,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor pwm position">
##INFO=<ID=TRANSFAC,Number=.,Type=String,Description="Geuvadis custom annotation array, transcription factor name">
##INFO=<ID=TR_BIOTYPE,Number=.,Type=String,Description="Geuvadis custom annotation array, biotype of the transcript">
##INFO=<ID=TR_ID,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript ID">
##INFO=<ID=TR_LENGTH,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript length">
##INFO=<ID=TR_POS,Number=.,Type=String,Description="Geuvadis custom annotation array, position of variant in transcript">
##INFO=<ID=TR_STRAND,Number=.,Type=String,Description="Geuvadis custom annotation array, transcript strand">
##INFO=<ID=BP_TO_EXON,Number=.,Type=String,Description="Geuvadis custom annotation array, distance to exon boundary">
##INFO=<ID=EXON_NUMBER_NEAREST,Number=.,Type=String,Description="Geuvadis custom annotation array, Number of the nearest exon">
##INFO=<ID=A_A_TO_STOP,Number=.,Type=Integer,Description="Geuvadis custom annotation array, Number of peptides until next downstream stop">
##INFO=<ID=A_A_TO_START,Number=.,Type=Integer,Description="Geuvadis custom annotation array, next start codon">
##INFO=<ID=NMD,Number=.,Type=String,Description="Geuvadis custom annotation array, If the transcript is predicted to undergo NMD">
##INFO=<ID=MAINTAIN_FRAME,Number=.,Type=String,Description="Geuvadis custom annotation array, If exon skipping maintains frame">
##INFO=<ID=A_A_LENGTH_OLD_NEW,Number=.,Type=String,Description="Geuvadis custom annotation array, Original peptide length _ new peptide length">
##INFO=<ID=ANNOTATION_SUBCLASS,Number=.,Type=String,Description="Geuvadis custom annotation array, Annotation subtype">
##INFO=<ID=SEVERE_IMPACT,Number=.,Type=String,Description="Geuvadis custom annotation, the most severe annotation class">
##INFO=<ID=SEVERE_GENE,Number=.,Type=String,Description="Geuvadis custom annotation, gene of the most severe annotation">
##INFO=<ID=GENE_TRCOUNT_TOTAL,Number=.,Type=String,Description="Geuvadis custom annotation, number of transcripts in gene of the most severe annotation class">
##INFO=<ID=GENE_TRCOUNT_AFFECTED,Number=.,Type=String,Description="Geuvadis custom annotation, number of transcripts in gene that are affected by the most severe annotation class">
##INFO=<ID=LOF,Number=.,Type=String,Description="Geuvadis custom annotation, HC|LC for low/high confidence of the LOF call (high confidence if there are no lof_flags">
##INFO=<ID=LOF_FLAG,Number=.,Type=String,Description="Geuvadis custom annotation, LOF warning flags">
##INFO=<ID=TR_AFFECTED,Number=.,Type=String,Description="Geuvadis custom annotation, FULL if GENE_TRCOUNT_AFFECTED = GENE_TRCOUNT_TOTAL">
##INFO=<ID=ALLELE,Number=.,Type=String,Description="Geuvadis custom annotation, the annotated allele">
##INFO=<ID=DAF_GLOBAL,Number=.,Type=String,Description="Geuvadis custom annotation, global derived allele frequency in 1000g Phase1 data">
##INFO=<ID=GERP,Number=.,Type=String,Description="Geuvadis custom annotation, mammalian GERP score">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  HG00105 HG00107 HG00115 HG00132 HG00145 HG00157 HG00181 HG00308 HG00365 HG00371 HG00379 HG00380 HG01789 HG01790 HG01791 HG02215 NA06985 NA07346 NA11832 NA11840 NA11881 NA11918 NA12005 NA12156 NA12234 NA12760 NA12762 NA12776 NA12813 NA18488 NA19092 NA19141 NA19143

I tried to fix my VCF file with bcftools annotate. The command I used is:

bcftools annotate  -x FORMAT --force GEUVADIS.chr22.genotype_updated2.vcf -Oz > GEUVADIS.chr22.genotype_updated_2.vcf

But I get the following warnings and errors:

[W::bcf_hdr_check_sanity] GL should be declared as Number=G
[W::vcf_parse] Contig '22' is not defined in the header. (Quick workaround: index the file with tabix.)
[W::vcf_parse_format] FORMAT 'PP' at 22:16050678 is not defined in the header, assuming Type=String
[W::vcf_parse_format] FORMAT 'BD' at 22:16050678 is not defined in the header, assuming Type=String
Warning: Encountered an error, proceeding only because --force was given.
         Note that this can result in a segfault or a silent corruption of the output file!
[E::vcf_format] Invalid BCF, CONTIG id=0 not present in the header
[main_vcfannotate] Error: failed to write to -

vcf bcftools • 1.2k views

updated 13 months ago by Ram 43k • written 2.7 years ago by rheab1230 ▴ 140


Your VCF file is missing the section with contig lines. How was it generated?
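
One quick way to confirm this is to count the `##contig` lines in the header. The following is only a minimal sketch: the function name `count_contig_lines` and the demo file are made up for illustration, and it assumes an uncompressed VCF (for a `.vcf.gz`, decompress it first, for example with `gzip -dc`).

```shell
# Count the ##contig lines in the header of a plain-text VCF.
# Reading stops at the first non-header line, so large files are cheap to check.
count_contig_lines() {
  awk '/^##contig=/ { n++ } !/^#/ { exit } END { print n + 0 }' "$1"
}

# Hypothetical demo file with no contig line, like the header shown above.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t16050678\t.\tA\tG\t.\tPASS\t.\n' > "$demo"

count_contig_lines "$demo"
```

A count of zero on a real file would match the "Contig '22' is not defined in the header" warning from bcftools.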

2.7 years ago by Ram 43k


I downloaded the VCF files from this site: https://www.ebi.ac.uk/arrayexpress/files/E-GEUV-1/GEUVADIS.chr1.PH1PH2_465.IMPFRQFILT_BIALLELIC_PH.annotv2.genotypes.vcf.gz

2.7 years ago by rheab1230 ▴ 140


These are per-chromosome files, but they should still contain one contig line. Send ArrayExpress an email and check with them. In the meantime, read the VCF specification and try adding a contig line with the appropriate contig name and see if that works.
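
As a rough illustration of adding a contig line, the sketch below inserts one just before the `#CHROM` line. It is an assumption-laden example, not a tested fix for these files: `add_contig_line` and the demo file are hypothetical names, the input is assumed to be an uncompressed VCF, and the line carries only an `ID` (a `length=` field could be added if you know it from your reference).

```shell
# Insert a ##contig header line immediately before the #CHROM line.
# $1 = contig name; it must match the value in the CHROM column (e.g. 22)
# $2 = input VCF (plain text); the patched VCF is written to stdout
add_contig_line() {
  awk -v id="$1" '
    /^#CHROM/ && !seen { print "##contig=<ID=" id ">"; seen = 1 }
    { print }
  ' "$2"
}

# Hypothetical demo file standing in for a per-chromosome VCF.
demo=$(mktemp)
printf '##fileformat=VCFv4.1\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n22\t16050678\t.\tA\tG\t.\tPASS\t.\n' > "$demo"

add_contig_line 22 "$demo"
```

On a real file you would redirect the output to a new file and rerun bcftools on it to see whether the contig warning disappears. The undefined `PP` and `BD` FORMAT tags in the log are a separate header problem that this does not address.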

2.7 years ago by Ram 43k


So I can write the contig line on my own? And would the contig line be the same for all chromosomes?