My VCF file has an unusual structure and is not compatible with many data-processing programs.
11 months ago
kgwkk2 • 0

This is what a normal VCF header looks like:

## [1] "##fileformat=VCFv4.1"
## [1] "##source=\"GATK haplotype Caller, phased with beagle4\""
## [1] "##FILTER=<ID=LowQual,Description=\"Low quality\">"
## [1] "##FORMAT=<ID=AD,Number=.,Type=Integer,Description=\"Allelic depths fo [Truncated]"
## [1] "##FORMAT=<ID=DP,Number=1,Type=Integer,Description=\"Approximate read  [Truncated]"
## [1] "##FORMAT=<ID=GQ,Number=1,Type=Integer,Description=\"Genotype Quality\">"
## [1] "First 6 rows."
## [1] 
## [1] "***** Fixed section *****"
##      CHROM              POS   ID REF ALT QUAL     FILTER
## [1,] "Supercontig_1.50" "2"   NA "T" "A" "44.44"  NA    
## [2,] "Supercontig_1.50" "246" NA "C" "G" "144.21" NA    
## [3,] "Supercontig_1.50" "549" NA "A" "C" "68.49"  NA    
## [4,] "Supercontig_1.50" "668" NA "G" "C" "108.07" NA    
## [5,] "Supercontig_1.50" "765" NA "A" "C" "92.78"  NA    
## [6,] "Supercontig_1.50" "780" NA "G" "T" "58.38"  NA    
## [1] 
## [1] "***** Genotype section *****"
##      FORMAT           BL2009P4_us23               DDR7602                  
## [1,] "GT:AD:DP:GQ:PL" "0|0:62,0:62:99:0,190,2835" "0|0:12,0:12:39:0,39,585"
## [2,] "GT:AD:DP:GQ:PL" "1|0:5,5:10:99:111,0,114"   NA                       
## [3,] "GT:AD:DP:GQ:PL" NA                          NA                       
## [4,] "GT:AD:DP:GQ:PL" "0|0:1,0:1:3:0,3,44"        NA                       
## [5,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"        "0|0:1,0:1:3:0,3,34"     
## [6,] "GT:AD:DP:GQ:PL" "0|0:2,0:2:6:0,6,49"        "0|0:1,0:1:3:0,3,34"     
##      IN2009T1_us22               LBUS5                     NL07434             
## [1,] "0|0:37,0:37:99:0,114,1709" "0|0:12,0:12:39:0,39,585" NA                  
## [2,] "0|1:2,1:3:16:16,0,48"      NA                        NA                  
## [3,] "0|0:2,0:2:6:0,6,51"        NA                        NA                  
## [4,] "1|1:0,1:1:3:25,3,0"        NA                        "0|0:1,0:1:3:0,3,28"
## [5,] "0|0:1,0:1:3:0,3,31"        "0|0:1,0:1:3:0,3,34"      "0|0:1,0:1:3:0,3,26"
## [6,] "0|0:3,0:3:9:0,9,85"        "0|0:1,0:1:3:0,3,34"      NA                  
## [1] "First 6 columns only."

My VCF file, shown further down, looks different. It is a multi-sample VCF, but the header seems unusually long and strange to me, and the INFO and genotype sections do not look normal either.

I started from Illumina FASTQ data and ran BWA, SAMtools, HaplotypeCaller, GenotypeGVCFs, SelectVariants, and VariantFiltration. Is this output normal? Almost half of the tools that take a VCF as input fail when I run them on this file. Please tell me what the problem is.

##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##ALT=<ID=NON_REF,Description="Represents any possible alternative allele not already represented at this location by REF and ALT">
##FILTER=<ID=FILTER,Description="QD<2.0||((MQ<40.0||RankSum<-12.5||ReadPosRankSum<-8.0||FS>60.0||SOR>3.0)&&TYPE='snp')||((ReadPosRankSum<-20.0||FS>200.0||SOR>10.0)&&TYPE='indel')">
##FILTER=<ID=LowQual,Description="Low quality">
##FORMAT=<ID=AD,Number=R,Type=Integer,Description="Allelic depths for the ref and alt alleles in the order listed">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Approximate read depth (reads with MQ=255 or with bad mates are filtered)">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=MIN_DP,Number=1,Type=Integer,Description="Minimum DP observed within the GVCF block">
##FORMAT=<ID=PGT,Number=1,Type=String,Description="Physical phasing haplotype information, describing how the alternate alleles are phased in relation to one another; will always be heterozygous and is not intended to describe called alleles">
##FORMAT=<ID=PID,Number=1,Type=String,Description="Physical phasing ID information, where each unique ID within a given sample (but not across samples) connects records within a phasing group">
##FORMAT=<ID=PL,Number=G,Type=Integer,Description="Normalized, Phred-scaled likelihoods for genotypes as defined in the VCF specification">
##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phasing set (typically the position of the first variant in the set)">
##FORMAT=<ID=RGQ,Number=1,Type=Integer,Description="Unconditional reference genotype confidence, encoded as a phred quality -10*log10 p(genotype call is wrong)">
##FORMAT=<ID=SB,Number=4,Type=Integer,Description="Per-sample component statistics which comprise the Fisher's Exact Test to detect strand bias.">
##GATKCommandLine=<ID=GenotypeGVCFs,CommandLine="GenotypeGVCFs --output 100-8-30h.geno.vcf --variant 100-8-30h.vcf --reference NRRL3357.fa --include-non-variant-sites false --merge-input-intervals false --input-is-somatic false --tumor-lod-to-emit 3.5 --allele-fraction-error 0.001 --keep-combined-raw-annotations false --use-posteriors-to-calculate-qual false --dont-use-dragstr-priors false --use-new-qual-calculator true --annotate-with-num-discovered-alleles false....

(omit....)

=Float,Description="Symmetric Odds Ratio of 2x2 contingency table to detect strand bias">
##contig=<ID=NC_054691.1,length=6386556>
##contig=<ID=NC_054692.1,length=6246150>
##contig=<ID=NC_054693.1,length=5100955>
##contig=<ID=NC_054694.1,length=4658713>
##contig=<ID=NC_054695.1,length=4453722>
##contig=<ID=NC_054696.1,length=3936580>
##contig=<ID=NC_054697.1,length=3033036>
##contig=<ID=NC_054698.1,length=3179870>
##source=GenotypeGVCFs
##source=HaplotypeCaller
##source=VariantFiltration
##bcftools_mergeVersion=1.16-7-gf4dee4b+htslib-1.16-11-ga1dec95
##bcftools_mergeCommand=merge --no-index -o Merged.vcf.gz 100-8-30h.filtered.vcf.gz NRRL30797.filtered.vcf.gz 100-8-36h.filtered.vcf.gz NRRL35739.filtered.vcf.gz 100-8-42h.filtered.vcf.gz RIB537.filtered.vcf.gz 14160.filtered.vcf.gz ... (omit, multi calling samples name)....  MWX2.filtered.vcf.gz Yazoo-S2.filtered.vcf.gz; Date=Wed May  3 16:57:10 2023
#CHROM  POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  100-8-30h   NRRL30797   100-8-36h   NRRL35739   100-8-42h   RIB537  14160   RIB949  2017-Washington-T2  SD016   2017-Washington-T5  SD022   3-042-30h   SD035   3-042-36h   SD039   3-042-42h   SD061   A1  SD24    A9  SD45    AF36    SD59    AR018   SL005   AR028   SL01    Afla-Guard  SL015   Aor-06  SL034   Aor-17  SL041   Aor-34  SL044   Aor-38  SL055   BP2-1   SL08    CA14    SL46    CF1 SU-16   CF2 SW1 CF3 TK-1    E1402   TK-10   E1404   TK-11   E1406   TK-12   E1445   TK-13   HK1 TK-14   K54A    TK-15   K93210  TK-2    M2040   TK-20   MRI19   TK-24   MWA1    TK-26   MWA2    TK-4    MWA3    TK-5    MWB1    TK-59   MWB2    TK-60   MWB3    TK-7    MWC1    TK-9    MWC2    Tox4    MWC3    VCG1    MWX1    WRRL1519    MWX2    Yazoo-S2
NC_054691.1 59  .   G   T   34.64   PASS    BaseQRankSum=0.088;ExcessHet=3.0103;FS=4.506;MQ=60;MQRankSum=0;QD=2.16;ReadPosRankSum=-1.512;SOR=0.16;DP=23;AF=0.5;MLEAC=1;MLEAF=0.5;AN=2;AC=1  GT:AD:DP:GQ:PGT:PID:PL:PS   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   0|1:14,2:21:42:0|1:59_G_T:42,0,576:59   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.   ./.:.:.:.:.:.:.:.
VCF GATK SNPs • 675 views

Almost half of the tools that take a VCF as input fail when I run them on this file.

Which error?


It's hard to give an exact error name because most of them are syntax errors or value errors. For example, when I run a script that performs PCA on SNP data from a VCF file (https://rpubs.com/madisondougherty/980777), it fails, apparently because a value in the calculation cannot be computed: Error in apply(x, 2, sd, na.rm = TRUE) : dim(X) must have a positive length

I know it is hard to spot the faulty part, but can you see anything strange in the order of the INFO fields (DP, AC, ExcessHet, and so on) or anywhere else?
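
One thing that is visible in the record pasted in the question is that only one sample column carries a call (0|1) while every other sample is ./. (missing). The header shows the file came from bcftools merge, and missing genotypes like this are what a merge typically writes for samples whose input file has no record at that site. Whether this is what upsets the PCA script is only a guess, but it is easy to measure. Below is a minimal Python sketch that reports the fraction of missing genotypes in one data line; the function name `missing_fraction` and the shortened four-sample record are made up for illustration, and the code assumes tab-separated columns as in a real VCF.

```python
def missing_fraction(vcf_line: str) -> float:
    """Return the fraction of sample columns whose GT is missing (e.g. './.')."""
    fields = vcf_line.rstrip("\n").split("\t")
    samples = fields[9:]              # columns 10+ hold the per-sample data
    if not samples:
        raise ValueError("record has no sample columns")
    gt_index = fields[8].split(":").index("GT")   # position of GT in FORMAT

    def is_missing(sample: str) -> bool:
        parts = sample.split(":")
        gt = parts[gt_index] if gt_index < len(parts) else "."
        # Treat phased and unphased separators alike, then test every allele.
        alleles = gt.replace("|", "/").split("/")
        return all(allele == "." for allele in alleles)

    return sum(is_missing(s) for s in samples) / len(samples)


# Hypothetical record: same layout as the one in the question, cut to 4 samples.
record = "\t".join([
    "NC_054691.1", "59", ".", "G", "T", "34.64", "PASS", "DP=23",
    "GT:AD:DP:GQ:PGT:PID:PL:PS",
    "./.:.:.:.:.:.:.:.",
    "0|1:14,2:21:42:0|1:59_G_T:42,0,576:59",
    "./.:.:.:.:.:.:.:.",
    "./.:.:.:.:.:.:.:.",
])
print(missing_fraction(record))  # 3 of 4 samples are missing
```

Running this over every data line of the merged file would show how many sites are called in only a handful of samples, which is worth knowing before handing the file to a tool that filters on missingness.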


That's not a VCF parsing error.


Run your VCF file through bcftools view; if it passes through that, then your VCF is likely valid.

It may still lack information that some other tools want, but that is a different problem altogether.
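
To give a rough idea of the kind of structural problem a parser would complain about, here is a small Python sketch. It is not a substitute for bcftools view and checks only two things: that every data line has as many tab-separated columns as the #CHROM line, and that no sample field has more colon-separated values than FORMAT has keys. The function name `check_vcf_structure` and the toy lines are assumptions for illustration only.

```python
def check_vcf_structure(lines):
    """Return a list of human-readable problems found in VCF text lines."""
    problems = []
    n_cols = None
    for num, raw in enumerate(lines, start=1):
        line = raw.rstrip("\n")
        if not line or line.startswith("##"):
            continue                          # skip blank and meta lines
        if line.startswith("#CHROM"):
            n_cols = len(line.split("\t"))    # expected column count
            continue
        if n_cols is None:
            problems.append(f"line {num}: data line before #CHROM header")
            continue
        fields = line.split("\t")
        if len(fields) != n_cols:
            problems.append(
                f"line {num}: {len(fields)} columns, header has {n_cols}")
            continue
        if n_cols > 9:
            n_keys = len(fields[8].split(":"))
            for sample in fields[9:]:
                # Trailing values may be dropped, but extras are an error.
                if len(sample.split(":")) > n_keys:
                    problems.append(
                        f"line {num}: sample field '{sample}' has more "
                        f"values than FORMAT has keys")
    return problems


# Toy input: one well-formed record and one record missing a sample column.
header = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2"
good = "chr1\t10\t.\tA\tC\t50\tPASS\t.\tGT:DP\t0/1:12\t./.:."
bad = "chr1\t20\t.\tG\tT\t50\tPASS\t.\tGT:DP\t0/1:12"
for problem in check_vcf_structure(["##fileformat=VCFv4.2", header, good, bad]):
    print(problem)
```

A file that passes a check like this, and more importantly passes bcftools view, is structurally fine; errors raised later by an analysis script then point at the content (for example missing genotypes) or at the script rather than at the VCF format.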
