Problem with VCF file
I have a problem with a VCF file: some lines contain a dash (-) in one or two columns.
I have tried to remove or replace them, but I couldn't get it to work. Could anyone please help with that?
Thanks in advance,
Those offending lines are all indels, written in an old style that omits the sequence context. VCF expects the REF and ALT alleles of an indel to include a flanking reference base (normally the base immediately before the event), so to reformat them correctly you need to look up the corresponding position in the reference genome and find the bases in context. Alternatively, if indels are not critical for your analysis, you can simply filter them all out:
zgrep -v in-del vcf_chr_33.vcf.gz > chr33.snp.vcf
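If the affected lines do not all carry the text "in-del", a column-based filter may be more reliable than matching on that string. The following is only a sketch, assuming a standard tab-separated VCF with REF in column 4 and ALT in column 5; the function name drop_dash_alleles is hypothetical and chosen for illustration.

```shell
# Hypothetical helper: reads uncompressed VCF text on stdin, writes to stdout.
# Header lines (starting with '#') pass through unchanged; data lines are kept
# only when neither REF (column 4) nor ALT (column 5) contains a '-'.
drop_dash_alleles() {
    awk -F'\t' '/^#/ || ($4 !~ /-/ && $5 !~ /-/)'
}

# Example usage with the file names from the command above:
# zcat vcf_chr_33.vcf.gz | drop_dash_alleles > chr33.snp.vcf
```

Note that this drops every record with a dash allele rather than repairing it, so it is only appropriate when the indels are not needed downstream.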
cat 00-All.vcf |sed -r 's|SERPINB10 CPOX|SERPINB10_CPOX|; s|SET domain containing 5|SETD5|;' >check_all.vcf
sed -e 's/SET domain containing /SETdomaincontaining/g' check_all.vcf > test252.vcf
tabix -p vcf test252.vcf.gz
vcf-sort Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.dict test252.vcf.gz > dbsnp_sorted.vcf.gz
java -d64 -Xmx48g -jar /home/mbxao2/R-drive/tools/GATK/GenomeAnalysisTK.jar -T ValidateVariants -R Gallus_gallus.Gallus_gallus-5.0.dna.chromosome.1.fa -V dbsnp_sorted.vcf.gz --validationTypeToExclude ALL
At the final stage (the GATK ValidateVariants command), the error appears every time.