Question

Failed to open plain.vcf: unknown file type

1

Entering edit mode

3.4 years ago

obidobi ▴ 30

Hello!

I want to index, and do some other stuff with my .vcf.gz file.

1-) I wanted to index it and got an error:

[E::vcf_parse_format] Invalid character '.' in 'GQ' FORMAT

So, I unzipped the file, changed the header, bgzipped it.

gunzip -c my_file.vcf.gz | sed 's/^##fileformat=VCFv4.1/##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">/' | bgzip > 2my_file.vcf.gz && tabix -f -p vcf 2my_file.vcf.gz

2-) Then, I wanted to extract SNPs and do some other stuff via this script:

mkdir input 

bcftools norm -Ov -m-snps 2my_file.vcf.gz | bcftools norm -Ov --check-ref w -f reference.fasta > input/norm.vcf.gz

but got an error like:

Failed to open 2my_file.vcf.gz: unknown file type

and I checked the file type with htsfile and got:

2my_file.vcf.gz: SAM version 1 BGZF-compressed sequence data

So, I intended to do:

cp 2my_file.vcf.gz plain.vcf 

bcftools view -Oz -o compressed.vcf.gz plain.vcf 

htsfile compressed.vcf.gz plain.vcf

 bcftools index compressed.vcf.gz

but it stuck on bcftools view line and throws an error like:

Failed to open plain.vcf: unknown file type

So, what else I can do? I'll be appreciated if you can share your ideas! Thanks!

PS: The reason why I used fasta file with bcftools norm is it also throws an error with sample names, so I found a solution like this in case you wonder.

vcf bcftools vcftools bgzip • 6.6k views

ADD COMMENT • link updated 13 months ago by Ram 43k • written 3.4 years ago by obidobi ▴ 30

score 1 · Answer 1 · 2020-12-03

1

Entering edit mode

3.4 years ago

Jorge Amigo 14k

Your first sed substitutes the first fileformat line for the FORMAT one. Your

sed 's/^##fileformat=VCFv4.1/##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">/'

should be instead

sed 's/^##fileformat=VCFv4.1/##fileformat=VCFv4.1\n##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">/'

or even better this one in order to see your change at the bottom of the header

sed 's/^#CHROM/##FORMAT=<ID=GQ,Number=1,Type=Float,Description="Genotype Quality">\n#CHROM/'

ADD COMMENT • link 3.4 years ago by Jorge Amigo 14k

0

Entering edit mode

Hey!

Thanks, the first solution worked but now this:

Error at scaffold95:5127, the tag GP has wrong number of fields

Here is the line in interest from the header:

##FORMAT=<ID=GP,Number=G,Type=Integer,Description="Normalized, Phred-scaled posteriors for genotypes as defined in the VCF specification">

Do you have any idea about this too?

PS: The second solution gave this error in case you wonder:

[E::get_intv] Failed to parse TBX_VCF, was wrong -p [type] used?
The offending line was: "^#CHROM    POS ID  REF ALT QUAL    FILTER  INFO    FORMAT  ANG5    BCNBNC3 BER1    BS1 CD166   CHE100x CHI3b   CLO4    CR171   CR6-7   CRE2    CRE5    CTW6    CVB4    CVD8    FHL5    FOL4    GRV2    GRV7    HAM4    HAT4    HV1 ISL8    JAS5    LAY3    LAY5    LAY6    LYN4    MAD1    MAD36   MARB    MCK5    MOD8    MOH3    MTR3    NIC6    NOV4    OKG3    ORI8    ORO4    PAT3    PEN511  PRI3    PSK4    PT5 RDD7    RED1    ROV3    SCL9    SPRE2   SUN5    SW786   TAC8    UKI5    WHR7    WIL3    WLT2    WOO4"
#

ADD REPLY • link 3.4 years ago by obidobi ▴ 30

0

Entering edit mode

The error you're seeing is due to the content of that tag on a particular line, so if you find the offending line you'll be able to trace the error. The definition of the GP tag in the header seems fine, although I'm not used to that tag and it could be a problem if you define it as integer while the VCF format documentation describes it as float.

GP : the phred-scaled genotype posterior probabilities (and otherwise defined precisely as the GL field); intended to store imputed genotype probabilities (Floats)

Regarding the second solution, it just had that extra ^ at the beginning of the line in the second section of the sed substitution by mistake. I've corrected it.

ADD REPLY • link 3.4 years ago by Jorge Amigo 14k

0

Entering edit mode

Nailed it like:

gunzip -c my_file.vcf.gz | sed 's/^##fileformat=VCFv4.1/##fileformat=VCFv4.1\n##FORMAT=<id=gq,number=1,type=float,description="genotype quality"="">/'| sed 's/##FORMAT=<id=gp,number=g,type=integer,description="normalized, phred-scaled="" posteriors="" for="" genotypes="" as="" defined="" in="" the="" vcf="" specification"="">/##FORMAT=<id=gp,number=.,type=integer,description="normalized, phred-scaled="" posteriors="" for="" genotypes="" as="" defined="" in="" the="" vcf="" specification"="">/' | bgzip > 2my_file.vcf.gz && tabix -f -p vcf 2my_file.vcf.gz

I guess all issues related to "the wrong number of fields" can be fixed like changing Number=what_ever_it_is to Number=.

ADD REPLY • link 3.4 years ago by obidobi ▴ 30