Question

bcftools consensus still returns "Could not parse the header" error

2.7 years ago

shpak.max ▴ 50

I attempted to create a consensus fasta file using bcftools, i.e.

bgzip -c All_SRR_SNP_Clean.vcf > All_SRR_SNP_Clean.vcf.gz
tabix All_SRR_SNP_Clean.vcf.gz
cat $ref| bcftools consensus $vcf_dir/All_SRR_SNP_Clean.vcf.gz > consensus.fasta

where $ref is the path to a Drosophila reference genome fa and the vcf was generated from an mpileup combining 4 different poolseq samples.

I get a parse error message:

[W::bcf_hdr_register_hrec] The type "FLoat" is not supported, assuming "String"
[W::bcf_hdr_parse] Could not parse header line: #CHROM       POS       ID            REF       ALT       QUAL       FILTER       INFO       FORMAT        SRR5647735.1.realign.bam  SRR8439151.1.realign.bam    SRR8439156.1.realign.bam
[E::bcf_hdr_parse] Could not parse the header, sample line not found
Failed to read from /home/mshpak/Lundflies/bams/unsorted/round1/VCF/Three_Files/All_SRR_SNP_Clean.vcf.gz: could not parse header

Several threads from 2-3 years ago referenced similar errors using bcftools, e.g.

Probable bug in bcftools while parsing headers

but they don't indicate a satisfactory resolution. As I have the most recent version of bcftools, it doesn't seem like the problem has been corrected, so is there a patch or work-around available?

bcftools samtools • 6.4k views

ADD COMMENT • link 2.7 years ago by shpak.max ▴ 50

Please post the header lines; if the header entries are huge in number, host the file somewhere. Try to address the reported issues, such as: [E::bcf_hdr_parse] Could not parse the header, sample line not found. I also do not understand this path: Failed to read from /home/mshpak/Lundflies/bams/unsorted/round1 /VCF/Three_Files/All_SRR_SNP_Clean.vcf.gz: could not parse header (there is a gap between directories). I am not sure whether this is a typo or whether the input to bcftools really looks like that.
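For reference, one way to pull out just the header lines for posting is sketched below. It assumes a POSIX shell with awk available; vcf_header is a hypothetical helper name, not part of bcftools. (bcftools view -h normally prints the header, but it would presumably stop at the same parse error here.)

```shell
# Hypothetical helper: print only the header of a VCF read from stdin,
# i.e. every leading line that starts with "#", stopping at the first
# data record so large files are not read to the end.
vcf_header() {
    awk '/^#/ { print; next } { exit }'
}
```

Example use: gzip -dc All_SRR_SNP_Clean.vcf.gz | vcf_header > header.txt (header.txt is an arbitrary output name; bgzip output is gzip-compatible, so gzip -dc can read it).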

ADD REPLY • link 2.7 years ago by cpad0112 21k

The break in between directories in the path name was a formatting error in my post, not an error in the script.

The vcfs were generated using PoolSNP and are fairly standard in their format, e.g. commented lines followed by:

#CHROM       POS       ID       REF       ALT       QUAL       FILTER       INFO     FORMAT       SRR5647735.1.realign.bam  SRR8439151.1.realign.bam    SRR8439156.1.realign.bam
1   41319   .   C   T   .   .   ADP=18.666666666666668;NC=0 GT:RD:AD:DP:FREQ    0/1:14:3:17:0.18    0/1:18:6:24:0.25    0/1:7:8:15:0.53

where SRR...realign.bams are 3 source bam files for the mpileup that I used.

As far as I can tell, the VCF is in the standard format used by bcftools convert (rather than GATK's VCF format).

ADD REPLY • link 2.7 years ago by shpak.max ▴ 50

Check whether this tutorial by @finswimmer on consensus generation with bcftools is helpful, and check your files. If you still face issues and are confident that you are doing everything right and the program is misbehaving, please reach out to the developers. The devs for this tool are responsive and user friendly, IMO. With the data you have furnished here, it is not possible (for me) to understand what is going on.

ADD REPLY • link 2.7 years ago by cpad0112 21k

VCF is a bit odd in that those "commented lines" aren't comments! They are the headers it is complaining about.

Just because they've been produced using a standard tool doesn't necessarily mean they are correct. :-) It could be an error from either PoolSNP or Bcftools, but without being able to see the data it's impossible to tell where the problem lies.

For what it's worth "sample line not found" appears to be printed when it fails to find "#CHROM\tPOS", but it's a bit convoluted so it may also be a bail out from earlier parsing. (Also note the tab. Please double check it's a tab in your file and not spaces. That's not something we can tell in this medium.)
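A quick way to check the tab question is sketched below, assuming a POSIX shell with awk. check_chrom_line and fix_chrom_line are hypothetical helper names, and the second one assumes that sample names contain no spaces, since it turns every run of spaces or tabs on the #CHROM line into a single tab.

```shell
# Hypothetical helper: report whether the #CHROM line of a VCF read from
# stdin starts with the eight fixed columns separated by single tabs.
check_chrom_line() {
    awk '
        /^#CHROM/ {
            found = 1
            if ($0 ~ /^#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO/)
                print "tab-delimited"
            else
                print "not tab-delimited"
            exit
        }
        END { if (!found) print "no #CHROM line" }
    '
}

# Hypothetical helper: rewrite only the #CHROM line so that runs of
# spaces/tabs become single tabs; all other lines pass through unchanged.
# Assumes sample names do not themselves contain spaces.
fix_chrom_line() {
    awk '
        /^#CHROM/ { gsub(/[ \t]+/, "\t"); sub(/\t$/, "") }
        { print }
    '
}
```

Example use: gzip -dc All_SRR_SNP_Clean.vcf.gz | check_chrom_line. If the line turns out not to be tab-delimited, something like gzip -dc All_SRR_SNP_Clean.vcf.gz | fix_chrom_line | bgzip -c > fixed.vcf.gz (fixed.vcf.gz being an arbitrary name) would rewrite it. Data records delimited by spaces would need a separate fix.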

ADD REPLY • link 2.7 years ago by jkbonfield ★ 1.2k

I verified that the data fields are indeed delimited by tabs (\t) rather than spaces, so something else may be wrong with the PoolSNP output format. (I don't experience this issue when using bcftools on GATK-generated VCFs.)
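With the tabs ruled out, the first warning in the original log (The type "FLoat" is not supported) may be worth following up. The VCF specification defines the Type values Integer, Float, Flag, Character and String, spelled exactly that way. The sketch below, assuming a POSIX shell with awk, lists INFO/FORMAT header lines whose Type is anything else. find_bad_types is a hypothetical helper name. Whether this is connected to the fatal "sample line not found" error is not established here, since bcftools reports it only as a warning.

```shell
# Hypothetical helper: read VCF text on stdin and print INFO/FORMAT header
# lines whose Type= value is not one of the types defined by the VCF spec.
find_bad_types() {
    awk '
        !/^#/ { exit }    # stop at the first data record
        /^##(INFO|FORMAT)=/ && /Type=/ &&
        !/Type=(Integer|Float|Flag|Character|String)[,>]/ { print }
    '
}
```

Example use: gzip -dc All_SRR_SNP_Clean.vcf.gz | find_bad_types. Any line it prints could then be corrected in a copy of the file (for instance by changing Type=FLoat to Type=Float) before re-running bgzip, tabix and bcftools consensus.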

ADD REPLY • link 2.7 years ago by shpak.max ▴ 50