Question: Probable bug in bcftools while parsing headers
0
gravatar for shubhra.bhattacharya
6 weeks ago by
shubhra.bhattacharya70 wrote:

Hello! I've been trying to generate consensus sequence from .bam file using bcftools and vcftools. The command for which is:

samtools-1.3/samtools mpileup -C50 -uf Eucalyptus_Gene_Sequences.fasta 133.bam.sorted.bam | bcftools-1.9/bcftools call -c | bcftools-1.9/vcfutils.pl vcf2fq -d 8 -D 100 > out.fastq

However it throws an error:

Could not parse the header line: "##contig=<ID=Eucgr.F02756.1_MYB52,BW52,length=1872>"
Could not parse the header line: "##contig=<ID=Eucgr.F03879.1_EMB14,EMB177,EMB33,SUS2,length=7745>"
Could not parse the header line: "##contig=<ID=Eucgr.F03935.1_EMB14,EMB177,EMB33,SUS2,length=7556>"
Could not parse the header line: "##contig=<ID=Eucgr.F04263.1_GUX3,PGSIP2,length=2250>"
Could not parse the header line: "##contig=<ID=Eucgr.G00889.1_REM39,VRN1,length=1879>"
Could not parse the header line: "##contig=<ID=Eucgr.G01977.1_GUT2,IRX10,length=2457>"

All of these contigs have a comma in their name and bcftools cannot parse these. What is a workaround this?

https://github.com/samtools/bcftools/issues/266 reports this but this has not been fixed in the latest version as well (1.9) .

--Thanks

bug bcftools • 145 views
ADD COMMENTlink written 6 weeks ago by shubhra.bhattacharya70

How have you constructed your Eucalyptus_Gene_Sequences.fasta file? Does this come from an Egrandis reference FASTA file that we can download (what is the URL?), and what processing (if any) did you do after downloading to make the Eucalyptus_Gene_Sequences.fasta you're using?

ADD REPLYlink written 6 weeks ago by John Marshall1.4k

Hello,

does it work if you quote the ID's in the header like this?

##contig=<ID="Eucgr.F02756.1_MYB52,BW52",length=1872>

fin swimmer

ADD REPLYlink written 6 weeks ago by finswimmer6.9k
1

Most likely yes, that will work with bcftools — but there is an extent to which that is a bcftools extension rather than something that is guaranteed in the format, which is vague about quoting strings. (So it may or may not help for other tools reading your VCF.) Moreover, we are moving to forbid commas in particular in reference contig names in SAM and VCF — see https://github.com/samtools/hts-specs/pull/333 et al, hence my question about where this file comes from.

Eucalyptus grandis reference files from ftp://ftp.plantgenie.org/Data/EucGenIE/Eucalyptus_grandis/v2.0/FASTA/ contain header lines like

>Eucgr.F02756.1 pacid=32056769 locus=Eucgr.F02756 ID=Eucgr.F02756.1.v2.0 annot-version=v2.0

which ought to lead to contig names like Eucgr.F02756.1 as the (vague) convention is to stop at whitespace when picking reference sequence names out of FASTA headers. It rather looks like the OP has instead converted spaces to underscores and fallen victim to the comma in MYB52,BW52 in what should have been informally a metadata field rather than part of the contig name. (But I haven't yet found a eucalyptus reference file with MYB52 in it, hence the question about where this comes from.)

So either the OP should be truncating the contig names at the space/underscore, or whoever generated the file should probably be converting the commas to underscores or some other character too. How much of a problem this is depends on whether it is the OP doing it locally, or the eucalyptus reference people doing it for reference files made publicly available. The SAM & VCF format folks would particularly like to know about the latter if it is the case, as that could affect this sensible small restriction in the formats.

ADD REPLYlink modified 5 weeks ago • written 5 weeks ago by John Marshall1.4k

Hello John Marshall ,

Moreover, we are moving to forbid commas in particular in reference contig names in SAM and VCF — see https://github.com/samtools/hts-specs/pull/333 et al

nice to see this. I absolutely agree with this idea. With vcf4.3 it already started that the INFOand 'FORMATtags must much the regex^[A-Za-z ][0-9A-Za-z .]*$`. It would be consequent to do this for contig names as well.

fin swimmer

ADD REPLYlink written 5 weeks ago by finswimmer6.9k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 826 users visited in the last hour