Probable bug in bcftools while parsing headers
0
0
Entering edit mode
3.0 years ago

Hello! I've been trying to generate consensus sequence from .bam file using bcftools and vcftools. The command for which is:

samtools-1.3/samtools mpileup -C50 -uf Eucalyptus_Gene_Sequences.fasta 133.bam.sorted.bam | bcftools-1.9/bcftools call -c | bcftools-1.9/vcfutils.pl vcf2fq -d 8 -D 100 > out.fastq


However it throws an error:

Could not parse the header line: "##contig=<ID=Eucgr.F02756.1_MYB52,BW52,length=1872>"
Could not parse the header line: "##contig=<ID=Eucgr.F03879.1_EMB14,EMB177,EMB33,SUS2,length=7745>"
Could not parse the header line: "##contig=<ID=Eucgr.F03935.1_EMB14,EMB177,EMB33,SUS2,length=7556>"
Could not parse the header line: "##contig=<ID=Eucgr.F04263.1_GUX3,PGSIP2,length=2250>"
Could not parse the header line: "##contig=<ID=Eucgr.G00889.1_REM39,VRN1,length=1879>"
Could not parse the header line: "##contig=<ID=Eucgr.G01977.1_GUT2,IRX10,length=2457>"


All of these contigs have a comma in their name and bcftools cannot parse these. What is a workaround this?

https://github.com/samtools/bcftools/issues/266 reports this but this has not been fixed in the latest version as well (1.9) .

--Thanks

bug bcftools • 2.0k views
0
Entering edit mode

How have you constructed your Eucalyptus_Gene_Sequences.fasta file? Does this come from an Egrandis reference FASTA file that we can download (what is the URL?), and what processing (if any) did you do after downloading to make the Eucalyptus_Gene_Sequences.fasta you're using?

0
Entering edit mode

Hello,

does it work if you quote the ID's in the header like this?

##contig=<ID="Eucgr.F02756.1_MYB52,BW52",length=1872>


fin swimmer

1
Entering edit mode

Most likely yes, that will work with bcftools — but there is an extent to which that is a bcftools extension rather than something that is guaranteed in the format, which is vague about quoting strings. (So it may or may not help for other tools reading your VCF.) Moreover, we are moving to forbid commas in particular in reference contig names in SAM and VCF — see https://github.com/samtools/hts-specs/pull/333 et al, hence my question about where this file comes from.

Eucalyptus grandis reference files from ftp://ftp.plantgenie.org/Data/EucGenIE/Eucalyptus_grandis/v2.0/FASTA/ contain header lines like

>Eucgr.F02756.1 pacid=32056769 locus=Eucgr.F02756 ID=Eucgr.F02756.1.v2.0 annot-version=v2.0


which ought to lead to contig names like Eucgr.F02756.1 as the (vague) convention is to stop at whitespace when picking reference sequence names out of FASTA headers. It rather looks like the OP has instead converted spaces to underscores and fallen victim to the comma in MYB52,BW52 in what should have been informally a metadata field rather than part of the contig name. (But I haven't yet found a eucalyptus reference file with MYB52 in it, hence the question about where this comes from.)

So either the OP should be truncating the contig names at the space/underscore, or whoever generated the file should probably be converting the commas to underscores or some other character too. How much of a problem this is depends on whether it is the OP doing it locally, or the eucalyptus reference people doing it for reference files made publicly available. The SAM & VCF format folks would particularly like to know about the latter if it is the case, as that could affect this sensible small restriction in the formats.

0
Entering edit mode

Hello John Marshall ,

Moreover, we are moving to forbid commas in particular in reference contig names in SAM and VCF — see https://github.com/samtools/hts-specs/pull/333 et al

nice to see this. I absolutely agree with this idea. With vcf4.3 it already started that the INFOand 'FORMATtags must much the regex^[A-Za-z ][0-9A-Za-z .]*\$. It would be consequent to do this for contig names as well.

fin swimmer

0
Entering edit mode

This thread is nearly 3 years old, and I'm getting the same error, so if indeed it is a bug it still hasn't been corrected (to my knowledge, I installed the most recent version)

0
Entering edit mode

Since March 2020, commas are disallowed in VCF (v4.3) CONTIG IDs. Whether you are using v4.3 or an earlier version, tools such as bcftools are likely to take advantage of the restriction on commas — so for an easy life you should avoid putting commas in any of your VCF CONTIG IDs.

You don't show your headers, so we don't know if they are exhibiting the same problem as the OP's comma-containing IDs like Eucgr.F02756.1_MYB52,BW52`. If this is the problem, then it is not a bug — it is a sensible limitation in the VCF format and tools, one that you need to deal with.