UnicodeDecodeError while reading vcf file
22 months ago
Medhat 8.9k

I previously asked this question here: Reading vcf file using python gives UnicodeDecodeError

and the answer was to use vcftools instead of bcftools when merging the files (the file I was trying to read is a merged file). At the time, this fixed my issue and I moved on.

I revisited the files to find out why one of them worked and the other did not.

I found that when bcftools is used to merge the three files, it gives this warning: [W::bcf_hdr_merge] Trying to combine "PS" tag definitions of different types

As a result, the file contains strange characters:

GT:GQ:DP:GL:PS:ADALL:AD 0|1:415:.:-57.1927,0,-41.4646:<8D>^L:.:.  
GT:GQ:DP:GL:PS:ADALL:AD 1|0:346:.:-56.3461,0,-34.6188:<8D>^L:.:.      
 ./.:.:.:.:^A:.:.
  

As you can see, <8D>^L and ^A should not be in the output.

When I looked at the same locations in the file merged with vcftools:
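To locate records like these before a parser fails on them, a minimal Python sketch follows. It assumes a plain-text (uncompressed) VCF, and the function name find_suspect_lines is hypothetical. Note that only a byte such as <8D> is invalid UTF-8 and triggers UnicodeDecodeError; ^L and ^A decode without error, so the sketch also flags control characters other than tab.

```python
def find_suspect_lines(path):
    """Return (line_number, reason) pairs for lines that look corrupted.

    A line is flagged if it is not valid UTF-8, or if it decodes but
    contains a control byte other than tab (for example ^A or ^L).
    """
    suspects = []
    # Read in binary mode so that invalid bytes do not abort the scan.
    with open(path, "rb") as handle:
        for number, raw in enumerate(handle, start=1):
            try:
                raw.decode("utf-8")
            except UnicodeDecodeError as err:
                suspects.append((number, f"invalid UTF-8 at byte {err.start}"))
                continue
            body = raw.rstrip(b"\r\n")
            if any(byte < 0x20 and byte != 0x09 for byte in body):
                suspects.append((number, "control character"))
    return suspects
```

Calling find_suspect_lines on the merged file would list the line numbers to inspect; it does not repair anything.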

GT:GQ:DP:GL:PS:ADALL:AD  0|1:.:.:.:-57.192706988686034,0.0,-41.46462464424806:415:68749
  

Is there a way to change this behavior in bcftools, or is it an issue in the implementation?

Thanks

SNP vcf bcftools merge • 429 views

What are the definitions of ##FORMAT=<ID=PS..... in the headers of BOTH VCF files?

22 months ago
Medhat 8.9k

The file produced by bcftools:

##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">

The file produced by vcftools:

##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set in which this variant falls">

__Update__

The answer, as suggested by Pierre Lindenbaum, is to run the command below on the file before merging:

sed 's/ID=PS,Number=1,Type=Integer,Descri/ID=PS,Number=1,Type=String,Descri/'
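For anyone already processing the files in Python, the same header rewrite can be sketched as below. This is an illustration and not part of the original suggestion: the name retype_ps_header is hypothetical, it assumes the lines come from a plain-text VCF, and unlike the sed command it only touches header lines starting with ##.

```python
import re

# Matches the PS FORMAT definition only when it is declared as Integer.
PS_INTEGER = re.compile(r"^(##FORMAT=<ID=PS,Number=1,Type=)Integer(,)")


def retype_ps_header(lines):
    """Yield VCF lines with the PS FORMAT header changed from Integer to String.

    Data records and all other header lines pass through unchanged.
    """
    for line in lines:
        if line.startswith("##"):
            line = PS_INTEGER.sub(r"\1String\2", line)
        yield line
```

The generator can be fed an open file handle and its output written to a new file, which is then used as input for the merge.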

So the definitions are not the same. I would use sed to convert the first file:

sed 's/ID=PS,Number=1,Type=Integer,Descri/ID=PS,Number=1,Type=String,Descri/'

What I wrote are the header lines from the merged files produced by each tool, and as you can see, the file produced by bcftools is corrupted. Using sed on that file will not fix the corrupted results.

In case you were asking about the files that were merged:

One of the files contains: ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">. The other two files do not contain this field.
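A quick way to compare the PS definitions across the input files before merging is sketched below. It assumes plain-text VCFs; the function name ps_format_lines and the file names in the usage note are hypothetical.

```python
def ps_format_lines(path):
    """Return the ##FORMAT=<ID=PS,...> header lines found in a plain-text VCF."""
    found = []
    # errors="replace" keeps the scan going even if the file has stray bytes.
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if not line.startswith("##"):
                break  # meta-information lines end at the #CHROM line
            if line.startswith("##FORMAT=<ID=PS,"):
                found.append(line.rstrip("\n"))
    return found
```

Running it over each input, for example ps_format_lines("a.vcf"), shows which files define PS and with which Type; an empty list means the file has no PS definition.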


Please change that to an answer. It worked: I changed the file header before merging and then did the merge, and there were no more strange characters.
