Question: UnicodeDecodeError while reading vcf file
Asked 7 days ago by Medhat (Texas):

I previously asked this question here: Reading vcf file using python gives UnicodeDecodeError

The answer was to use vcftools instead of bcftools when merging files (the file I was trying to read is a merged file). At the time, this fixed my issue and I moved on.

I have now revisited the files to find out why one merged file worked and the other did not.

I found that when I used bcftools to merge the three files, it gave this warning: [W::bcf_hdr_merge] Trying to combine "PS" tag definitions of different types

As a result the file contains strange characters

GT:GQ:DP:GL:PS:ADALL:AD 0|1:415:.:-57.1927,0,-41.4646:<8D>^L:.:.  
GT:GQ:DP:GL:PS:ADALL:AD 1|0:346:.:-56.3461,0,-34.6188:<8D>^L:.:.      
 ./.:.:.:.:^A:.:.
  

As you can see <8D>^L and ^A should not be in the output.
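One way to locate such records without triggering a UnicodeDecodeError is to read the file as bytes and flag PS values that contain control or non-ASCII bytes. The following is only a sketch: the function name `find_suspect_ps_values` and the demo records are made up for illustration, and it assumes an uncompressed, tab-delimited VCF in which the corrupted bytes do not themselves contain a tab or colon.

```python
def find_suspect_ps_values(lines):
    """Yield (line_number, sample_index, raw_value) for every PS value
    that contains a control byte or a non-ASCII byte.

    `lines` is any iterable of bytes, e.g. a file opened with mode "rb".
    """
    for line_no, raw in enumerate(lines, start=1):
        if raw.startswith(b"#"):
            continue  # skip header lines
        fields = raw.rstrip(b"\r\n").split(b"\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        keys = fields[8].split(b":")
        if b"PS" not in keys:
            continue
        ps_idx = keys.index(b"PS")
        for sample_idx, sample in enumerate(fields[9:]):
            values = sample.split(b":")
            if ps_idx >= len(values):
                continue  # trailing fields may be dropped in VCF
            value = values[ps_idx]
            # Printable ASCII is 0x20..0x7E; anything else is suspicious here.
            if any(b < 0x20 or b > 0x7E for b in value):
                yield line_no, sample_idx, value


# Made-up records that mimic the corrupted output shown above.
demo = [
    b"##fileformat=VCFv4.2\n",
    b"chr1\t100\t.\tA\tG\t.\t.\t.\tGT:GQ:DP:GL:PS:ADALL:AD"
    b"\t0|1:415:.:-57.1927,0,-41.4646:\x8d\x0c:.:."
    b"\t./.:.:.:.:\x01:.:.\n",
    b"chr1\t200\t.\tC\tT\t.\t.\t.\tGT:PS\t0|1:68749\n",
]

hits = list(find_suspect_ps_values(demo))
for hit in hits:
    print(hit)
```

For a real file, the same function could be called on a handle opened with `open(path, "rb")`; a bgzip-compressed file would need `gzip.open(path, "rb")` instead.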

Here is the same position in the file merged with vcftools:

GT:GQ:DP:GL:PS:ADALL:AD  0|1:.:.:.:-57.192706988686034,0.0,-41.46462464424806:415:68749
  

Is there a way to change this behavior in bcftools, or is it an implementation issue?

Thanks

Tags: snp, bcftools, merge, vcf

What is the definition of ##FORMAT=<ID=PS..... in the headers of BOTH VCF files?

Comment written 7 days ago by Pierre Lindenbaum
Answer by Medhat (Texas):

The file produced by bcftools:

##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">

The file produced by vcftools:

##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set in which this variant falls">

__Update__

The answer, as suggested by Pierre Lindenbaum, is to run the command below on the input file before merging:

sed 's/ID=PS,Number=1,Type=Integer,Descri/ID=PS,Number=1,Type=String,Descri/'
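For anyone already preprocessing VCF files in Python, the same header rewrite can be sketched as below. This is not part of the thread's solution, just an assumed equivalent: the name `retype_ps_header` is hypothetical, and unlike the sed command it only touches lines that start with `##FORMAT=<ID=PS,`, leaving data lines alone.

```python
import re

# Matches the PS FORMAT definition when it is declared as Integer.
PS_INTEGER = re.compile(r"^(##FORMAT=<ID=PS,Number=1,Type=)Integer(,)")


def retype_ps_header(lines):
    """Yield the lines of a VCF, changing the PS FORMAT type from
    Integer to String in the header. All other lines pass through unchanged."""
    for line in lines:
        if line.startswith("##FORMAT=<ID=PS,"):
            line = PS_INTEGER.sub(r"\1String\2", line, count=1)
        yield line


# Minimal demo header; the PS line is the one quoted in this thread.
header = [
    "##fileformat=VCFv4.2\n",
    '##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">\n',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsample1\n",
]

fixed = list(retype_ps_header(header))
print(fixed[1], end="")
```

The rewrite has to be applied to the input file that declares PS as Integer, before the merge, not to the already merged output.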
Modified 4 days ago, written 7 days ago by Medhat

So the definitions are not the same. I would use sed to convert the first file:

sed 's/ID=PS,Number=1,Type=Integer,Descri/ID=PS,Number=1,Type=String,Descri/'
Comment modified 7 days ago, written 7 days ago by Pierre Lindenbaum

What I posted are the PS header lines from the merged files produced by each tool, and as you can see, the file produced by bcftools is corrupted. Running sed on that merged file will not repair the corrupted values.

In case you were asking about the input files that were merged:

One of the files contains ##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">; the other two files do not contain this field.
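A quick way to answer this kind of question before merging is to compare the declared FORMAT types across the input headers. The sketch below is illustrative only: the helper names and the file names `a.vcf`, `b.vcf`, `c.vcf` are hypothetical, and the demo headers are invented (they are not the actual inputs from this thread).

```python
import re

# Captures the tag ID and its declared Type from a ##FORMAT header line.
FORMAT_TYPE = re.compile(r"^##FORMAT=<ID=([^,>]+),.*?Type=([A-Za-z]+)")


def format_types(header_lines):
    """Return {tag: type} for the ##FORMAT lines of one VCF header."""
    types = {}
    for line in header_lines:
        if not line.startswith("##"):
            break  # reached the #CHROM line or the data records
        match = FORMAT_TYPE.match(line)
        if match:
            types[match.group(1)] = match.group(2)
    return types


def conflicting_tags(headers_by_file):
    """Return {tag: {file: type}} for tags declared with more than one type."""
    seen = {}
    for name, lines in headers_by_file.items():
        for tag, typ in format_types(lines).items():
            seen.setdefault(tag, {})[name] = typ
    return {
        tag: by_file
        for tag, by_file in seen.items()
        if len(set(by_file.values())) > 1
    }


# Invented headers for three hypothetical input files.
headers = {
    "a.vcf": ['##FORMAT=<ID=PS,Number=1,Type=Integer,Description="Phase set identifier">\n'],
    "b.vcf": ['##FORMAT=<ID=PS,Number=1,Type=String,Description="Phase set">\n'],
    "c.vcf": ['##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">\n'],
}

conflicts = conflicting_tags(headers)
print(conflicts)
```

Any tag reported here is a candidate for the kind of header edit discussed in this thread.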

Comment modified 7 days ago, written 7 days ago by Medhat

Please change that to an answer. It worked: I changed the file header before merging, then merged the files, and the strange characters were gone.

Comment written 5 days ago by Medhat