Automatic parsing of VCF and correcting value types
4.5 years ago

I have a pipeline written in Hail 0.1 for VCF processing:

https://github.com/macarthur-lab/hail-elasticsearch-pipelines

The pipeline can't process VCF files that have 'nan' values in fields of type 'Float'. The workaround I found is to make a fake header that declares those fields as 'String' instead of 'Float' and to use it when loading the VCF into the pipeline:

https://discuss.hail.is/t/vds-summarize-report-error-in-hail-0-1/562/7

So, now I need a way to automate it. I want to make a script that loads a VCF, finds the Float fields that contain 'nan' values, and changes the type of those fields to String. Are there any good tools for this kind of VCF parsing and modification? The only solution I can think of for now is to call the Hail 0.1 methods that raise the exception (such as 'summarize()') in a loop, pull the offending field out of the error message with a regular expression, and switch it from Float to String. That sounds too complex; I would expect easier solutions to exist.
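A minimal Python sketch of the automation described above, working on the VCF text directly rather than through Hail. It assumes the 'nan' values sit in the INFO column and that the header declares those fields with `Type=Float`; the function names `float_fields_with_nan` and `retype_header` are hypothetical and not part of any library.

```python
import re

# Matches header lines such as ##INFO=<ID=AF,Number=A,Type=Float,...>
FLOAT_INFO = re.compile(r'^##INFO=<ID=([^,]+),.*Type=Float')


def float_fields_with_nan(lines):
    """Return the IDs of Float INFO fields that hold at least one 'nan' value."""
    float_ids, bad = set(), set()
    for line in lines:
        if line.startswith('##'):
            m = FLOAT_INFO.match(line)
            if m:
                float_ids.add(m.group(1))
        elif not line.startswith('#'):
            cols = line.rstrip('\n').split('\t')
            if len(cols) < 8:
                continue  # not a complete data line
            for entry in cols[7].split(';'):
                key, _, value = entry.partition('=')
                values = [v.strip().lower() for v in value.split(',')]
                if key in float_ids and 'nan' in values:
                    bad.add(key)
    return bad


def retype_header(lines, ids):
    """Yield the lines, with Type=Float changed to Type=String for the given IDs."""
    for line in lines:
        m = FLOAT_INFO.match(line)
        if m and m.group(1) in ids:
            line = line.replace('Type=Float', 'Type=String', 1)
        yield line
```

Reading the file once with `float_fields_with_nan` and then keeping only the `#`-prefixed lines from `retype_header` would give a replacement header of the kind described in the linked Hail discussion. FORMAT (per-sample) fields are not checked in this sketch.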

VCF nan • 1.2k views

So, now I need a way to automate it.

sed 's/ATT=nan;/ATT=0;/g'

?


What is ATT? I need to check the whole file, find every 'nan', look up the type of the field it belongs to, and then change that type to String. Another way, and maybe a much simpler one, would be to just replace 'nan' with zero, but is that correct? Maybe it is; I just want to make sure it would not make downstream calculations wrong.

If by 'ATT' you mean any attribute name, then I need to generalize the pattern somehow, since I do not know beforehand which attributes will contain 'nan'.


If by 'ATT' you mean any attribute name, then I need to generalize the pattern somehow, since I do not know beforehand which attributes will contain 'nan'.

sed 's/\([A-Z_a-z0-9]*\)=nan;/\1=0;/g'

Still getting an error:

unable to convert [1.618, nan] (of class java.util.ArrayList) to Array[Double]

The pattern probably needs to handle the [...] list form as well.

I am thinking of just using sed 's/nan/0/g', but I am afraid that if 'nan' occurs as part of some name, it will be replaced by mistake.


I tried making it match the other pattern but it is not working:

sed 's/\(\[([0-9]*[.])?[0-9]+[, ]+\)nan;/\1, 0;/g'
4.5 years ago

What ultimately worked for simply replacing 'nan' with zeros in the VCF is the following script:

#!/bin/bash

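# Replace 'nan' with 0 where it appears as {paramName}=nan or inside a list such as [{floatNum}, nan].
# Requires GNU sed: \? in a basic regular expression is a GNU extension.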

sed_commands_array=("s/\([A-Z_a-z0-9]*\)=nan\([,;]\)/\1=0\2/g;"
"s/\([[-]\?[0-9]*\.\?[0-9]*\), nan/\1, 0/g;"
"s/\([A-Z_a-z0-9]*=[-]\?[0-9]*\.\?[0-9]*\),nan/\1,0/g")

IFS=''

sed_command="${sed_commands_array[*]}"

sed "$sed_command" "$1" > "${1%.*}_cleaned.vcf"
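
The worry raised earlier in the thread, that a blanket s/nan/0/g could hit 'nan' inside a name, can also be handled by treating 'nan' as a match only when it is a complete value. The following is a rough Python sketch of that idea, not a drop-in replacement for the script above: it assumes the values live in the tab-separated INFO column (the eighth column) with no spaces around commas, and `clean_info` and `clean_line` are hypothetical helper names.

```python
import re

# 'nan' counts only as a whole value: right after '=' or ',' and
# right before ',', ';' or the end of the INFO string.
NAN_VALUE = re.compile(r'(?<=[=,])nan(?=[,;]|$)', re.IGNORECASE)


def clean_info(info, replacement='0'):
    """Replace whole-value 'nan' entries in one INFO string."""
    return NAN_VALUE.sub(replacement, info)


def clean_line(line, replacement='0'):
    """Clean the INFO column of a data line; header lines pass through unchanged."""
    if line.startswith('#'):
        return line
    cols = line.rstrip('\n').split('\t')
    if len(cols) >= 8:
        cols[7] = clean_info(cols[7], replacement)
    return '\t'.join(cols) + '\n'
```

Because the lookarounds do not consume the separators, runs such as `nan,nan` are replaced in full. The `replacement` parameter is there because VCF represents a missing value as `.`, which may be a safer substitute than 0 for downstream statistics; whether Hail 0.1 accepts `.` in these fields was not checked here.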