sed (or similar) command to remove part of the FORMAT from a vcf file?
13 months ago
Jautis ▴ 300

Hello, I'm looking for a quick way to remove PGT, PID, and PS from the FORMAT column of a VCF file output by GATK. Currently, some sites have these fields while others don't (see example below). For downstream processing in another program, I need all sites to have the same FORMAT fields. Phasing information doesn't matter, so the easiest way to achieve this is to remove the phasing data entirely from the VCF file. Do you have any suggestions for how to do this? It feels like it should be easy enough with sed or awk, but I can't figure it out.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL               ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0/0:1,0:1:3:.:.:0,3,45:.
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153


I can use sed -e 's/:\.:\.:/:/g' | sed -e 's/:\.\t/\t/g' | sed -e 's/GT:AD:DP:GQ:PGT:PID:PL:PS/GT:AD:DP:GQ:PL/g' to remove most of the information (shown below), but can't figure out how to deal with the third case where there is phased data.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153


This is what I would like as the final result. Transforming 0|1 into 0/1 is straightforward, but I'm having a difficult time figuring out how to remove the values stored in the PGT, PID, and PS fields when they are not present consistently across sites.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/1:1,2:3:36:81,0,36


vcf sed processing • 369 views
13 months ago

Hello,

bcftools is your friend :)

bcftools annotate is able to remove or keep information from a vcf record.

bcftools +setGT can manipulate the genotype.

Try this:

$ bcftools annotate -x ^INFO/DP,^FORMAT/GT,^FORMAT/AD,^FORMAT/DP,^FORMAT/GQ,^FORMAT/PL input.vcf | bcftools +setGT -- -ta -nu


fin swimmer


Awesome, thank you! I didn't know bcftools could remove tags in addition to writing new ones.
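
For anyone who still wants the awk route asked about in the question, here is a minimal sketch. It is not a replacement for bcftools. It assumes a tab-delimited VCF with FORMAT in column 9 and samples from column 10 onward. The function name strip_phasing is made up for this example. It finds PGT, PID, and PS by name in each record's FORMAT column, so it does not matter which sites carry them. Unlike the bcftools command above, it only swaps "|" for "/" in GT and does not reorder alleles.

```shell
# Hypothetical helper (name invented for this sketch): reads a VCF on stdin,
# drops PGT, PID and PS from FORMAT and from every sample, and unphases GT.
strip_phasing() {
  awk 'BEGIN { FS = OFS = "\t" }
  /^##FORMAT=<ID=(PGT|PID|PS),/ { next }    # drop the matching header definitions
  /^#/ { print; next }                      # pass all other header lines through
  {
    split("", drop)                         # reset the per-record drop table
    n = split($9, fmt, ":")                 # FORMAT keys of this record
    newfmt = ""; sep = ""
    for (i = 1; i <= n; i++) {
      if (fmt[i] == "PGT" || fmt[i] == "PID" || fmt[i] == "PS") { drop[i] = 1; continue }
      newfmt = newfmt sep fmt[i]; sep = ":"
    }
    $9 = newfmt
    for (s = 10; s <= NF; s++) {            # every sample column
      m = split($s, val, ":")
      out = ""; sep = ""
      for (i = 1; i <= m; i++) {
        if (i in drop) continue             # skip values of the dropped keys
        if (fmt[i] == "GT") gsub(/[|]/, "/", val[i])   # 0|1 -> 0/1, no allele sorting
        out = out sep val[i]; sep = ":"
      }
      $s = out
    }
    print
  }'
}
```

Usage would be something like strip_phasing < input.vcf > output.vcf (file names are placeholders). Note that the example records in the question are shown with spaces; the sketch expects real tabs, as in an actual VCF.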