sed (or similar) command to remove part of the FORMAT from a vcf file?
4.2 years ago
Jautis ▴ 530

Hello, I'm looking for a quick way to remove PGT, PID, and PS from the FORMAT of a vcf file output by GATK. Currently, some sites have these fields while others don't (see example below). For downstream processing in another program, I need all sites to have the same fields, and the phasing information doesn't matter, so the easiest approach is to remove the phasing data entirely from the vcf file. Do you have any suggestions for how to do this? It feels like it should be easy enough to do using sed or awk, but I can't figure it out.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL               ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0/0:1,0:1:3:.:.:0,3,45:.
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153

I can use sed -e 's/:\.:\.:/:/g' | sed -e 's/:\.\t/\t/g' | sed -e 's/GT:AD:DP:GQ:PGT:PID:PL:PS/GT:AD:DP:GQ:PL/g' to remove most of the information (shown below), but can't figure out how to deal with the third case where there is phased data.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153

This is what I would like as the final result. Transforming 0|1 into 0/1 is straightforward, but I'm having a difficult time figuring out how to remove the values in the PGT, PID, and PS fields when they aren't present at every site.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/1:1,2:3:36:81,0,36

Thank you in advance!

vcf sed processing • 2.3k views
4.2 years ago

Hello,

bcftools is your friend :)

bcftools annotate is able to remove or keep information from a vcf record.

bcftools +setGT can manipulate the genotype.

Try this:

$ bcftools annotate -x ^INFO/DP,^FORMAT/GT,^FORMAT/AD,^FORMAT/DP,^FORMAT/GQ,^FORMAT/PL input.vcf | bcftools +setGT -- -ta -nu
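If the goal is only to drop the three phasing tags and leave every other annotation alone, a narrower variant may be enough. This is a sketch, not a tested recipe for your file: the function name drop_phasing_tags is made up for illustration, and it assumes bcftools (with its plugins) is on your PATH.

```shell
# Hypothetical wrapper: remove only FORMAT/PGT, FORMAT/PID and FORMAT/PS,
# then rewrite every genotype as unphased (0|1 becomes 0/1).
# Assumes bcftools and its setGT plugin are installed.
drop_phasing_tags() {
  bcftools annotate -x 'FORMAT/PGT,FORMAT/PID,FORMAT/PS' "$1" \
    | bcftools +setGT -- -t a -n u
}
```

Usage would be something like drop_phasing_tags input.vcf > output.vcf. Unlike the command above, this leaves all INFO tags untouched.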

fin swimmer
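
For completeness, since the question asked about sed or awk: the same result can probably be had with plain awk by looking up, for each record, which positions in FORMAT hold PGT, PID, or PS and skipping those positions in every sample column. The sketch below assumes a tab-separated VCF; strip_phasing is a made-up name, and it also drops the matching ##FORMAT header lines so the header stays consistent with the records.

```shell
# Hypothetical helper: strip PGT, PID and PS from FORMAT and from each sample
# column, and turn phased genotypes (0|1) into unphased ones (0/1).
# Reads the files given as arguments, or stdin if there are none.
strip_phasing() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^##FORMAT=<ID=(PGT|PID|PS),/ { next }   # drop header lines of removed tags
       /^#/ { print; next }                     # pass other header lines through
       {
         n = split($9, keys, ":")               # FORMAT keys of this record
         fmt = ""
         for (i = 1; i <= n; i++) {
           drop[i] = (keys[i] == "PGT" || keys[i] == "PID" || keys[i] == "PS")
           if (!drop[i]) fmt = (fmt == "" ? keys[i] : (fmt ":" keys[i]))
         }
         $9 = fmt
         for (c = 10; c <= NF; c++) {           # every sample column
           m = split($c, vals, ":")
           out = ""; first = 1
           for (i = 1; i <= m; i++) {
             if (drop[i]) continue              # skip values of removed tags
             v = vals[i]
             if (keys[i] == "GT") gsub(/\|/, "/", v)   # unphase the genotype
             out = (first ? v : (out ":" v)); first = 0
           }
           $c = out
         }
         print
       }' "$@"
}
```

Run it as strip_phasing input.vcf > output.vcf. bcftools is still the safer choice, because it validates the file against the header; the awk version trusts that every record is well formed.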


Awesome, thank you! I didn't know bcftools could remove tags in addition to writing new ones.
