Question: sed (or similar) command to remove part of the FORMAT from a vcf file?
Jautis (United States) wrote, 4 months ago:

Hello, I'm looking for a quick way to remove PGT, PID, and PS from the FORMAT column of a VCF file output by GATK. Currently, some sites have these fields while others don't (see the example below). For downstream processing in another program, I need all sites to have the same FORMAT fields. Phasing information doesn't matter, so the easiest way to achieve this is to remove the phasing data from the VCF file entirely. Do you have any suggestions for how to do this? It feels like it should be easy enough with sed or awk, but I can't figure it out.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL               ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0/0:1,0:1:3:.:.:0,3,45:.
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PGT:PID:PL:PS    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153

I can use sed -e 's/:\.:\.:/:/g' | sed -e 's/:\.\t/\t/g' | sed -e 's/GT:AD:DP:GQ:PGT:PID:PL:PS/GT:AD:DP:GQ:PL/g' to remove most of the information (result shown below), but I can't figure out how to deal with the third case, where the site has phased data.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0|1:1,2:3:36:0|1:153_C_T:81,0,36:153

This is what I would like as the final result. Transforming 0|1 into 0/1 is straightforward, but I'm having a difficult time figuring out how to remove the values in the PGT, PID, and PS fields when those fields are not present at every site.

chr1    1    .    G    A    100    .    DP=10    GT:AD:DP:GQ:PL    ./.:0,0:0:.:0,0,0
chr2    4    .    C    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/0:1,0:1:3:0,3,45
chr3    2    .    A    T    100    .    DP=10    GT:AD:DP:GQ:PL    0/1:1,2:3:36:81,0,36

Thank you in advance!

finswimmer (Germany) wrote, 4 months ago:

Hello,

bcftools is your friend :)

bcftools annotate is able to remove or keep information from a vcf record.

bcftools +setGT can manipulate the genotype.

Try this:

$ bcftools annotate -x '^INFO/DP,^FORMAT/GT,^FORMAT/AD,^FORMAT/DP,^FORMAT/GQ,^FORMAT/PL' input.vcf | bcftools +setGT -- -ta -nu

fin swimmer
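
For readers who want the awk route raised in the question, the following is a minimal sketch rather than a replacement for bcftools. It assumes a tab-delimited VCF, drops the PGT, PID, and PS keys from the FORMAT column together with the matching values in every sample column, and rewrites phased genotypes (0|1) as unphased (0/1). The function name strip_phasing is hypothetical and only used for illustration.

```shell
# strip_phasing: hypothetical helper name, not part of any existing tool.
# Assumes tab-delimited VCF on stdin or in the files given as arguments.
strip_phasing() {
  awk 'BEGIN { FS = OFS = "\t" }
       # drop the header definitions of the removed FORMAT keys
       /^##FORMAT=<ID=(PGT|PID|PS),/ { next }
       # pass all other header lines through unchanged
       /^#/ { print; next }
       {
         # work out which FORMAT positions to drop for this record
         n = split($9, keys, ":")
         fmt = ""; sep = ""
         for (i = 1; i <= n; i++) {
           drop[i] = (keys[i] == "PGT" || keys[i] == "PID" || keys[i] == "PS")
           if (!drop[i]) { fmt = fmt sep keys[i]; sep = ":" }
         }
         $9 = fmt
         # apply the same positions to every sample column
         for (s = 10; s <= NF; s++) {
           m = split($s, vals, ":")
           out = ""; sep = ""
           for (i = 1; i <= m; i++) {
             if (i <= n && drop[i]) continue
             v = vals[i]
             # unphase the genotype: 0|1 becomes 0/1
             if (keys[i] == "GT") gsub(/\|/, "/", v)
             out = out sep v; sep = ":"
           }
           $s = out
         }
         print
       }' "$@"
}
```

Usage, assuming the function has been defined in the current shell: strip_phasing input.vcf > output.vcf. This sketch does no validation of the VCF, so the bcftools command above remains the safer choice for real data.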


Jautis replied, 4 months ago:

Awesome, thank you! I didn't know bcftools could remove tags in addition to writing new ones.