Question: Remove ambiguous calls in the VCF file
SOHAIL (Beijing Institute of Genomics, CAS) wrote, 20 months ago:

Hi everyone,

I have a VCF file in which some positions of the genome have ambiguous REF/ALT calls, where the REF allele is one of Y, R, M, K, S, W (i.e. two-base ambiguity codes), e.g.:

3       60830534        .       M       C       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,

Is there any way to remove them all from the VCF file?

Kind regards, Sohail

ATpoint (Germany) wrote, 20 months ago:

This will print the header of the VCF and only those entries where both REF and ALT contain A, C, T or G:

awk '$1 ~ /^#/ {print $0;next} {if ($4 ~ /A|C|T|G/ && $5 ~ /A|C|T|G/) print $0}' in.vcf > filtered.vcf

@ATpoint Thanks for the reply!

Your command works well for VCF files that contain only variant calls (i.e. where both the REF and ALT columns contain A/T/G/C).

However, my VCF file contains calls for all positions of the genome (homozygous-reference, homozygous-alternate, or heterozygous sites) together with the ambiguous calls, so column 5 (ALT) may contain a period (i.e. the dot symbol), e.g.:

3       60830534        .       M       C       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,
3       60830535        .       C       .       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,

When I modified the command as follows, the ambiguous call is still there:

awk '$1 ~ /^#/ {print $0;next} {if ($4 ~ /A|C|T|G/ && $5 ~ /.|A|C|T|G/) print $0}' in.vcf > filtered.vcf

Am I making a mistake?

written 20 months ago by SOHAIL, modified 20 months ago by finswimmer

Sorry, I do not get it. Of the two lines above, the one where REF is M and the one where ALT is ., which should be removed?

written 20 months ago by ATpoint

@ATpoint, the lines with M (and the others: Y, R, W, K, S, i.e. two-base ambiguity codes) should be removed from the VCF file.

edit: any help???

written 20 months ago by SOHAIL, modified 20 months ago
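
A possible way to get the behaviour asked for above, offered as a sketch rather than a tested answer from the thread. In a regular expression an unescaped . matches any character, and an unanchored pattern such as /A|C|T|G/ matches any string that merely contains one of those letters. Anchoring the patterns and comparing ALT to a literal "." avoids both problems. The function name filter_unambiguous is hypothetical. The sketch assumes uppercase alleles, and it will also drop records whose ALT is a symbolic allele or *.

```shell
# Hypothetical helper: keep header lines, plus records whose REF consists only
# of A/C/G/T and whose ALT is either "." or a comma-separated list of
# A/C/G/T alleles. Reads the named files, or stdin if none are given.
filter_unambiguous() {
    awk '
        $1 ~ /^#/ { print; next }   # pass header lines through unchanged
        # print the record only if REF is unambiguous and ALT is "." or unambiguous
        $4 ~ /^[ACGT]+$/ && ($5 == "." || $5 ~ /^[ACGT]+(,[ACGT]+)*$/)
    ' "$@"
}

# Usage (file names are placeholders):
#   filter_unambiguous in.vcf > filtered.vcf
```

Default whitespace splitting is used so that the same command works on tab-delimited files and on space-aligned excerpts like the ones pasted in this thread.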