Question: Remove ambiguous calls in the VCF file
SOHAIL (Beijing Institute of Genomics, CAS) wrote 7 months ago:

Hi everyone,

I have a VCF file in which some positions of the genome have ambiguous REF/ALT calls, i.e. the allele is given as a two-base ambiguity code (Y, R, M, K, S or W). For example:

3       60830534        .       M       C       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,

Is there any way to remove them all from the VCF file?

Kind regards, Sohail

ATpoint (Germany) wrote 7 months ago:

This will print the header of the VCF and only those entries where both REF and ALT contain A, C, T or G:

awk '$1 ~ /^#/ {print $0;next} {if ($4 ~ /A|C|T|G/ && $5 ~ /A|C|T|G/) print $0}' in.vcf > filtered.vcf
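A stricter variant of the same idea, as a sketch only: the pattern /A|C|T|G/ is unanchored, so it accepts any field that merely contains one of those letters (a multi-base REF such as AM would pass). Anchoring a character class with ^ and $ requires every base to be A, C, G or T. The small in.vcf written below is made-up test data, and allowing a comma in ALT (for multi-allelic sites) is an assumption about the input.

```shell
# Made-up records for illustration (tab-separated, truncated to 8 columns).
printf '##fileformat=VCFv4.2\n' > in.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> in.vcf
printf '3\t100\t.\tM\tC\t101\t.\t.\n' >> in.vcf    # ambiguous REF
printf '3\t101\t.\tA\tG\t101\t.\t.\n' >> in.vcf    # ordinary SNP
printf '3\t102\t.\tAM\tC\t101\t.\t.\n' >> in.vcf   # ambiguity code inside a longer REF

# Keep header lines, plus records whose REF and ALT consist ONLY of A/C/G/T
# (ALT may also contain commas separating several alleles).
awk '/^#/ {print; next} $4 ~ /^[ACGT]+$/ && $5 ~ /^[ACGT,]+$/' in.vcf > filtered.vcf
```

With this sample, only the two header lines and the record at position 101 end up in filtered.vcf.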

@ATpoint Thanks for the reply!

Your command works well for VCF files in which only variant sites are called (i.e. both the REF and ALT columns contain A/T/G/C).

However, my VCF file was called in "all-positions" mode, so it contains every genotype of the genome (homozygous reference, homozygous alternate or heterozygous sites) together with the ambiguous calls, and column 5 (ALT) may contain a period (i.e. the dot symbol), e.g.

3       60830534        .       M       C       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,
3       60830535        .       C       .       101     .       .       GT:DP:A:C:G:T:PP:GQ     1/1:24:0,0:14,9:0,0:1,0:1038,0,808,782,114,898,883,114,101,806:101,

When I modified the command as follows, the ambiguous calls are still there:

awk '$1 ~ /^#/ {print $0;next} {if ($4 ~ /A|C|T|G/ && $5 ~ /.|A|C|T|G/) print $0}' in.vcf > filtered.vcf

Am I making a mistake?

Reply written 7 months ago by SOHAIL, modified 7 months ago by finswimmer

Sorry, I do not get it. From the two lines above, the one where REF is M and the one with ALT ., which of these should be removed?

Reply written 7 months ago by ATpoint

@ATpoint, the lines with M (and the other two-base ambiguity codes Y, R, W, K, S) should be removed from the VCF file.

edit: any help???

Reply written and modified 7 months ago by SOHAIL
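A possible explanation and fix, offered as a sketch rather than a tested answer for the real file: in a regular expression an unescaped . matches any single character, so $5 ~ /.|A|C|T|G/ is true for every non-empty ALT, including ambiguity codes. The REF test is unchanged, so a record with REF M should still be dropped; the calls that survive are presumably ones where the ambiguity code sits in ALT. Comparing ALT to a literal "." and anchoring the base classes avoids this. The first two records below are shortened from the example in the thread; the last two positions are made up for illustration, and allowing commas in ALT is an assumption.

```shell
# Sample records (tab-separated, truncated to 8 columns).
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > in.vcf
printf '3\t60830534\t.\tM\tC\t101\t.\t.\n' >> in.vcf   # ambiguous REF     -> drop
printf '3\t60830535\t.\tC\t.\t101\t.\t.\n' >> in.vcf   # ALT is "."        -> keep
printf '3\t60830536\t.\tC\tR\t101\t.\t.\n' >> in.vcf   # ambiguous ALT     -> drop (made-up)
printf '3\t60830537\t.\tA\tG\t101\t.\t.\n' >> in.vcf   # ordinary SNP      -> keep (made-up)

# Keep the header; keep a record only if REF is pure A/C/G/T and ALT is
# either a literal "." or pure A/C/G/T (commas allowed between alleles).
awk '/^#/ {print; next}
     $4 ~ /^[ACGT]+$/ && ($5 == "." || $5 ~ /^[ACGT,]+$/)' in.vcf > filtered.vcf
```

With this sample, filtered.vcf keeps the header and the records at 60830535 and 60830537. Symbolic ALT alleles or lowercase bases, if the real file has any, would need extra patterns.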