Remove line with awk in vcf.gz
Hello everyone.
I have a vcf.gz file and I want to filter out the rows that contain "." in column 4 or 5.
I tried this: zcat file.vcf.gz | grep -v "#" | awk '$4=="." || $5=="."'
However, I don't know how to delete those rows and save the result as a new file in vcf.gz format.
Thank you for your help.
SNP
Don't use grep/awk/... to filter VCF files. Instead, use programs that are specialized for this, like bcftools.
Columns 4 and 5 are the REF and ALT columns, so you want to exclude all rows that have no value there:
$ bcftools view -e "REF=='.'||ALT=='.'" -O z -o output.vcf.gz input.vcf.gz
(-O z asks bcftools for compressed VCF output, so the content matches the .vcf.gz file name.)
zcat file.vcf.gz | awk '$1 ~ /^#/ {print $0;next} {if ($4 == "." || $5 == "." ) print }' | bgzip > new.vcf.gz
This prints all entries where $4 or $5 is . to a new compressed VCF file (bgzip does the compression), preserving the header lines that start with #.
$1 ~ /^#/ {print $0;next} essentially means that if the line starts with # then print it (to preserve header lines).
{if ($4 == "." || $5 == "." ) print } tests if $4 or $5 is . and prints the entire row if true.
If you wanted entries with no . in either of the columns, that would be ($4 != "." && $5 != "." )
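As a rough sketch of that inverted filter, the snippet below runs it on a tiny made-up VCF so it can be tried without bgzip installed. The file names demo.vcf and demo.filtered.vcf and the three records are invented for illustration, and -F'\t' is an added assumption that the fields are tab-separated, as they should be in a VCF.

```shell
# Build a tiny, made-up VCF for illustration (tab-separated fields).
printf '##fileformat=VCFv4.2\n' > demo.vcf
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> demo.vcf
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> demo.vcf   # REF and ALT both set: kept
printf 'chr1\t200\t.\tC\t.\t.\t.\t.\n' >> demo.vcf   # ALT is ".": dropped
printf 'chr1\t300\t.\t.\tT\t.\t.\t.\n' >> demo.vcf   # REF is ".": dropped

# Print header lines unchanged; print a record only when
# neither REF ($4) nor ALT ($5) is ".".
awk -F'\t' '/^#/ {print; next} $4 != "." && $5 != "."' demo.vcf > demo.filtered.vcf
```

On the real file, the same awk program would sit between zcat file.vcf.gz and bgzip > new.vcf.gz, as in the command above.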
Edit: Agree with finswimmer that specialized tools such as bcftools are preferred to avoid any possible file corruption.
Thank you very much for your answer. Yes, my objective was to delete the lines containing ".", so I will use ($4 != "." && $5 != ".") instead.