Remove line with awk in vcf.gz
4.7 years ago
Sillpositive ▴ 20

Hello everyone.

I have a vcf.gz file and I want to filter out the lines that contain "." in column 4 or 5.

I tried this: zcat file.vcf.gz | grep -v "#" | awk '$4=="." || $5=="."'

However, I don't know how to delete them and save the new file in vcf.gz format.

Thank you for your help.

SNP • 3.7k views

4.7 years ago

Don't use grep/awk/... to filter VCF files. Instead, use programs that are specialized for this, like bcftools.

Columns 4 and 5 are the REF and ALT columns, so you want to exclude all rows that have no value there:

$ bcftools view -e "REF=='.'||ALT=='.'" -o output.vcf.gz input.vcf.gz
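
To make the compressed output explicit, here is a minimal sketch that wraps the same filter in a small shell function. The function name filter_missing_alleles and the variables src and dst are made up for illustration. As far as I know, older bcftools releases write uncompressed VCF unless -O is given, while newer ones may infer the type from the output file name, so passing -Oz is the safe choice either way.

```shell
# Sketch: drop records whose REF or ALT is "." and write bgzip-compressed VCF.
# filter_missing_alleles, src and dst are hypothetical names for this example.
filter_missing_alleles() {
  src=$1
  dst=$2
  if ! command -v bcftools >/dev/null 2>&1; then
    echo "bcftools not found on PATH" >&2
    return 127
  fi
  if [ ! -r "$src" ]; then
    echo "cannot read $src" >&2
    return 1
  fi
  # -e excludes records matching the expression; -Oz requests compressed VCF.
  bcftools view -e "REF=='.'||ALT=='.'" -Oz -o "$dst" "$src" &&
    bcftools index -t "$dst"   # optional: tabix index for region queries
}
```

Usage would be: filter_missing_alleles input.vcf.gz output.vcf.gz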

Thank you finswimmer for your response! But with bcftools, is it possible to keep the vcf.gz format? Thank you!

Yes, of course. I've edited my answer.

Isn't the output type option (-O) necessary?

I also used the -Oz option to get the vcf.gz format!

4.7 years ago
ATpoint 81k
zcat file.vcf.gz | awk '$1 ~ /^#/ {print $0;next} {if ($4 == "." || $5 == "." ) print }' | bgzip > new.vcf.gz

This prints all entries where $4 or $5 is . to a new compressed VCF file (bgzip for compression), preserving the header lines starting with #.

$1 ~ /^#/ {print $0;next} essentially means that if the line starts with # then print it (to preserve header lines).

{if ($4 == "." || $5 == "." ) print } tests if $4 or $5 is . and prints the entire row if true.

If you wanted entries with no . in either of the columns, that would be ($4 != "." && $5 != "." )
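
A minimal sketch of that inverted filter, run on a tiny made-up VCF so the effect is easy to check. The sample records and the function name filter_vcf are hypothetical, and -F'\t' is added on the assumption that the file is tab-delimited, as VCF normally is.

```shell
# Hypothetical sample: two header lines and three tab-separated records.
sample=$(mktemp)
printf '##fileformat=VCFv4.2\n' > "$sample"
printf '#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' >> "$sample"
printf 'chr1\t100\t.\tA\tG\t.\t.\t.\n' >> "$sample"
printf 'chr1\t200\t.\tC\t.\t.\t.\t.\n' >> "$sample"
printf 'chr1\t300\t.\t.\tT\t.\t.\t.\n' >> "$sample"

# Print header lines unchanged; print a record only when neither
# REF ($4) nor ALT ($5) is ".".
filter_vcf() {
  awk -F'\t' '/^#/ {print; next} $4 != "." && $5 != "."' "$1"
}

filtered=$(filter_vcf "$sample")
printf '%s\n' "$filtered"
rm -f "$sample"
```

On the real file the same awk program would sit between zcat file.vcf.gz and bgzip > new.vcf.gz, as in the command at the top of this answer.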


Edit: Agree with finswimmer that specialized tools such as bcftools are preferred to avoid any possible file corruption.

Thank you very much for your answer. Yes, my objective was to delete the lines containing ".", so I will use ($4 != "." && $5 != ".") instead.
