To create a multi-sample VCF from a large cohort of WES samples of very different quality, I have to select only high-quality variants genotyped in as many samples as possible.
I figured out that
- long indels have low quality
- substitutions alone do not provide enough variants for my analysis.
I know how to filter out indels using bcftools. Is there a command that filters out only long indels but keeps 1-2 bp insertions/deletions? I feel an AWK command should be very fast, but I don't know how to count the number of characters in the REF/ALT columns of the VCF and print only variants where both REF and ALT are shorter than 3 characters.
I'd appreciate any help; quick googling did not solve the problem.
UPD: My ugly solution based on Ram's comment:
zcat final_all_merged.vcf.gz | grep "^#" > only_short_indels.vcf
zcat final_all_merged.vcf.gz | awk '!/^#/ && length($5) + length($4) < 4' >> only_short_indels.vcf
gzip only_short_indels.vcf
I believe Pierre's solution will also work; I'm just too lazy to install an additional toolkit on the cluster...
UPD1: one-liner
zcat final_all_merged.vcf.gz | awk '($1 ~ /^#/ || length($5) + length($(4)) < 4)' | gzip > only_short_indels.vcf.gz
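One caveat about the length test: VCF indel alleles include the anchoring base, so a 2 bp insertion looks like REF=A, ALT=ATT (lengths 1 and 3). The sum `length($5) + length($4) < 4` therefore keeps SNVs and 1 bp indels only, and it also drops multi-allelic records because the comma-separated ALT string is counted as a whole. A minimal sketch of a per-allele alternative is below. It compares the length difference between REF and each ALT allele. The function name `keep_short_indels` and its `max` argument are made up for illustration, and symbolic alleles such as `<DEL>` are not treated specially (they are simply measured as strings).

```shell
# keep_short_indels [max]: hypothetical helper, reads uncompressed VCF on stdin.
# Keeps header lines and records where every ALT allele differs from REF
# by at most `max` bases in length (default 2).
keep_short_indels() {
  awk -F'\t' -v max="${1:-2}" '
    /^#/ { print; next }                   # pass header lines through unchanged
    {
      n = split($5, alt, ",")              # multi-allelic ALT is comma-separated
      for (i = 1; i <= n; i++) {
        d = length(alt[i]) - length($4)    # indel length relative to REF
        if (d < 0) d = -d
        if (d > max) next                  # one long allele rejects the record
      }
      print
    }'
}
```

Usage would be along the lines of `zcat final_all_merged.vcf.gz | keep_short_indels 2 | gzip > only_short_indels.vcf.gz`. Rejecting the whole record when any allele is too long is a design choice; splitting multi-allelic sites first would keep the short alleles instead.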