Filtering long indels from VCF
1
0
Entering edit mode
5 weeks ago

Hi,

to create a multi-sample VCF in a large cohort of WES samples of very different quality I have to select only high-quality variants genotyped in as many samples as possible.

I figured out that

  1. long indels have low quality
  2. only substitutions do not provide enough variants for my analysis.

I know how to filter out indels using bcftools - is there a command that may filter out long indels only, but remain 1-2bp inserts/deletions? I feel some AWK command should be very fast, but I don't know how to count number of chars in columns ALT/REF of the VCF and how to print only variants where both ALT/REF variants are shorter than 3 symbols.

Appreciate any help, quick googling did not solve the problem.

UPD: My ugly solution based on Ram's comment:

zcat final_all_merged.vcf.gz | grep "#" > only_short_indels.vcf
zcat final_all_merged.vcf.gz | awk 'length($5) + length($(4)) < 4' >> only_short_indels.vcf
gzip only_short_indels.vcf

I believe Pierre's solution will also work, just too lazy to install additional toolkit on cluster...

UPD1: one liner

zcat final_all_merged.vcf.gz | awk '($1 ~ /^#/ || length($5) + length($(4)) < 4)' | gzip > only_short_indels.vcf.gz
VCF • 196 views
ADD COMMENT
1
Entering edit mode

Google is your friend here. One of the top results when I searched for nchar awk is this link: https://stackoverflow.com/questions/16613854/remove-lines-based-on-number-of-characters which shows a function called length, which I think you should be able to use like so: length($5)-length($4) > 3 (and add in something there to get the absolute value of the difference).

ADD REPLY
0
Entering edit mode

shame on me, should work indeed!

ADD REPLY
1
Entering edit mode
5 weeks ago

how to print only variants where both ALT/REF variants are shorter than 3 symbols.

using http://lindenb.github.io/jvarkit/VcfFilterJdk.html

java -jar dist/vcffilterjdk.jar -e 'return variant.getAlleles().stream().filter(A->!A.isSymbolic()).allMatch(A->A.length()<4&& A.length()>0);'
ADD COMMENT

Login before adding your answer.

Traffic: 2997 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6