Check out the fuc
package I wrote:
API solution (Python)
>>> from fuc import pyvcf
>>> data = {
... 'CHROM': ['chr1', 'chr1', 'chr1'],
... 'POS': [100, 101, 102],
... 'ID': ['.', '.', '.'],
... 'REF': ['G', 'T', 'T'],
... 'ALT': ['A', 'C', 'G'],
... 'QUAL': ['.', '.', '.'],
... 'FILTER': ['.', '.', '.'],
... 'INFO': ['.', '.', '.'],
... 'FORMAT': ['GT:DP:AD:GQ', 'GT:DP:AD:GQ', 'GT:DP:AD:GQ'],
... 'A': ['0/0:26:0,26:30', '0/1:32:16,16:19', '0/0:.:.:.'],
... 'B': ['./.:.:.:.', '0/0:31:29,2:60', './.:.:.:.'],
... 'C': ['0/1:18:12,6:40', '0/0:24:24,0:12', '1/1:8:0,8:0'],
... }
>>> vf = pyvcf.VcfFrame.from_dict([], data)
>>> # vf = pyvcf.VcfFrame.from_file('your_file.vcf')
>>> vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C
0 chr1 100 . G A . . . GT:DP:AD:GQ 0/0:26:0,26:30 ./.:.:.:. 0/1:18:12,6:40
1 chr1 101 . T C . . . GT:DP:AD:GQ 0/1:32:16,16:19 0/0:31:29,2:60 0/0:24:24,0:12
2 chr1 102 . T G . . . GT:DP:AD:GQ 0/0:.:.:. ./.:.:.:. 1/1:8:0,8:0
>>> filtered_vf = vf.markmiss('GQ >= 20 and DP >= 10 and 0.2 <= sum(AD[1:])/DP <= 0.8', opposite=True)
>>> filtered_vf.df
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT A B C
0 chr1 100 . G A . . . GT:DP:AD:GQ ./.:.:.:. ./.:.:.:. 0/1:18:12,6:40
1 chr1 101 . T C . . . GT:DP:AD:GQ ./.:.:.:. ./.:.:.:. ./.:.:.:.
2 chr1 102 . T G . . . GT:DP:AD:GQ ./.:.:.:. ./.:.:.:. ./.:.:.:.
>>> # filtered_vf.to_file('filtered.vcf')
CLI solution
$ fuc vcf_filter in.vcf 'GQ >= 20 and DP >= 10 and 0.2 <= sum(AD[1:])/DP <= 0.8' --opposite > out.vcf
For getting help:
$ fuc vcf_filter -h
usage: fuc vcf_filter [-h] [--greedy] [--opposite] [--samples PATH] vcf expr
This command will filter a VCF file (both zipped and unzipped). It essentially
wraps the 'pyvcf.VcfFrame.markmiss' method from the fuc API.
usage examples:
$ fuc vcf_filter in.vcf 'GT == "0/0"' > out.vcf
$ fuc vcf_filter in.vcf 'GT != "0/0"' > out.vcf
$ fuc vcf_filter in.vcf 'DP < 30' > out.vcf
$ fuc vcf_filter in.vcf 'DP < 30' --greedy > out.vcf
$ fuc vcf_filter in.vcf 'AD[1] < 10' --greedy > out.vcf
$ fuc vcf_filter in.vcf 'AD[1] < 10 and DP < 30' --greedy > out.vcf
$ fuc vcf_filter in.vcf 'AD[1] < 10 or DP < 30' --greedy > out.vcf
$ fuc vcf_filter in.vcf 'AD[1] < 10 or DP < 30' --opposite > out.vcf
$ fuc vcf_filter in.vcf 'np.mean(AD) < 10' --greedy --samples sample.list > out.vcf
positional arguments:
vcf VCF file
expr expression to evaluate
optional arguments:
-h, --help show this help message and exit
--greedy use this flag to mark even ambiguous genotypes as missing
--opposite use this flag to mark all genotypes that do not satisfy the
query expression as missing and leave those that do intact
--samples PATH file of sample names to apply the marking (one sample per
line)