Question

How to do sample based QC via bcftools

1

Entering edit mode

3.2 years ago

nobody • 0

Hi Team ...

I want to use bcftools, or any other non-hail way, to achieve below filtering on my vcf files ...

#filter for samples that are > 95% call rate

hail way

mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95)

I'm very new to the bioinformatics world and have used bcftools till now to do following tasks ...

normalization of vcf files ( via norm option of bcftools )
filtering samples from vcf file ( by passing -s and -S flags to view option of bcftools )
basic column based filtering like, via filter -i 'FILTER=="PASS" || FILTER=="MONOALLELIC"'
concatenating blocks of vcf files ( via concat option of bcftools )

Can anyone please help me to achieve filtering for samples that have 95% of call rate?

or at least break down the steps required to achieve it and I will try to decode the manual of bcftools to achieve those steps ..

I tried finding solution in already answered question but didn't find a solution ... if someone has bookmarked the solution then please share that ...

May be this would help others too who don't want to use hail for basic filtering ...

I would appreciate any help, any pointer ...

thanks team ...

ps : if someone know a good video lecture or course that walks through some example scenarios with bcftools then please do share ...

bcftools qc vcf hail call rate • 3.6k views

ADD COMMENT • link updated 18 months ago by lb61 • 0 • written 3.2 years ago by nobody • 0

0

Entering edit mode

What is "call rate" and how is it measured? I have a hunch that hail does QC while converting VCF to the custom format it uses (VDS?) and that's how you're getting ready access to this metric. If hail calculates it on the fly, maybe we can figure out a way to do it using bcftools.

ADD REPLY • link 3.2 years ago by Ram 43k

0

Entering edit mode

IMHO , I think call rate is as defined by @Kevin Blighe in below thread !!!

What is SNP call rate?

'Call rate' is the proportion or percentage of samples in which a confident genotype call could be made. 95% is a typical value.

In the other samples in which a call could not be made, the probes may have failed, resulting in no binding to the template DNA. It is also possible that the DNA was lesser quality in these samples, or non-existent, for whatever reason(s).

Though he talks only about SNP ... in VCF context we can consider all the variants like indels, SNPs, etc

Much Thanks

ADD REPLY • link 3.2 years ago by nobody • 0

2

Entering edit mode

So you're looking for a fraction of samples that have a called GT. I took a quick look at the bcftools manual, and F_MISSING seems like the computed metric you're looking for.

variables calculated on the fly if not present: number of alternate alleles; number of samples; count of alternate alleles; minor allele count (similar to AC but is always smaller than 0.5); frequency of alternate alleles (AF=AC/AN); frequency of minor alleles (MAF=MAC/AN); number of alleles in called genotypes; number of samples with missing genotype; fraction of samples with missing genotype; indel length (deletions negative, insertions positive)

N_ALT, N_SAMPLES, AC, MAC, AF, MAF, AN, N_MISSING, F_MISSING, ILEN

So,

bcftools view -i 'F_MISSING < 0.05' -Ov -o out.vcf in.vcf.gz

ADD REPLY • link 3.2 years ago by Ram 43k

0

Entering edit mode

Thanks a ton @_r_am, I will try to dig into this aspect of bcftools ... this tool is so so extensive !!! I would also think through if this metrics matches with the calculation done by hail I had added some code and comment from the hail documentation( you added response before I could read through the hail code, I don't know how you do what you do, I had seen you also moderated my original question to make portion of it as "code" ... much respect !!!)...

I again thank you for your time and energy @_r_ram ... muchas gracias !!!

ADD REPLY • link 3.2 years ago by nobody • 0

0

Entering edit mode

One more quick question _r_am ...

if I want to filter my sample who have mean of DP i.e. Approximate read depth, higher than a particular value say 20 .... then can I use AVG(DP)>20 directly like below?

bcftools view -i 'AVG(DP)>20' -Ov -o out.vcf in.vcf.gz

asking as there is another function in hail which I would like to use, and your answer above has probably nudged me in right direction ... just want to confirm, as you know your kind response would really help !!!

sample_qc.dp_stats.mean >= 20

thanks ... much respect !!!

ADD REPLY • link 3.2 years ago by nobody • 0

0

Entering edit mode

Can I use ... like below?

I don't know, does it work? I cannot predict if bcftools' output will be concordant to your expectations, I can only point you to a feature. Why don't you run a few tests and see if the hail output corresponds to bcftools output?

ADD REPLY • link 3.2 years ago by Ram 43k

0

Entering edit mode

thanks for the kind response _r_am ... hail requires cluster enviornment and thus I wanted to skip it altogether, as I don't want to add more complexity to my pipeline ...

I didn't mean to ask if they will give same response, but if "logically" AVG(DP)> {INTEGER} would do filtering of samples which have average DP more than the integer values...

I again thank you for taking out time to respond to beginners like me ...

Much Thanks ...

ADD REPLY • link 3.2 years ago by nobody • 0

1

Entering edit mode

I mean, I think it would, but I can't be sure. Plainly speaking, I don't want you to take my word for it as I might be wrong. Test it yourself.

ADD REPLY • link 3.2 years ago by Ram 43k

0

Entering edit mode

From the documentation of the HAIL

this is how call_rate is calculated

call_rate (float64) – Fraction of calls not missing or filtered. Equivalent to n_called divided by count_rows().

where,

n_called is the Number of non-missing calls.

count_rows() Counts the number of rows in the matrix

( which I think would translate to total number of variants in vcf file)

They also mention that n_called , call_rate etc are calculated based from GT ...

digging further in source code of hail provides below info

'call_rate': hl.float64(x.n_called) / (x.n_called + x.n_not_called + x.n_filtered)

where

 bound_exprs['n_called'] = hl.agg.count_where(hl.is_defined(mt['GT']))
 bound_exprs['n_not_called'] = hl.agg.count_where(hl.is_missing(mt['GT']))

comments in code

- `n_called` (``int64``) -- Number of non-missing calls.
- `n_not_called` (``int64``) -- Number of missing calls.
- `n_filtered` (``int64``) -- Number of filtered entries.

ADD REPLY • link 3.2 years ago by nobody • 0

score 0 · Answer 1 · 2022-10-24

Following up on this — I'm also making a switch from Hail to bcftools and am trying to do similar sample-level QC! It seems like Ram 's suggestion of filtering by F_MISSING performs variant-level QC (ie, removing variants where a certain proportion of samples are missing genotype calls). However, is there a way to instead remove samples where a certain proportion of sites/variants are missing genotype calls?

The difference in Hail would be between mt = mt.filter_rows(mt.variant_qc.call_rate >= 0.95) (equivalent to bcftools view -i 'F_MISSING<0.05') versus mt = mt.filter_cols(mt.sample_qc.call_rate >= 0.95) (this is what I'm trying to find the bcftools equivalent of).

It seems like most of the filtering options with bcftools are really designed to filter out variants as opposed to samples, so I'm curious if there's a good way to do this. Thanks!