how to remove multiallelic from VCF
4
11
Entering edit mode
7.2 years ago
xinhui.wang ▴ 440

I am working on VCF dorument and there are many indels and multiallelic in my VCF. However, it is not necessary for me to includ them in our analysis and I would like to remove them. Is there a simple way to do that with vcftools or bcftools?

Any help would be greatly appreciated.

Thanks and with best regards,

Xinhui

SNP • 25k views
26
Entering edit mode
7.2 years ago

You can filter anything you want using bcftools view. In your case, if you want to filter out indels and multiallelic, you would need something like this:

bcftools view --max-alleles 2 --exclude-types indels input.vcf.gz


A typical command to filter out anything but biallelic SNPs, as stated in the bcftools manual, is the following:

bcftools view -m2 -M2 -v snps input.vcf.gz

0
Entering edit mode

really great and I used the second one since biallelic snps are we need.

1
Entering edit mode

be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the -m2 option.

0
Entering edit mode

Sorry. Do you mind to give me a small example of between biallelic snps with snps at most 2 alleles? Thanks and with best regards, Xinhui

1
Entering edit mode

strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the -m2 is a way not to worry about that possible issue.

0
Entering edit mode

Not works well for me. rs1799966 is tri-allelic, however, rs1799966 still left after -m2 -M2 -v snps

bcftools view -f PASS -m2 -M2 -v snps -R VIP.hg19.bed gnomad.exomes.r2.1.sites.chr17.vcf.bgz > VIP.chr17.vcf
grep rs1799966 VIP.chr17.vcf | sort -u | less -S

4
Entering edit mode

Use -m2 -M2 -v snps to only view biallelic SNPs.

so if it isn't working it must be either a problem with bcftools (that'd be odd) or either a problem with gnomad data.

if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:

$bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | grep -v ^# | cut -f1-7 17 41223094 rs1799966 T A 2.93471e+08 PASS 17 41223094 rs1799966 T C 2.93471e+08 PASS  if you want to do so, you'll have to join biallelic sites into multiallelic records in advance, using bcftools norm -m+ for instance: $ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  2/0/0/0
17      41223094        rs1799966       T       A,C     2.93471e+08     PASS


and then try to filter your region of interest (I'm using 17:41223090-41223100 here as an example):

bcftools view -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7 Lines total/split/realigned/skipped: 6/0/0/0 17 41223090 rs766305255 G A 10717.2 PASS 17 41223091 rs70953660 G A 140083 PASS  EDIT: first bcftools view is not really needed, since bcftools norm is already capable of slicing the data by region of interest:  bcftools norm -m+ -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS

0
Entering edit mode

Yes. You are exactly right. That's the point I figure out later.

0
Entering edit mode

Another thing should be mentioned should be sort ? anyway, now I will sort the vcf first and then use -m2 -M2

1
Entering edit mode

note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.

0
Entering edit mode

Hi Jorge, What kind of suffix would be preferred when we use bcftools view -Ou -o ? vcf.gz ? or bcf or vcf.bgz?

1
Entering edit mode

I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the bgzip algorithm rather than simply gzip. but this is simply informative for the human user, not for the software.

3
Entering edit mode
7.2 years ago
iraun ★ 4.5k

Something like this?

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf  Explanation: • /#/{print;next} ---> Print header lines. • if($5 !~ /,/ && length($5)==1 && length($4)==1){print} ---> If ALT column (5 column) does not contain a comma (there aren't ALT alleles), and REF ($4) and ALT ($5) columns have only one letter (SNP) print it.
2
Entering edit mode

A slightly faster perl alternative:

perl -lane 'if(/^#/ or length("$F[3]$F[4]")==2){print}' file.vcf

0
Entering edit mode

thanks it worked wery well and faster.

3
Entering edit mode
6.7 years ago
Hyunmin ▴ 30

Another suggestion is bcftools. bcftools has function of 'norm'.

It can be used with indel.

==

bcftools norm [OPTIONS] file.vcf.gz


Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows.

0
Entering edit mode
2.3 years ago
b.ambrozio ▴ 20

I was trying to do the same with all the 22 VCFs of the 1000 genome phase 3 datasets (~15G compressed) concatenated in a single VCF, but with bcftools was taking hours (more than 10 hours and still running). This is the command I was trying:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools view -Oz --max-alleles 2 --known --min-af 0.01 $vcf_file >$vcf_file.subset.vcf.bgz


Than, in the mean time, I was looking for alternatives and I found out how to address using Plink2:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
$./plink2 --recode vcf bgz --vcf$vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only


It was so much faster! Took only 28 min and 13 seconds to get it done. Here's command again, with the output log:

$./plink2 --recode vcf bgz --vcf$vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3.biallelic-only.log.
Options in effect:
--export vcf bgz
--keep-allele-order
--maf 0.01
--max-alleles 2
--min-alleles 2
--out ALL.phase3.biallelic-only
--vcf ../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Start time: Wed Mar 18 22:52:54 2020
Note: --keep-allele-order no longer has any effect.
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 81271745 variants scanned.
--vcf: ALL.phase3.biallelic-only-temporary.pgen +
ALL.phase3.biallelic-only-temporary.pvar +
ALL.phase3.biallelic-only-temporary.psam written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
ALL.phase3.biallelic-only-temporary.psam.
80855722 out of 81271745 variants loaded from
ALL.phase3.biallelic-only-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
67430946 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
13424776 variants remaining after main filters.
--export vcf bgz to ALL.phase3.biallelic-only.vcf.gz ... done.
End time: Wed Mar 18 23:21:07 2020