how to remove multiallelic from VCF
4
10
Entering edit mode
6.5 years ago
xinhui.wang ▴ 400

I am working on VCF dorument and there are many indels and multiallelic in my VCF. However, it is not necessary for me to includ them in our analysis and I would like to remove them. Is there a simple way to do that with vcftools or bcftools?

Any help would be greatly appreciated.

Thanks and with best regards,

Xinhui

SNP • 20k views
ADD COMMENT
25
Entering edit mode
6.5 years ago

You can filter anything you want using bcftools view. In your case, if you want to filter out indels and multiallelic, you would need something like this:

bcftools view --max-alleles 2 --exclude-types indels input.vcf.gz

A typical command to filter out anything but biallelic SNPs, as stated in the bcftools manual, is the following:

bcftools view -m2 -M2 -v snps input.vcf.gz
ADD COMMENT
0
Entering edit mode

really great and I used the second one since biallelic snps are we need.

ADD REPLY
1
Entering edit mode

be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the -m2 option.

ADD REPLY
0
Entering edit mode

Sorry. Do you mind to give me a small example of between biallelic snps with snps at most 2 alleles? Thanks and with best regards, Xinhui

ADD REPLY
1
Entering edit mode

strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the -m2 is a way not to worry about that possible issue.

ADD REPLY
0
Entering edit mode

Not works well for me. rs1799966 is tri-allelic, however, rs1799966 still left after -m2 -M2 -v snps

bcftools view -f PASS -m2 -M2 -v snps -R VIP.hg19.bed gnomad.exomes.r2.1.sites.chr17.vcf.bgz > VIP.chr17.vcf 
grep rs1799966 VIP.chr17.vcf | sort -u | less -S
ADD REPLY
3
Entering edit mode

bcftools' documentation is very clear about this

Use -m2 -M2 -v snps to only view biallelic SNPs.

so if it isn't working it must be either a problem with bcftools (that'd be odd) or either a problem with gnomad data.

if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | grep -v ^# | cut -f1-7
17      41223094        rs1799966       T       A       2.93471e+08     PASS
17      41223094        rs1799966       T       C       2.93471e+08     PASS

if you want to do so, you'll have to join biallelic sites into multiallelic records in advance, using bcftools norm -m+ for instance:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  2/0/0/0
17      41223094        rs1799966       T       A,C     2.93471e+08     PASS

and then try to filter your region of interest (I'm using 17:41223090-41223100 here as an example):

$ bcftools view -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS

EDIT: first bcftools view is not really needed, since bcftools norm is already capable of slicing the data by region of interest:

$ bcftools norm -m+ -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS
ADD REPLY
0
Entering edit mode

Yes. You are exactly right. That's the point I figure out later.

ADD REPLY
0
Entering edit mode

Another thing should be mentioned should be sort ? anyway, now I will sort the vcf first and then use -m2 -M2

ADD REPLY
1
Entering edit mode

note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.

ADD REPLY
0
Entering edit mode

Hi Jorge, What kind of suffix would be preferred when we use bcftools view -Ou -o ? vcf.gz ? or bcf or vcf.bgz?

ADD REPLY
1
Entering edit mode

I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the bgzip algorithm rather than simply gzip. but this is simply informative for the human user, not for the software.

ADD REPLY
3
Entering edit mode
6.5 years ago
iraun ★ 3.8k

Something like this?

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf

Explanation:

  • /#/{print;next} ---> Print header lines.
  • if($5 !~ /,/ && length($5)==1 && length($4)==1){print} ---> If ALT column (5 column) does not contain a comma (there aren't ALT alleles), and REF ($4) and ALT ($5) columns have only one letter (SNP) print it.
ADD COMMENT
2
Entering edit mode

A slightly faster perl alternative:

perl -lane 'if(/^#/ or length("$F[3]$F[4]")==2){print}' file.vcf
ADD REPLY
0
Entering edit mode

thanks it worked wery well and faster.

ADD REPLY
3
Entering edit mode
6.0 years ago
Hyunmin ▴ 30

Another suggestion is bcftools. bcftools has function of 'norm'.

It can be used with indel.

==

bcftools norm [OPTIONS] file.vcf.gz

Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows.

ADD COMMENT
0
Entering edit mode
19 months ago
b.ambrozio ▴ 20

I was trying to do the same with all the 22 VCFs of the 1000 genome phase 3 datasets (~15G compressed) concatenated in a single VCF, but with bcftools was taking hours (more than 10 hours and still running). This is the command I was trying:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools view -Oz --max-alleles 2 --known --min-af 0.01 $vcf_file > $vcf_file.subset.vcf.bgz

Than, in the mean time, I was looking for alternatives and I found out how to address using Plink2:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
$ ./plink2 --recode vcf bgz --vcf $vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only

It was so much faster! Took only 28 min and 13 seconds to get it done. Here's command again, with the output log:

$ ./plink2 --recode vcf bgz --vcf $vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only
PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3.biallelic-only.log.
Options in effect:
  --export vcf bgz
  --keep-allele-order
  --maf 0.01
  --max-alleles 2
  --min-alleles 2
  --out ALL.phase3.biallelic-only
  --vcf ../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Start time: Wed Mar 18 22:52:54 2020
Note: --keep-allele-order no longer has any effect.
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 81271745 variants scanned.
--vcf: ALL.phase3.biallelic-only-temporary.pgen +
ALL.phase3.biallelic-only-temporary.pvar +
ALL.phase3.biallelic-only-temporary.psam written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
ALL.phase3.biallelic-only-temporary.psam.
80855722 out of 81271745 variants loaded from
ALL.phase3.biallelic-only-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
67430946 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
13424776 variants remaining after main filters.
--export vcf bgz to ALL.phase3.biallelic-only.vcf.gz ... done.
End time: Wed Mar 18 23:21:07 2020
ADD COMMENT

Login before adding your answer.

Traffic: 1936 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6