Question: how to remove multiallelic from VCF
5
gravatar for xinhui.wang
5.4 years ago by
xinhui.wang250
Netherlands
xinhui.wang250 wrote:

I am working on VCF dorument and there are many indels and multiallelic in my VCF. However, it is not necessary for me to includ them in our analysis and I would like to remove them. Is there a simple way to do that with vcftools or bcftools?

Any help would be greatly appreciated.

Thanks and with best regards,

Xinhui

snp • 15k views
ADD COMMENTlink modified 6 months ago by b.ambrozio20 • written 5.4 years ago by xinhui.wang250
22
gravatar for Jorge Amigo
5.4 years ago by
Jorge Amigo12k
Santiago de Compostela, Spain
Jorge Amigo12k wrote:

You can filter anything you want using bcftools view. In your case, if you want to filter out indels and multiallelic, you would need something like this:

bcftools view --max-alleles 2 --exclude-types indels input.vcf.gz

A typical command to filter out anything but biallelic SNPs, as stated in the bcftools manual, is the following:

bcftools view -m2 -M2 -v snps input.vcf.gz
ADD COMMENTlink modified 19 months ago by RamRS30k • written 5.4 years ago by Jorge Amigo12k

really great and I used the second one since biallelic snps are we need.

ADD REPLYlink written 5.4 years ago by xinhui.wang250

be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the -m2 option.

ADD REPLYlink written 5.4 years ago by Jorge Amigo12k

Sorry. Do you mind to give me a small example of between biallelic snps with snps at most 2 alleles? Thanks and with best regards, Xinhui

ADD REPLYlink written 5.4 years ago by xinhui.wang250
1

strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the -m2 is a way not to worry about that possible issue.

ADD REPLYlink written 5.4 years ago by Jorge Amigo12k

Not works well for me. rs1799966 is tri-allelic, however, rs1799966 still left after -m2 -M2 -v snps

bcftools view -f PASS -m2 -M2 -v snps -R VIP.hg19.bed gnomad.exomes.r2.1.sites.chr17.vcf.bgz > VIP.chr17.vcf 
grep rs1799966 VIP.chr17.vcf | sort -u | less -S
ADD REPLYlink modified 19 months ago by RamRS30k • written 19 months ago by Shicheng Guo8.3k
3

bcftools' documentation is very clear about this

Use -m2 -M2 -v snps to only view biallelic SNPs.

so if it isn't working it must be either a problem with bcftools (that'd be odd) or either a problem with gnomad data.

if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | grep -v ^# | cut -f1-7
17      41223094        rs1799966       T       A       2.93471e+08     PASS
17      41223094        rs1799966       T       C       2.93471e+08     PASS

if you want to do so, you'll have to join biallelic sites into multiallelic records in advance, using bcftools norm -m+ for instance:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  2/0/0/0
17      41223094        rs1799966       T       A,C     2.93471e+08     PASS

and then try to filter your region of interest (I'm using 17:41223090-41223100 here as an example):

$ bcftools view -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS

EDIT: first bcftools view is not really needed, since bcftools norm is already capable of slicing the data by region of interest:

$ bcftools norm -m+ -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS
ADD REPLYlink modified 19 months ago • written 19 months ago by Jorge Amigo12k

Yes. You are exactly right. That's the point I figure out later.

ADD REPLYlink written 19 months ago by Shicheng Guo8.3k

Another thing should be mentioned should be sort ? anyway, now I will sort the vcf first and then use -m2 -M2

ADD REPLYlink written 19 months ago by Shicheng Guo8.3k
1

note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.

ADD REPLYlink written 19 months ago by Jorge Amigo12k

Hi Jorge, What kind of suffix would be preferred when we use bcftools view -Ou -o ? vcf.gz ? or bcf or vcf.bgz?

ADD REPLYlink modified 19 months ago • written 19 months ago by Shicheng Guo8.3k
1

I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the bgzip algorithm rather than simply gzip. but this is simply informative for the human user, not for the software.

ADD REPLYlink written 19 months ago by Jorge Amigo12k
3
gravatar for Hyunmin
5.0 years ago by
Hyunmin30
Seoul
Hyunmin30 wrote:

Another suggestion is bcftools. bcftools has function of 'norm'.

It can be used with indel.

==

bcftools norm [OPTIONS] file.vcf.gz

Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows.

ADD COMMENTlink modified 19 months ago by RamRS30k • written 5.0 years ago by Hyunmin30
2
gravatar for iraun
5.4 years ago by
iraun3.8k
Norway
iraun3.8k wrote:

Something like this?

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf

Explanation:

  • /#/{print;next} ---> Print header lines.
  • if($5 !~ /,/ && length($5)==1 && length($4)==1){print} ---> If ALT column (5 column) does not contain a comma (there aren't ALT alleles), and REF ($4) and ALT ($5) columns have only one letter (SNP) print it.
ADD COMMENTlink modified 19 months ago by RamRS30k • written 5.4 years ago by iraun3.8k
2

A slightly faster perl alternative:

perl -lane 'if(/^#/ or length("$F[3]$F[4]")==2){print}' file.vcf
ADD REPLYlink modified 19 months ago by RamRS30k • written 5.4 years ago by Jorge Amigo12k

thanks it worked wery well and faster.

ADD REPLYlink written 5.4 years ago by xinhui.wang250
0
gravatar for b.ambrozio
6 months ago by
b.ambrozio20
b.ambrozio20 wrote:

I was trying to do the same with all the 22 VCFs of the 1000 genome phase 3 datasets (~15G compressed) concatenated in a single VCF, but with bcftools was taking hours (more than 10 hours and still running). This is the command I was trying:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
bcftools view -Oz --max-alleles 2 --known --min-af 0.01 $vcf_file > $vcf_file.subset.vcf.bgz

Than, in the mean time, I was looking for alternatives and I found out how to address using Plink2:

vcf_file=../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz
$ ./plink2 --recode vcf bgz --vcf $vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only

It was so much faster! Took only 28 min and 13 seconds to get it done. Here's command again, with the output log:

$ ./plink2 --recode vcf bgz --vcf $vcf_file --max-alleles 2 --min-alleles 2 --maf 0.01 --keep-allele-order --out ALL.phase3.biallelic-only
PLINK v2.00a2.3 64-bit (24 Jan 2020)           www.cog-genomics.org/plink/2.0/
(C) 2005-2020 Shaun Purcell, Christopher Chang   GNU General Public License v3
Logging to ALL.phase3.biallelic-only.log.
Options in effect:
  --export vcf bgz
  --keep-allele-order
  --maf 0.01
  --max-alleles 2
  --min-alleles 2
  --out ALL.phase3.biallelic-only
  --vcf ../../ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz

Start time: Wed Mar 18 22:52:54 2020
Note: --keep-allele-order no longer has any effect.
16384 MiB RAM detected; reserving 8192 MiB for main workspace.
Using up to 8 compute threads.
--vcf: 81271745 variants scanned.
--vcf: ALL.phase3.biallelic-only-temporary.pgen +
ALL.phase3.biallelic-only-temporary.pvar +
ALL.phase3.biallelic-only-temporary.psam written.
2504 samples (0 females, 0 males, 2504 ambiguous; 2504 founders) loaded from
ALL.phase3.biallelic-only-temporary.psam.
80855722 out of 81271745 variants loaded from
ALL.phase3.biallelic-only-temporary.pvar.
Note: No phenotype data present.
Calculating allele frequencies... done.
67430946 variants removed due to allele frequency threshold(s)
(--maf/--max-maf/--mac/--max-mac).
13424776 variants remaining after main filters.
--export vcf bgz to ALL.phase3.biallelic-only.vcf.gz ... done.
End time: Wed Mar 18 23:21:07 2020
ADD COMMENTlink written 6 months ago by b.ambrozio20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1697 users visited in the last hour