Question: how to remove multiallelic from VCF
3
gravatar for xinhui.wang
4.5 years ago by
xinhui.wang180
Netherlands
xinhui.wang180 wrote:

I am working on VCF dorument and there are many indels and multiallelic in my VCF. However, it is not necessary for me to includ them in our analysis and I would like to remove them. Is there a simple way to do that with vcftools or bcftools?

Any help would be greatly appreciated.

Thanks and with best regards,

Xinhui

snp • 11k views
ADD COMMENTlink modified 4.1 years ago by Hyunmin20 • written 4.5 years ago by xinhui.wang180
12
gravatar for Jorge Amigo
4.5 years ago by
Jorge Amigo11k
Santiago de Compostela, Spain
Jorge Amigo11k wrote:

You can filter anything you want using bcftools view. In your case, if you want to filter out indels and multiallelic, you would need something like this:

bcftools view --max-alleles 2 --exclude-types indels input.vcf.gz

A typical command to filter out anything but biallelic SNPs, as stated in the bcftools manual, is the following:

bcftools view -m2 -M2 -v snps input.vcf.gz
ADD COMMENTlink modified 9 months ago by RamRS24k • written 4.5 years ago by Jorge Amigo11k

really great and I used the second one since biallelic snps are we need.

ADD REPLYlink written 4.5 years ago by xinhui.wang180

be careful with this command since it gives biallelic SNPs only. if you want to extract all SNPs with AT MOST 2 alleles, (which is not strictly the same as BIALLELIC, then you need to remove the -m2 option.

ADD REPLYlink written 4.5 years ago by Jorge Amigo11k

Sorry. Do you mind to give me a small example of between biallelic snps with snps at most 2 alleles? Thanks and with best regards, Xinhui

ADD REPLYlink written 4.5 years ago by xinhui.wang180
1

strictly speaking all snps are at least biallelic, since the variant detection implies that there's other allele apart from the reference one, so you shouldn't worry if you're working with variants only. the vcf format allows you to define positions where you may have a reference allele but not a alternative allele, and those ones would be removed on the first code. this is indeed not common at all, but you always have to keep in mind what you're filtering out, so removing the -m2 is a way not to worry about that possible issue.

ADD REPLYlink written 4.5 years ago by Jorge Amigo11k

Not works well for me. rs1799966 is tri-allelic, however, rs1799966 still left after -m2 -M2 -v snps

bcftools view -f PASS -m2 -M2 -v snps -R VIP.hg19.bed gnomad.exomes.r2.1.sites.chr17.vcf.bgz > VIP.chr17.vcf 
grep rs1799966 VIP.chr17.vcf | sort -u | less -S
ADD REPLYlink modified 9 months ago by RamRS24k • written 9 months ago by Shicheng Guo7.8k
2

bcftools' documentation is very clear about this

Use -m2 -M2 -v snps to only view biallelic SNPs.

so if it isn't working it must be either a problem with bcftools (that'd be odd) or either a problem with gnomad data.

if you query gnomad data for that particular position you'll see that there are 2 entries instead of 1, each one with 2 alleles maximum, so the -M2 filter from bcftools won't be able to filter them:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | grep -v ^# | cut -f1-7
17      41223094        rs1799966       T       A       2.93471e+08     PASS
17      41223094        rs1799966       T       C       2.93471e+08     PASS

if you want to do so, you'll have to join biallelic sites into multiallelic records in advance, using bcftools norm -m+ for instance:

$ bcftools view -r 17:41223094-41223094 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  2/0/0/0
17      41223094        rs1799966       T       A,C     2.93471e+08     PASS

and then try to filter your region of interest (I'm using 17:41223090-41223100 here as an example):

$ bcftools view -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools norm -m+ | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS

EDIT: first bcftools view is not really needed, since bcftools norm is already capable of slicing the data by region of interest:

$ bcftools norm -m+ -r 17:41223090-41223100 https://storage.googleapis.com/gnomad-public/release/2.1/vcf/exomes/gnomad.exomes.r2.1.sites.chr17.vcf.bgz | bcftools view -m2 -M2 -v snps | grep -v ^# | cut -f1-7
Lines   total/split/realigned/skipped:  6/0/0/0
17      41223090        rs766305255     G       A       10717.2 PASS
17      41223091        rs70953660      G       A       140083  PASS
ADD REPLYlink modified 9 months ago • written 9 months ago by Jorge Amigo11k

Yes. You are exactly right. That's the point I figure out later.

ADD REPLYlink written 9 months ago by Shicheng Guo7.8k

Another thing should be mentioned should be sort ? anyway, now I will sort the vcf first and then use -m2 -M2

ADD REPLYlink written 9 months ago by Shicheng Guo7.8k
1

note that the merging process when normalizing can be quite intense on such large files as gnomad's, so if you are interested in particular regions do not forget to indicate this in your normalization command. I've edited my answer to reflect this.

ADD REPLYlink written 9 months ago by Jorge Amigo11k

Hi Jorge, What kind of suffix would be preferred when we use bcftools view -Ou -o ? vcf.gz ? or bcf or vcf.bgz?

ADD REPLYlink modified 9 months ago • written 9 months ago by Shicheng Guo7.8k
1

I have always used vcf.gz, although I've seen vcf.bgz probably to indicate that the compression has been performed with the bgzip algorithm rather than simply gzip. but this is simply informative for the human user, not for the software.

ADD REPLYlink written 9 months ago by Jorge Amigo11k
2
gravatar for iraun
4.5 years ago by
iraun3.6k
Norway
iraun3.6k wrote:

Something like this?

awk '/#/{print;next}{if($5 !~ /,/ && length($5)==1 && length($4)==1){print}}' file.vcf

Explanation:

  • /#/{print;next} ---> Print header lines.
  • if($5 !~ /,/ && length($5)==1 && length($4)==1){print} ---> If ALT column (5 column) does not contain a comma (there aren't ALT alleles), and REF ($4) and ALT ($5) columns have only one letter (SNP) print it.
ADD COMMENTlink modified 9 months ago by RamRS24k • written 4.5 years ago by iraun3.6k
2

A slightly faster perl alternative:

perl -lane 'if(/^#/ or length("$F[3]$F[4]")==2){print}' file.vcf
ADD REPLYlink modified 9 months ago by RamRS24k • written 4.5 years ago by Jorge Amigo11k

thanks it worked wery well and faster.

ADD REPLYlink written 4.5 years ago by xinhui.wang180
2
gravatar for Hyunmin
4.1 years ago by
Hyunmin20
Seoul
Hyunmin20 wrote:

Another suggestion is bcftools. bcftools has function of 'norm'.

It can be used with indel.

==

bcftools norm [OPTIONS] file.vcf.gz

Left-align and normalize indels, check if REF alleles match the reference, split multiallelic sites into multiple rows; recover multiallelics from multiple rows.

ADD COMMENTlink modified 9 months ago by RamRS24k • written 4.1 years ago by Hyunmin20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1768 users visited in the last hour