Question: Remove duplicate SNPs only based on SNP ID in bcftools
0
gravatar for janhuang.cn
2.2 years ago by
janhuang.cn150
janhuang.cn150 wrote:

I want to remove duplicate SNPs only according to SNP ID (e.g. rs123), from a vcf file.

I used the below command, but it duplicate SNPs still exist. Because they have different bp or allele. (I want to keep only the unique SNP ID)

bcftools norm --remove-duplicates --output file_rmvdup.vcf.gz file.vcf.gz

Then I tried the below command. But It says my vcf is not compressed.

bcftools concat --rm-dup all -a --output file_rmvdup.vcf.gz  file.vcf.gz

So I generate a compressed vcf file, using the below command.

bcftools view file.vcf.gz -Oz -o compressed_file.vcf.gz

But when I run the below command, it returns "could not load index"

bcftools concat --rm-dup all -a --output file_rmvdup.vcf.gz  compressed_file.vcf.gz
snp bcftools duplicate • 3.0k views
ADD COMMENTlink modified 2.2 years ago by Kevin Blighe52k • written 2.2 years ago by janhuang.cn150
3
gravatar for Kevin Blighe
2.2 years ago by
Kevin Blighe52k
Kevin Blighe52k wrote:

Edit (22nd August 2018):

The word 'duplicate' can mean different things in genetics:

  • duplicate variant ID
  • duplicate position
  • duplicate ref > alt change

It is critical that you clarify what it is that you are regarding as duplicated.

Also note that you can control the output format from most BCFtools commands with -O / --output-type (you have not specified this in your commands, meaning that the default, uncompressed VCF, is used):

-O, --output-type <type>          'b' compressed BCF; 'u' uncompressed BCF; 
                                  'z' compressed VCF; 'v' uncompressed VCF [v]

---------------------------------------

Solution 1

Before removing duplicates based on the VCF ID field by this method, I would advise you to first normalise your file (left-aligning indels) and splitting multi-allelic calls. Then, convert it to BCF before trying to remove duplicate IDs. Here is some code to normalise and index it:

bcftools norm -m-any file_rmvdup.vcf.gz | \
bcftools norm --check-ref w -f human_g1k_v37.fasta | \
bcftools annotate -Ob -I +'%ID' > file_rmvdup.bcf ;

bcftools index file_rmvdup.bcf ;

Note that you require a reference FASTA for the middle part of the first command (this step is not really necessary as it will just issue a warning where the REF in your file does not match the reference base in your supplied reference FASTA).

I've also added the bcftools annotate command in order to show you how you could modify the value of the ID field in real time. Most sources set the VCF ID field to rs ID, which is not in any way unique. The only way to set a unique value for the ID field (with some rare exceptions!) is with:

bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%REF:%ALT'

(bcftools annotate -Ob -I +'%ID', in my original command, just retains whatever ID is already set)

Then, if you literally want to remove records with a duplicate ID, then this works; however, it will only keep the first match, which may be dangerous if it's not the variant allele of interest:

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | awk '/^#/{print}; !/^#/{if (!uniq[$3]++) print}'

Solution 2

A much safer solution is to simply merge all multi-allelic 'duplicate' calls back together. This preserves all information that you had in your previous file but may be problematic if you don't want multi-allelic calls (some programs don't like working with multi-alleles):

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs10152211"
15  27221345    rs10152211  C   G,T 100 PASS

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs115550614"
15  34708366    rs115550614 G   A,C 100 PASS

By solution 1, we would have lost a genotype for each of these example rs IDs because there are 2 variants per rs ID, in this case.

Solution 3

Just use the --rm-dup flag from bcftools norm and many other BCFtools commands


With that, hopefully you can see my original point about first clarifying what it is that you are regarding as duplicated.

Kevin

ADD COMMENTlink modified 15 months ago • written 2.2 years ago by Kevin Blighe52k

Thank you.

What is the purpose of normalisation? I am working on it. I download the human_g1k_v37.fasta.gz file from the 1000G FTP. But I cannot uncompress it to .fasta, my Mac return Error 32 - Broken pipe.

ADD REPLYlink written 2.2 years ago by janhuang.cn150

It ensures that multi-allelic sites are split into multiple rows in the VCF, and also that your ref alleles match the sequence in the reference genome FASTA. It may or may not be important for your work.

The reference FASTA has to be uncompressed, not as gzip.

ADD REPLYlink written 2.2 years ago by Kevin Blighe52k

I successfully normalize my file and got the bcf file. But when I apply the below command to remove duplicate, it returns error. "-D is functional but deprecated, replaced by -d both."

bcftools norm  --remove-duplicates --output file_rmvdup.vcf normalized.bcf

So I tried this, but it does not remove duplicate snps

bcftools norm --rm-dup both --output file_rmvdup.vcf normalized.bcf
ADD REPLYlink written 2.2 years ago by janhuang.cn150

Can you possibly paste an example of your VCF here so that I can test?

ADD REPLYlink written 2.2 years ago by Kevin Blighe52k

The original VCF is huge. I share this bcf file I got in the below link. And also where the original VCF is from in the word documents. Thank you so much.

https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0

ADD REPLYlink written 2.2 years ago by janhuang.cn150
1

Thanks for sharing! I'm now beginning to understand why it's not removing these 'duplicate' rs IDs. It's due to the behaviour of BCFtools. There is an indirect way to do it with BCFtools, but also another solution with awk.

As you mentioned, many of the records have the same rs ID but alternate bases, which is perfectly reasonable (a rs ID can have multiple variant alleles). If we take a look at some of these:

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs10152211" | cut -f1-7
15  27221345    rs10152211  C   G   100 PASS
15  27221345    rs10152211  C   T   100 PASS
bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs115550614" | cut -f1-7
15  34708366    rs115550614 G   A   100 PASS
15  34708366    rs115550614 G   C   100 PASS

I did some exploring and found the duplicate IDs with this command:

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

Solution 1

as above in edited original answer

Solution 2

as above in edited original answer

ADD REPLYlink modified 15 months ago • written 2.2 years ago by Kevin Blighe52k
1

Thank you so much!

I now use the below command to normalize and merge the multi-allelic, and then save as bcf. An updated bcf file is available here: https://www.dropbox.com/sh/sw781hk4n7t1ntv/AABYfT80cBMSFuzKooE_1rKea?dl=0

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.vcf.gz  | bcftools norm -Ou --fasta-ref human_g1k_v37.fasta | bcftools annotate -Ob -I +'%ID' > chr15_1000Gphase3_EUR_snp_maf.bcf

But when I check the duplicate SNPs using the below command. There are a few duplicate SNPs remained. I think these were not merged because they have different bp? or they have same alleles?

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

# example of duplicate snp
bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs1610794" | cut -f1-7

15      82635194        rs1610794       T       C       100     PASS
15      82932518        rs1610794       T       C       100     PASS

Therefore, for these SNPs I would like to use the Solution 1 you mentioned to keep only one call. I made some revision because I want to save the result as rmvdup.vcf, while your command above applies "bcftools view" to print the result. But then I got stuck here, I tried to place "-Ou" at different position, but the command cannot run, and no vcf was saved.

bcftools norm -Ou chr15_1000Gphase3_EUR_snp_maf_rmvdup.vcf chr15_1000Gphase3_EUR_snp_maf.bcf | awk '!uniq[$3]++'
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by janhuang.cn150

Okay, this situation is bizarre! I have never seen this before where the same rs ID has different positions in a 1000 Genomes file - I think that these are errors but I cannot be sure.

I used this command to find the SNPs that behave this way:

bcftools norm -Ov -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | cut -f3 | grep -e "^rs" | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'
rs145926341
rs145926341
rs542232278
rs542232278
rs1610794
rs1610794

bcftools view chr15_1000Gphase3_EUR_snp_maf.bcf | grep -e "rs145926341" -e "rs542232278" -e "rs1610794" | cut -f1-7
15  82605335    rs145926341 C   A   100 PASS
15  82626078    rs542232278 C   G   100 PASS
15  82635194    rs1610794   T   C   100 PASS
15  82902661    rs145926341 C   A   100 PASS
15  82923405    rs542232278 C   G   100 PASS
15  82932518    rs1610794   T   C   100 PASS

The way to combine my 2 solutions together is as follows (note that the bcftools annotate command is not necessary):

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | awk '!uniq[$3]++'

The only problem is that, of the 3 duplicate SNPs that remained, we now have rs542232278 at the incorrect position. I checked each of these SNPs in the UCSC genome browser and the correct positions are:

15 82605335 rs145926341 C A

15 82923405 rs542232278 C G

15 82635194 rs1610794 T C

If you want to ensure that you have no duplicates and also have the correct SNP position for the 3 strange duplicates, then the following works:

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf | bcftools norm -Ov --check-ref w --fasta-ref human_g1k_v37.fasta | awk '{if ($2$3!="82626078rs542232278" && $2$3!="82902661rs145926341" && $2$3!="82932518rs1610794") print $0}'

It may be very annoying doing this for each chromosome but, hey, this is the way that bioinformatics is!

Note that the --check-ref parameter should warn when a reference allele in the VCF is not compatible with the reference FASTA. However, the 3 strange duplicates avoid this because the reference base at each coincidentally matches the base in the reference genome too.

Regarding bcftools annotate, it is useful when you want to change the value in the ID field. For example, rs IDs can be difficult to work with, so, to make variant truly unique could could change the VCF ID field to chr:pos:ref:alt with:

bcftools annotate -Ov -x 'ID' -I +'%CHROM:%POS:%REF:%ALT' MyFile.bcf
ADD REPLYlink modified 2.2 years ago • written 2.2 years ago by Kevin Blighe52k
2

Here is a possible explanation of Multiple Mapping Positions for a Single SNP

I think I should consider changing the ID to '%CHROM:%POS:%ID'. So I used this command to merge multi-allelic, to normalize, and to replace the ID with chr:pos:id.

bcftools norm -Ou -m+any chr15_1000Gphase3_EUR_snp_maf.bcf  | bcftools norm -Ou --fasta-ref human_g1k_v37.fasta | bcftools annotate --output-type b --output chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf  -I '%CHROM:%POS:%ID'

chr15_1000Gphase3_EUR_snp_maf.bcf is available here

And then I used this command to check if any duplicate exists in the new ID column

bcftools view chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf | cut -f3  | awk '{a[$0]++; if(a[$0]==2) print; if (a[$0]>=2) print}'

And this command to check the duplicate rs145926341, and now if I use the new ID column (chr:pos:rsid), these two records are different.

bcftools view  chr15_1000Gphase3_EUR_snp_maf_chrposrsid.bcf  | grep -e "rs145926341" | cut -f1-7

15      82605335        15:82605335:rs145926341 C       A       100     PASS
15      82902661        15:82902661:rs145926341 C       A       100     PASS

Thank you so much again.

ADD REPLYlink written 2.2 years ago by janhuang.cn150

Looks great! - thanks. No problem.

Thanks regarding the link to - very interesting!

ADD REPLYlink written 2.2 years ago by Kevin Blighe52k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 941 users visited in the last hour