Question

Merging vcf files (intersection and union)

1

Entering edit mode

6.0 years ago

wangdavid758 ▴ 30

How can I merge 2 vcf files, each with multiple samples, the so the final file has all the samples but the intersection of the SNPs. How about union on the SNPs?

vcf merge • 8.6k views

ADD COMMENT • link updated 5.8 years ago by ami23.sarr • 0 • written 6.0 years ago by wangdavid758 ▴ 30

0

Entering edit mode

hello kevin, i have a snp typing file whose genotypes are in ATGC format that i want to convert to 0,1,2(Genotypes must be coded as dosage of allele 'b' 0, 1, 2.) your help is welcome Genotype calls: 0: A1A1 1: A1A2 or A2A1 2: A2A2 5: missing i use plink and my file is .csv format. i can do send my file if yes

ADD REPLY • link 5.8 years ago by ami23.sarr • 0

0

Entering edit mode

Hello, you may consider opening up a new question for this, and make sure that you add the plink tag.

ADD REPLY • link 5.8 years ago by Kevin Blighe 87k

0

Entering edit mode

Hi kevin i open a new question

ADD REPLY • link 5.8 years ago by ami23.sarr • 0

1

Entering edit mode

Yes, but there was no need to delete that post :-)

ADD REPLY • link 5.8 years ago by Ram 43k

score 16 · Accepted Answer · 2018-03-31

16

Entering edit mode

6.0 years ago

Kevin Blighe 87k

There are multiple questions in your post but I'll make some answers that I believe will answer all of them.

Firstly, I recommend that you normalise your VCFs so that everything else that you do will work as planned. 'Normalising' a VCF usally means splitting multi-allelic calls, left-aligning indels, and ensuring that the alleles specified as REF in your VCF are also present in the reference genome. You can do this with (for each VCF file):

#1st pipe, splits multi-allelic calls into separate variant calls
#2nd pipe, left-aligns indels and issues warnings when the REF base in your VCF does not match the base in the supplied FASTA reference genome
bcftools norm -m-any MyVariants1.VCF | bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta > MyVariants1.norm.bcf ;
bcftools norm -m-any MyVariants2.VCF | bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta > MyVariants2.norm.bcf ;

bcftools index MyVariants1.norm.bcf ;
bcftools index MyVariants2.norm.bcf ;

You can then merge these together with:

bcftools merge -Ob -m none MyVariants1.norm.bcf MyVariants2.norm.bcf > MyVariants.norm.merge.bcf ;

--------------------------------------------

For the intersection, etc, you may additionally consider setting a unique value for each variant in the ID column. The following line will eliminate whatever is already set for the ID field and then set a new value:

bcftools annotate -Ob -x 'ID' -I +'%CHROM:%POS:%REF:%ALT'

You can then do various operations using bcftools isec. There are examples given in the program when you just type bcftools isec:

-------------------------------------------

Hope that this helps.

I have done intersections and other complex joins before, but by using AWK indexed arrays (can use Python, too) after having set the ID field to a unique value, like I did in the code above. Sometimes, BCFtools, VCFtools, etc., do not behave as you expect.

Kevin

ADD COMMENT • link 3.2 years ago by Kevin Blighe 87k

0

Entering edit mode

Hello Kevin,

I am vinay kumar reddy nannuru, i have a vcf file of 92 samples with 14000 variants in each and i want to merge with another publicly available dataset consists of 30,000 samples with 950,000 variants. How can i merge them by having all samples in same output file with similar variant positions in all samples of output. Could you please give me a clear explanation. And second question how can i select some subset of samples from the second dataset of 30000 samples. Thank for your time.

Vinay Kumar Reddy Nannuru

ADD REPLY • link 5.8 years ago by vinayreddynannuru ▴ 20

1

Entering edit mode

Hello Vinay,

Both of your datasets are in a VCF? - you have 2 VCFs, therefore?

To faithfully merge these, you should first normalise them both in the same way. By 'normalising', we split multi-allelic calls (which would make merging difficult), left-align indels, and also check that the reference bases line up to a chosen reference genome. For the reference check to work, the contigs in your VCF have to match those in the reference, and obviously your original data had to have been aligned to the chosen reference (note that this step is no actually necessary - it's more important to normalise with the first command before the pipe symbol):

bcftools norm -m-any MyData.VCF | bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta > MyData.norm.bcf ;
bcftools index MyData.bcf ;

bcftools norm -m-any PublicData.vcf | bcftools norm -Ob --check-ref w -f /ReferenceMaterial/1000Genomes/human_g1k_v37.fasta > PublicData.norm.bcf ;
bcftools index PublicData.bcf ;

Then, we can merge them by only including variants with a PASS filter:

bcftools merge -f PASS -Ob -m none MyData.bcf PublicData.bcf -o ProjectMerge.bcf ;

...or just merge everything irrespective of FILTER flags:

bcftools merge -Ob -m none MyData.bcf PublicData.bcf -o ProjectMerge.bcf ;

ADD REPLY • link 5.6 years ago by Kevin Blighe 87k

1

Entering edit mode

Hello Kevin,

Thank you for answer and the thing is both my project data and public data both are aligned to same reference genome already. so can i now directly proceed by merging them.

thanks, vinay kumar reddy nannuru

ADD REPLY • link 5.8 years ago by vinayreddynannuru ▴ 20

0

Entering edit mode

Yes, you can proceed

ADD REPLY • link 5.7 years ago by Kevin Blighe 87k

0

Entering edit mode

I'd recommending normalizing them before any other operation, that removes a lot of unpredictability.

ADD REPLY • link 5.7 years ago by Ram 43k