Question: VCF merge or concatenate?
19 months ago by
rokragna295 wrote:

I would like to double-check whether to use VCF concatenate or VCF merge on my per-chromosome files. I called SNPs with FreeBayes, but split the job by chromosome so the calling could run in parallel. I also split some particularly large chromosomes by position, with no overlap between pieces (e.g. 1:500,000; 500,001:1,000,000), with the end result being:





First, I want to combine the VCF files for each particularly data-heavy chromosome that was split into multiple VCF files for processing, i.e.

Chromosome1-position-1:500,000 + Chromosome1-position-500,001:1,000,000 --> Chromosome1

Then I want to combine all of the separate VCF files into 1, i.e.

Chromosome 1 {merged from step 1} + Chromosome 2 + Chromosome 3 --> SNPs

The VCFtools manual leads me to believe that vcf-concat is the appropriate command for both steps, since these are all separate files covering separate chromosomes or regions that I just need to re-attach to each other, but I'm unsure whether this is the case. Any advice would be appreciated.


Not what you are asking for, but the currently recommended tool for things like this is bcftools, and no longer VCFtools.

written 19 months ago by WouterDeCoster
19 months ago by
bari.ballew wrote:

Take a look at bcftools concat.

In general, concatenating means appending rows to your VCF (e.g. re-combining split chromosomes), while merging means creating a superset of variant calls across multiple individuals. Concatenating is relatively straightforward; you just need to keep your headers straight (i.e. go from multiple files, each with its own header, to one file with a single header, ideally with an additional header line documenting the command used to concatenate).
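
To make the "adding rows, keeping one header" idea concrete, here is a minimal Python sketch of what concatenation does conceptually. It is not how bcftools concat is implemented; the function name `concat_vcf_texts`, the sample name `sampleA`, and the two toy records are all made up for illustration, and the sketch assumes the pieces are passed in genomic order.

```python
def concat_vcf_texts(chunks):
    """Join VCF texts that share the same samples: one header, all records.

    Illustrative only. Assumes `chunks` are already in genomic order.
    """
    header, records, columns = [], [], None
    for text in chunks:
        lines = text.splitlines()
        chunk_header = [l for l in lines if l.startswith("#")]
        # The #CHROM line names the sample columns; they must agree across files.
        chunk_columns = next(l for l in chunk_header if l.startswith("#CHROM"))
        if columns is None:
            header, columns = chunk_header, chunk_columns  # keep the first header
        elif chunk_columns != columns:
            raise ValueError("sample columns differ between files")
        records += [l for l in lines if l.strip() and not l.startswith("#")]
    return "\n".join(header + records) + "\n"


# Hypothetical toy data: two pieces of chromosome 1 from the same sample.
HEADER = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tsampleA\n"
)
part1 = HEADER + "1\t100\t.\tA\tG\t50\t.\t.\tGT\t0/1\n"
part2 = HEADER + "1\t500100\t.\tC\tT\t60\t.\t.\tGT\t1/1\n"

combined = concat_vcf_texts([part1, part2])
print(combined, end="")
```

The result has a single header followed by the records of both pieces, which is the shape you want when re-assembling a split chromosome.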

Merging across individuals can be more problematic. If you have a gVCF, which represents all positions whether or not they contain a variant, you can merge fairly easily. However, if you only have VCFs, which report a genomic location only when that individual carries a variant there, you run into a missing-data problem. When a variant is reported in A.vcf but not in B.vcf, the merged file will record the genotype for sample B as missing ("./."). Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads? It doesn't sound like this is what you're doing here, so don't worry about it for now. :)
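
The missing-data problem can be shown with a small Python sketch. This only mimics the general idea of a merge across individuals, not the actual bcftools merge logic; `merge_calls`, the sample names `A` and `B`, and the genotypes are hypothetical.

```python
def merge_calls(calls_by_sample):
    """Build a superset of sites across samples.

    calls_by_sample maps sample name -> {(chrom, pos, ref, alt): genotype}.
    A sample with no record at a site gets "./." (missing), because a
    variant-only VCF cannot tell "no coverage" apart from "reference".
    """
    samples = sorted(calls_by_sample)
    sites = sorted({site for calls in calls_by_sample.values() for site in calls})
    rows = [
        (site, [calls_by_sample[s].get(site, "./.") for s in samples])
        for site in sites
    ]
    return samples, rows


# Hypothetical variant-only calls for two individuals.
calls = {
    "A": {("1", 100, "A", "G"): "0/1", ("1", 250, "C", "T"): "1/1"},
    "B": {("1", 100, "A", "G"): "1/1"},
}

samples, rows = merge_calls(calls)
for site, genotypes in rows:
    print(site, dict(zip(samples, genotypes)))
```

Here sample B ends up with "./." at position 250, and nothing in the input says whether B was homozygous reference there or simply not covered.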


Thanks for such an in-depth answer! I used bcftools concat and it worked great.

written 19 months ago by rokragna295

If an answer was helpful, you should upvote it; if the answer resolved your question, you should mark it as accepted.

written 19 months ago by WouterDeCoster
19 months ago by
harish wrote:

bcftools does this as well, and is much faster than VCFtools.

You have one sample (or one set of samples), but the data is split by region. For that you'll use a concatenation function, which is bcftools concat.

However, if you have multiple samples, you'll want to merge them together, so use bcftools merge.

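In either case, it doesn't matter whether large genomic stretches contain no variants or whether the files cover non-overlapping coordinates. bcftools concat simply joins the records in the order the files are given, and it can drop duplicate records when input files overlap (see the -a/--allow-overlaps and -d/--rm-dups options). Normalizing variants, if you need it, is a separate step with bcftools norm.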
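
As a rough sketch of how the two commands map onto the steps in the question, the Python snippet below only builds and prints the command lines; it does not run bcftools. All file names are hypothetical, and `-O z` (bgzip-compressed output) is just one possible output choice. Note that bcftools merge expects bgzip-compressed, indexed input files.

```python
import shlex


def bcftools_concat_cmd(inputs, output):
    # Same samples, different regions: pass the files in genomic order.
    return ["bcftools", "concat", "-O", "z", "-o", output, *inputs]


def bcftools_merge_cmd(inputs, output):
    # Different samples, same regions: inputs must be bgzipped and indexed.
    return ["bcftools", "merge", "-O", "z", "-o", output, *inputs]


# Step 1: re-join the pieces of a split chromosome (hypothetical file names).
step1 = bcftools_concat_cmd(
    ["chr1_part1.vcf.gz", "chr1_part2.vcf.gz"], "chr1.vcf.gz"
)
# Step 2: join the per-chromosome files into one call set.
step2 = bcftools_concat_cmd(
    ["chr1.vcf.gz", "chr2.vcf.gz", "chr3.vcf.gz"], "snps.vcf.gz"
)
# Only needed if separate individuals were called into separate files.
across_samples = bcftools_merge_cmd(
    ["sampleA.vcf.gz", "sampleB.vcf.gz"], "cohort.vcf.gz"
)

for cmd in (step1, step2, across_samples):
    print(shlex.join(cmd))
```

The two concat steps could equally be done in one call by listing every piece in genomic order; they are kept separate here only to mirror the question.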


Thanks for the really clear answer!

written 19 months ago by rokragna295