Question

VCF merge or concatenate?

6

Entering edit mode

4.8 years ago

rokragna295 ▴ 60

I would like to double check whether to use VCF concatenate or VCF merge on my chromosome files. I have done SNP calling using FreeBayes, but split this by chromosomes in order to call SNPs in parallel. I also split some particularly large chromosomes by chromosome position with no overlap, e.g. 1:500,000; 500,001:1,000,000, with the end result being:

Chromosome1-position-1:500,000

Chromosome1-position-500,001:1,000,000

Chromosome2

Chromosome3

I want to first combine the separate VCF files for single, extra data heavy chromosomes that have been split into multiple VCF files for processing. I.e.

Chromosome1-position-1:500,000 + Chromosome1-position-500,001:1,000,000 --> Chromosome1

Then I want to combine all of the separate VCF files into 1, i.e.

Chromosome 1 {merged from step 1} + Chromosome 2 + Chromosome 3 --> SNPs

The VCF tools manual leads me to believe that using VCF-concatenate is the appropriate command for both as they are all separate files of separate chromosomes that I just need to re-attach back to each other but I'm unsure if this is the case. Any advice would be appreciated.

snp VCF VCFTools • 19k views

ADD COMMENT • link written 4.8 years ago by rokragna295 ▴ 60

1

Entering edit mode

Not what you are asking for, but the currently recommended tool for things like this is bcftools, and no longer VCFtools.

ADD REPLY • link 4.8 years ago by WouterDeCoster 47k

3

Entering edit mode

4.8 years ago

harish ▴ 450

Bcftools does this, but is much faster than vcftools.

You have one sample, but the data is split. You'll use a concatenation function., which is bcftools concat.

However, if you have multiple samples, you'll want to merge them together, so use bcftools merge.

In either of the cases, it wouldn't matter if large genomic swathes are not spersed with variants or you have non-overlapping co-ordinates, all the function is going to do is concatenate, check duplicate variants and normalize them if needed.

ADD COMMENT • link 4.8 years ago by harish ▴ 450

0

Entering edit mode

Thanks for the really clear answer!

ADD REPLY • link 4.7 years ago by rokragna295 ▴ 60

score 18 · Accepted Answer · 2019-07-19

Take a look at bcftools concat (https://samtools.github.io/bcftools/bcftools.html#concat).

In general, concatenate means adding on rows to your vcf (e.g. re-combining split chromosomes), while merge means creating a superset of variant calls across multiple individuals. Concatenating is relatively straightforward, you just need to keep your headers straight (e.g. going from multiple files each with their own headers, to one file with one set of headers, and ideally an additional header row documenting the command used to concatenate).

Merging across individuals can be more problematic. If you have a gvcf, which represents all positions whether they contain a variant or not, you can merge fairly easily. However, if you only have vcfs, which only report a genomic location if there is a variant in that individual, you encounter a missing data problem. When a variant is reported in A.vcf, but not in B.vcf, the merged file will record the variant as missing "./." for sample B. Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads? It doesn't sound like this is what you're doing here, so don't worry about it for now. :)