VCF merge or concatenate?
2
3
Entering edit mode
2.4 years ago
rokragna295 ▴ 30

I would like to double check whether to use VCF concatenate or VCF merge on my chromosome files. I have done SNP calling using FreeBayes, but split this by chromosomes in order to call SNPs in parallel. I also split some particularly large chromosomes by chromosome position with no overlap, e.g. 1:500,000; 500,001:1,000,000, with the end result being:

Chromosome1-position-1:500,000

Chromosome1-position-500,001:1,000,000

Chromosome2

Chromosome3

I want to first combine the separate VCF files for single, extra data heavy chromosomes that have been split into multiple VCF files for processing. I.e.

Chromosome1-position-1:500,000 + Chromosome1-position-500,001:1,000,000 --> Chromosome1

Then I want to combine all of the separate VCF files into 1, i.e.

Chromosome 1 {merged from step 1} + Chromosome 2 + Chromosome 3 --> SNPs

The VCF tools manual leads me to believe that using VCF-concatenate is the appropriate command for both as they are all separate files of separate chromosomes that I just need to re-attach back to each other but I'm unsure if this is the case. Any advice would be appreciated.

snp VCF VCFTools • 4.4k views
ADD COMMENT
1
Entering edit mode

Not what you are asking for, but the currently recommended tool for things like this is bcftools, and no longer VCFtools.

ADD REPLY
11
Entering edit mode
2.4 years ago
bari.ballew ▴ 320

Take a look at bcftools concat (https://samtools.github.io/bcftools/bcftools.html#concat).

In general, concatenate means adding on rows to your vcf (e.g. re-combining split chromosomes), while merge means creating a superset of variant calls across multiple individuals. Concatenating is relatively straightforward, you just need to keep your headers straight (e.g. going from multiple files each with their own headers, to one file with one set of headers, and ideally an additional header row documenting the command used to concatenate).

Merging across individuals can be more problematic. If you have a gvcf, which represents all positions whether they contain a variant or not, you can merge fairly easily. However, if you only have vcfs, which only report a genomic location if there is a variant in that individual, you encounter a missing data problem. When a variant is reported in A.vcf, but not in B.vcf, the merged file will record the variant as missing "./." for sample B. Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads? It doesn't sound like this is what you're doing here, so don't worry about it for now. :)

ADD COMMENT
0
Entering edit mode

Thanks for such an in-depth answer! I used bcftools concat and it worked great.

ADD REPLY
0
Entering edit mode

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLY
2
Entering edit mode
2.4 years ago
harish ▴ 370

Bcftools does this, but is much faster than vcftools.

You have one sample, but the data is split. You'll use a concatenation function., which is bcftools concat.

However, if you have multiple samples, you'll want to merge them together, so use bcftools merge.

In either of the cases, it wouldn't matter if large genomic swathes are not spersed with variants or you have non-overlapping co-ordinates, all the function is going to do is concatenate, check duplicate variants and normalize them if needed.

ADD COMMENT
0
Entering edit mode

Thanks for the really clear answer!

ADD REPLY

Login before adding your answer.

Traffic: 2562 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6