Question: VCF merge or concatenate?
0
gravatar for rokragna295
5 weeks ago by
rokragna2950 wrote:

I would like to double check whether to use VCF concatenate or VCF merge on my chromosome files. I have done SNP calling using FreeBayes, but split this by chromosomes in order to call SNPs in parallel. I also split some particularly large chromosomes by chromosome position with no overlap, e.g. 1:500,000; 500,001:1,000,000, with the end result being:

Chromosome1-position-1:500,000

Chromosome1-position-500,001:1,000,000

Chromosome2

Chromosome3

I want to first combine the separate VCF files for single, extra data heavy chromosomes that have been split into multiple VCF files for processing. I.e.

Chromosome1-position-1:500,000 + Chromosome1-position-500,001:1,000,000 --> Chromosome1

Then I want to combine all of the separate VCF files into 1, i.e.

Chromosome 1 {merged from step 1} + Chromosome 2 + Chromosome 3 --> SNPs

The VCF tools manual leads me to believe that using VCF-concatenate is the appropriate command for both as they are all separate files of separate chromosomes that I just need to re-attach back to each other but I'm unsure if this is the case. Any advice would be appreciated.

snp vcftools vcf • 179 views
ADD COMMENTlink modified 5 weeks ago by harish220 • written 5 weeks ago by rokragna2950
1

Not what you are asking for, but the currently recommended tool for things like this is bcftools, and no longer VCFtools.

ADD REPLYlink written 5 weeks ago by WouterDeCoster40k
3
gravatar for bari.ballew
5 weeks ago by
bari.ballew180
USA/NIH
bari.ballew180 wrote:

Take a look at bcftools concat (https://samtools.github.io/bcftools/bcftools.html#concat).

In general, concatenate means adding on rows to your vcf (e.g. re-combining split chromosomes), while merge means creating a superset of variant calls across multiple individuals. Concatenating is relatively straightforward, you just need to keep your headers straight (e.g. going from multiple files each with their own headers, to one file with one set of headers, and ideally an additional header row documenting the command used to concatenate).

Merging across individuals can be more problematic. If you have a gvcf, which represents all positions whether they contain a variant or not, you can merge fairly easily. However, if you only have vcfs, which only report a genomic location if there is a variant in that individual, you encounter a missing data problem. When a variant is reported in A.vcf, but not in B.vcf, the merged file will record the variant as missing "./." for sample B. Does that mean there was insufficient coverage to make a call, or was there plenty of coverage and simply no variant reads? It doesn't sound like this is what you're doing here, so don't worry about it for now. :)

ADD COMMENTlink written 5 weeks ago by bari.ballew180

Thanks for such an in-depth answer! I used bcftools concat and it worked great.

ADD REPLYlink written 5 weeks ago by rokragna2950

If an answer was helpful you should upvote it, if the answer resolved your question you should mark it as accepted.
Upvote|Bookmark|Accept

ADD REPLYlink written 4 weeks ago by WouterDeCoster40k
1
gravatar for harish
5 weeks ago by
harish220
harish220 wrote:

Bcftools does this, but is much faster than vcftools.

You have one sample, but the data is split. You'll use a concatenation function., which is bcftools concat.

However, if you have multiple samples, you'll want to merge them together, so use bcftools merge.

In either of the cases, it wouldn't matter if large genomic swathes are not spersed with variants or you have non-overlapping co-ordinates, all the function is going to do is concatenate, check duplicate variants and normalize them if needed.

ADD COMMENTlink written 5 weeks ago by harish220

Thanks for the really clear answer!

ADD REPLYlink written 5 weeks ago by rokragna2950
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1670 users visited in the last hour