Merging multiple VCF files into one with bcftools
2
0
Entering edit mode
3.3 years ago
nadya ▴ 30

Hi there,

I have three VCF files for a non-model species that I'd like to merge into a single VCF file. Each file corresponds with a unique set of individuals and a subset of SNPs that is overlapping among files.

It seems based on other posts that a popular tool for performing this task is bcftools merge. I first compress my VCF files in bgzip format with

bgzip -ci infileA.vcf > outfileA.vcf.gz

This also creates an index file outfileA.vcf.gz.gzi

I next attempt to merge the files with

bcftools merge outfileA.vcf.gz outfileB.vcf.gz outfileC.vcf.gz

But I get an error message that the index files cannot be found, even though they are in the same directory: Failed to open outfileA.vcf.gz: could not load index

I've seen posts that discuss this error message when attempting to use other tools in bcftools...some of these posts describe the issue as resulting from the system attempting to open too many files at once. But the system I am working on allows over 4,000 files to be open at once, and either way I am working with only six files (including the compressed VCF files and index files).

When I check the format of my files with htsfile I get the following:

htsfile outfileA.vcf.gz
outfileA.vcf.gz: VCF version 4.2 BGZF-compressed variant calling data

htsfile outfileA.vcf.gz.gzi
outfileA.vcf.gz.gzi: unknown data

This suggests there may be some problem with the index file?

I've also attempted to generate index files outside fo the bgzip command, using bcftools index as follows:

bcftools index -t outfileA.vcf.gz -o outfileA.vcf.gz.tbi

But get an error about sorting: [E::hts_idx_push] Unsorted positions on sequence #1. However, when I then attempt to sort prior to indexing:

bcftools sort outfileA.vcf.gz -o outfileA.sort.vcf.gz

I get another error message: [W::vcf_parse] Contig 'NW_083863.1' is not defined in the header. (Quick workaround: index the file with tabix.)

When I then attempt to index with tabix:

tabix outfileA.vcf.gz 

I get another error, which seems to hint at a need for sorting: [ti_index_core] the file out of order at line 45

My main issue is getting bcftools merge (or an analogous tool) to work for merging my VCF files. Has anyone else run into this same issue? Is there something obvious I'm doing wrong? Thanks for any tips!

software-error genome snp next-gen • 8.4k views
ADD COMMENT
0
Entering edit mode

Can you try:

tabix -p vcf outfileA.vcf.gz

If you get the same error, you might need to create some sort of dummy reference genome for your non-model organism.

ADD REPLY
0
Entering edit mode

Thanks for your quick reply @_r_am! I tried the command you suggested, which returned the error [ti_index_core] the file out of order at line 45

ADD REPLY
0
Entering edit mode

Does your VCF header list all the contigs in the VCF?

Try:

zgrep "^##contig" | sed -Ee 's/##contig=<ID=([^,]+),.+/\1/' > sort > contigs_in_header.txt
zgrep -v "^#" vcf_file | cut -f1 | sort -u >contigs_in_data.txt

The two files above should be identical.

ADD REPLY
0
Entering edit mode

Thanks @_r_am! I tried the code you suggested on one of my vcf.gz files (the same file for each). The task doesn't run to completion, even after several hours. Could be because my vcf is so large (~22Gb file size)? Thanks for your help.

ADD REPLY
0
Entering edit mode

Several hours is a lot of time. The first one should have run created some output almost immediately. Did that happen?

ADD REPLY
1
Entering edit mode
3.3 years ago
rm -f file.list
for F in  outfileA.vcf.gz outfileB.vcf.gz outfileC.vcf.gz
do
      bcftools sort -O b -o ${F}.bcf $F
      bcftools index ${F}.bcf
      echo "${F}.bcf" >> file.list
done

bcftools merge  --file-list file.list  -O z -o merged.vcf.gz
ADD COMMENT
0
Entering edit mode

OP tried that, their VCF header is broken.

ADD REPLY

Login before adding your answer.

Traffic: 2503 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6