VCF-merge fails due to tabix not producing .tbi files
1
0
Entering edit mode
3.9 years ago
brallen ▴ 50

I'm trying to use vcf-merge to combine 2 exome capture vcf files (~250K and ~330K in size) before trying it on all 96 samples. I'd appreciate any advice on the best way to do that! I've detailed what I've tried below. My issue seems to be with using tabix to convert the files to .tbi format.

Step 1: BGZIP

So far, I've zipped the files without issue:

bgzip sample1.vcf

bgzip sample2.vcf

Which produces:

sample1.vcf.gz

sample2.vcf.gz

Step 2: TABIX

When I try to use this command: tabix -h -p vcf sample1.vcf.gz the stderr is: Region 536999277..536999278 cannot be stored in a tbi index. Try using a csi index with min_shift = 14, n_lvls >= 6. tbx_index_build failed: sample1.vcf.gz

Using the -C option which works:

tabix -C -h -p vcf sample1.vcf.gz

tabix -C -h -p vcf sample2.vcf.gz

Which produces:

sample1.vcf.gz.csi

sample2.vcf.gz.csi

Step 3: VCF-MERGE

When I use this command: vcf-merge sample1.vcf.gz.csi sample2.vcf.gz.csi > out.vcf.gz the merge fails, and I get this stderr:

Broken VCF header, no column names?
 at /usr/share/perl5/Vcf.pm line 172, <__ANONIO__> line 1.
        Vcf::throw(Vcf4_2=HASH(0x5645a760bc38), "Broken VCF header, no column names?") called at /usr/share/perl5/Vcf.pm line 867
        VcfReader::_read_column_names(Vcf4_2=HASH(0x5645a760bc38)) called at /usr/share/perl5/Vcf.pm line 602
        VcfReader::parse_header(Vcf4_2=HASH(0x5645a760bc38)) called at /usr/bin/vcf-merge line 183
        main::init_cols(HASH(0x5645a761f438), Vcf4_2=HASH(0x5645a760b248)) called at /usr/bin/vcf-merge line 279
        main::merge_vcf_files(HASH(0x5645a761f438)) called at /usr/bin/vcf-merge line 12

If I exclude the tabix files, the merge still fails and the stderr says: The column names not tab-separated? Could not load .tbi index of sample1.vcf.gz. The command exited with an error. Is the file tabix indexed?

vcf vcf-merge tabix • 4.0k views
ADD COMMENT
0
Entering edit mode

Instead of using bold attribute:

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

Thank you!

ADD REPLY
0
Entering edit mode

Sorry about that! Fixed it.

ADD REPLY
0
Entering edit mode

Thanks! Easy to read this way.

ADD REPLY
0
Entering edit mode

Hello brallen!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/13282/vcf-merge-fails-due-to-tabix-not-producing-tbi-files

This is typically not recommended as it runs the risk of annoying people in both communities.

I answered your question there and Kevin answered it here with similar answers. In effect, one of our time and effort have been wasted.

ADD REPLY
0
Entering edit mode

Hi, there RamRS

Truly sorry! Didn't think about how selfish that was. I'm new to this and having to teach myself, so was trying to reach as wide an audience as possible. Won't happen again.

ADD REPLY
2
Entering edit mode
3.9 years ago

Hi, there are a couple of issues here.

First, I think that you need bcftools merge, not vcf-merge. VCFtools, as a program suite, is essentially deprecated.

Second, you are attempting the merge on the indices themselves when, instead, you should be trying:

vcf-merge sample1.vcf.gz sample2.vcf.gz

Kevin

ADD COMMENT
0
Entering edit mode

Thank you, Kevin.

When I run bcftools merge it fails with the following standard error:

[W::bcf_hdr_parse] Could not parse header line: #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1.bam
[E::bcf_hdr_parse] Could not parse the header, sample line not found
Failed to open sample1.vcf.gz: could not parse header

So it appears 2 things are going on, a) the header can't be parsed, and b) there isn't a sample line in the header. Is there a good solution for either issue? Note: I've been replacing my actual sample ID's with "sample1" and "sample2" for clarity.

ADD REPLY
0
Entering edit mode

Hmm, it begs the question: how were these VCFs produced?

Out of interest, if you run BGZIP and then just use tabix -p vcf sample1.vcf.gz, does the same error appear?

What is the output of bcftools view -h sample1.vcf.gz?

ADD REPLY
0
Entering edit mode

Due to the size, I split the original vcf files by chromosome and removed all other contigs from the header. The originals don't have ##SAMPLE either.

I've learned the csi index must be used because all of my chromosomes (conifer) are too large for tbi (>2^29 bp). Would these csi files need to be incorporated into the bcftools merge sample1.vcf.gz sample2.vcf.gz command?

The bcftools view output is the same as bcftools merge

ADD REPLY
0
Entering edit mode

I see. Without actually seeing the header in one of the problematic files, and at least one dummy variant record, there's not much more that I am prepared to do, as this thread would just continue with me hypothesising what could be the issue. That is, we need a reproducible example to help further.

ADD REPLY
1
Entering edit mode

Thanks for all the help, Kevin! While I'm not sure what specifically caused the problem, it seems trimming the original vcf down to a single chromosome was the issue. When I removed INDELs, used bgzip, bcftools index, and bcftools merge, I was able to merge all 96 files without fail. Cheers!

ADD REPLY

Login before adding your answer.

Traffic: 3003 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6