Question: VCF-merge fails due to tabix not producing .tbi files
0
gravatar for brallen
5 months ago by
brallen20
brallen20 wrote:

I'm trying to use vcf-merge to combine 2 exome capture vcf files (~250K and ~330K in size) before trying it on all 96 samples. I'd appreciate any advice on the best way to do that! I've detailed what I've tried below. My issue seems to be with using tabix to convert the files to .tbi format.

Step 1: BGZIP

So far, I've zipped the files without issue:

bgzip sample1.vcf

bgzip sample2.vcf

Which produces:

sample1.vcf.gz

sample2.vcf.gz

Step 2: TABIX

When I try to use this command: tabix -h -p vcf sample1.vcf.gz the stderr is: Region 536999277..536999278 cannot be stored in a tbi index. Try using a csi index with min_shift = 14, n_lvls >= 6. tbx_index_build failed: sample1.vcf.gz

Using the -C option which works:

tabix -C -h -p vcf sample1.vcf.gz

tabix -C -h -p vcf sample2.vcf.gz

Which produces:

sample1.vcf.gz.csi

sample2.vcf.gz.csi

Step 3: VCF-MERGE

When I use this command: vcf-merge sample1.vcf.gz.csi sample2.vcf.gz.csi > out.vcf.gz the merge fails, and I get this stderr:

Broken VCF header, no column names?
 at /usr/share/perl5/Vcf.pm line 172, <__ANONIO__> line 1.
        Vcf::throw(Vcf4_2=HASH(0x5645a760bc38), "Broken VCF header, no column names?") called at /usr/share/perl5/Vcf.pm line 867
        VcfReader::_read_column_names(Vcf4_2=HASH(0x5645a760bc38)) called at /usr/share/perl5/Vcf.pm line 602
        VcfReader::parse_header(Vcf4_2=HASH(0x5645a760bc38)) called at /usr/bin/vcf-merge line 183
        main::init_cols(HASH(0x5645a761f438), Vcf4_2=HASH(0x5645a760b248)) called at /usr/bin/vcf-merge line 279
        main::merge_vcf_files(HASH(0x5645a761f438)) called at /usr/bin/vcf-merge line 12

If I exclude the tabix files, the merge still fails and the stderr says: The column names not tab-separated? Could not load .tbi index of sample1.vcf.gz. The command exited with an error. Is the file tabix indexed?

tabix vcf-merge vcf • 398 views
ADD COMMENTlink modified 5 months ago by Kevin Blighe66k • written 5 months ago by brallen20

Instead of using bold attribute:

Please use the formatting bar (especially the code option) to present your post better.
code_formatting

Thank you!

ADD REPLYlink written 5 months ago by genomax91k

Sorry about that! Fixed it.

ADD REPLYlink written 5 months ago by brallen20

Thanks! Easy to read this way.

ADD REPLYlink written 5 months ago by genomax91k

Hello brallen!

It appears that your post has been cross-posted to another site: https://bioinformatics.stackexchange.com/questions/13282/vcf-merge-fails-due-to-tabix-not-producing-tbi-files

This is typically not recommended as it runs the risk of annoying people in both communities.

I answered your question there and Kevin answered it here with similar answers. In effect, one of our time and effort have been wasted.

ADD REPLYlink modified 5 months ago • written 5 months ago by RamRS30k

Hi, there RamRS

Truly sorry! Didn't think about how selfish that was. I'm new to this and having to teach myself, so was trying to reach as wide an audience as possible. Won't happen again.

ADD REPLYlink written 5 months ago by brallen20
1
gravatar for Kevin Blighe
5 months ago by
Kevin Blighe66k
Kevin Blighe66k wrote:

Hi, there are a couple of issues here.

First, I think that you need bcftools merge, not vcf-merge. VCFtools, as a program suite, is essentially deprecated.

Second, you are attempting the merge on the indices themselves when, instead, you should be trying:

vcf-merge sample1.vcf.gz sample2.vcf.gz

Kevin

ADD COMMENTlink modified 5 months ago • written 5 months ago by Kevin Blighe66k

Thank you, Kevin.

When I run bcftools merge it fails with the following standard error:

[W::bcf_hdr_parse] Could not parse header line: #CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  sample1.bam
[E::bcf_hdr_parse] Could not parse the header, sample line not found
Failed to open sample1.vcf.gz: could not parse header

So it appears 2 things are going on, a) the header can't be parsed, and b) there isn't a sample line in the header. Is there a good solution for either issue? Note: I've been replacing my actual sample ID's with "sample1" and "sample2" for clarity.

ADD REPLYlink written 5 months ago by brallen20

Hmm, it begs the question: how were these VCFs produced?

Out of interest, if you run BGZIP and then just use tabix -p vcf sample1.vcf.gz, does the same error appear?

What is the output of bcftools view -h sample1.vcf.gz?

ADD REPLYlink written 5 months ago by Kevin Blighe66k

Due to the size, I split the original vcf files by chromosome and removed all other contigs from the header. The originals don't have ##SAMPLE either.

I've learned the csi index must be used because all of my chromosomes (conifer) are too large for tbi (>2^29 bp). Would these csi files need to be incorporated into the bcftools merge sample1.vcf.gz sample2.vcf.gz command?

The bcftools view output is the same as bcftools merge

ADD REPLYlink written 5 months ago by brallen20

I see. Without actually seeing the header in one of the problematic files, and at least one dummy variant record, there's not much more that I am prepared to do, as this thread would just continue with me hypothesising what could be the issue. That is, we need a reproducible example to help further.

ADD REPLYlink written 5 months ago by Kevin Blighe66k
1

Thanks for all the help, Kevin! While I'm not sure what specifically caused the problem, it seems trimming the original vcf down to a single chromosome was the issue. When I removed INDELs, used bgzip, bcftools index, and bcftools merge, I was able to merge all 96 files without fail. Cheers!

ADD REPLYlink written 5 months ago by brallen20
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1726 users visited in the last hour