bcftools merge of 9000+ vcf files
2.4 years ago
Vanish007 ▴ 40

Hi all,

I have 9000+ vcf files that I'm trying to merge using bcftools merge. Each file is in its own folder, so essentially I have one folder containing 9000+ subfolders, each containing one vcf.gz file.

I tried the following command from this tutorial:

bcftools merge ~/path/to/folders/*.vcf.gz -Oz -o Merged.vcf.gz

However, bcftools does not seem to recognize my command; instead of merging, it just prints its usage text:

About:   Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file.
     Note that only records from different files can be merged, never from the same file. For
     "vertical" merge take a look at "bcftools norm" instead.
Usage:   bcftools merge [options] <A.vcf.gz> <B.vcf.gz> [...]

    --force-samples                resolve duplicate sample names
    --print-header                 print only the merged header and exit
    --use-header <file>            use the provided header
-0  --missing-to-ref               assume genotypes at missing sites are 0/0
-f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
-F, --filter-logic <x|+>           remove filters if some input is PASS ("x"), or apply all filters ("+") [+]
-g, --gvcf <-|ref.fa>              merge gVCF blocks, INFO/END tag is expected. Implies -i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max
-i, --info-rules <tag:method,..>   rules for merging INFO fields (method is one of sum,avg,min,max,join) or "-" to turn off the default [DP:sum,DP4:sum]
-l, --file-list <file>             read file names from the file
-m, --merge <string>               allow multiallelic records for <snps|indels|both|all|none|id>, see man page for details [both]
    --no-version                   do not append version and command line to the header
-o, --output <file>                write output to a file [standard output]
-O, --output-type <b|u|z|v>        'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
-r, --regions <region>             restrict to comma-separated list of regions
-R, --regions-file <file>          restrict to regions listed in a file
    --threads <int>                number of extra output compression threads [0]

Any idea on what I'm doing wrong? Thanks!

2.4 years ago

Are all your vcf files indexed?

Also, retype your command from scratch in case there is an invisible character somewhere.

A safer way:

$ find /path/to/folders/ -type f -name "*.vcf.gz" > vcf.list
$ bcftools merge -O z -o merged.vcf.gz --file-list vcf.list

If you get a "too many open files" error, merge the files in groups of 100, then merge the resulting group1.vcf.gz group2.vcf.gz ... groupN.vcf.gz into the final file.
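
The two-stage merge described above can be sketched as a small shell function. This is only a sketch under assumptions: the function name `merge_in_batches`, its arguments, and the temporary-file layout are made up for illustration; it assumes `bcftools` is on the PATH and that every input file is already indexed. Depending on the bcftools version, a batch that ends up with a single file may be rejected, so pick a batch size that avoids that.

```shell
# merge_in_batches LIST BATCH_SIZE OUT   (hypothetical helper, not part of bcftools)
merge_in_batches() {
    local list="$1" size="$2" out="$3"
    local tmp chunk group
    tmp=$(mktemp -d) || return 1

    # Split the list into chunks of at most $size file names each.
    split -l "$size" "$list" "$tmp/chunk." || return 1

    # First pass: one intermediate multi-sample VCF per chunk,
    # indexed so that it can be merged again in the second pass.
    for chunk in "$tmp"/chunk.*; do
        group="$chunk.vcf.gz"
        bcftools merge -O z -o "$group" --file-list "$chunk" || return 1
        bcftools index -f "$group" || return 1
        echo "$group" >> "$tmp/groups.list"
    done

    # Second pass: merge the intermediate files into the final output.
    bcftools merge -O z -o "$out" --file-list "$tmp/groups.list" || return 1

    # Intermediate files are only removed after a successful final merge.
    rm -rf "$tmp"
}
```

With the list from the `find` command, a call would look like `merge_in_batches vcf.list 100 merged.vcf.gz`.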

Thanks Pierre, that does sound like a nice option. I rechecked a subset of files to see if they were indexed and bcftools returned that all files were already indexed. Re-running your method produced the following:

Failed to open /path/to/folders/file3.vep.vcf.gz: could not load index 

Think it would be worth re-indexing these files?

When I try to force-overwrite the index with the following code:

for FILE in /path/to/folders/*/*.vcf.gz; do
bcftools index -f -o $FILE
done

I get this error for each file:

index: "-" is in a format that cannot be usefully indexed

I also tried re-compressing the vcf before re-indexing:

for FILE in /path/to/folders/*/*.vcf.gz; do
bcftools view -Oz -o $FILE
done

Which results in the error for each of the files:

Failed to open -: unknown file type

In case it makes a difference: I'm trying to use bcftools to merge separate .vcf.gz files that have already had Ensembl's VEP run on them.
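
One way to narrow down which files are behind a "could not load index" message is to check whether an index file sits next to each path in the list. The sketch below assumes the list is the vcf.list created earlier and that indexes use the default naming (`<file>.csi`, or `<file>.tbi` for tabix-style indexes); the function name `list_unindexed` is hypothetical. It only detects missing or empty index files, not stale or corrupt ones.

```shell
# list_unindexed LIST   (hypothetical helper)
# Prints every path in LIST that has neither a .csi nor a .tbi index next to it.
list_unindexed() {
    local list="$1" vcf
    while IFS= read -r vcf; do
        [ -n "$vcf" ] || continue            # skip blank lines
        # -s is true only for files that exist and are not empty
        if [ ! -s "$vcf.csi" ] && [ ! -s "$vcf.tbi" ]; then
            printf '%s\n' "$vcf"
        fi
    done < "$list"
}
```

For example, `list_unindexed vcf.list > missing_index.list` collects the paths that still need `bcftools index`.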

Do you have spaces in any of your paths?

You could try:

cat vcf.list | while read V; do bcftools index -f "${V}" ; done

Thanks Pierre, that seemed to do the trick! I don't have any spaces in the file names but there are dashes in them.

So it looks like you cat the list, pipe it into the loop so that each line ends up in V, and run bcftools index on it, correct?

Cool piece of code, thank you for all your help!
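
To spell out what the loop does: `read` takes one line of vcf.list at a time into `V`, and `bcftools index -f "${V}"` receives that path as its input file. The earlier `for` loops most likely failed because `-o $FILE` handed the path to the output option, leaving bcftools with no input file, so it fell back to standard input (the `-` in the error messages). A minimal sketch of the same loop without `cat`, assuming `bcftools` is on the PATH; the function name `index_from_list` is hypothetical:

```shell
# index_from_list LIST   (hypothetical helper)
# Runs "bcftools index -f" on every path listed in LIST, one path per line.
index_from_list() {
    local list="$1" vcf
    # IFS= and -r keep each line exactly as written; quoting "$vcf"
    # keeps paths containing spaces or dashes in one piece.
    while IFS= read -r vcf; do
        [ -n "$vcf" ] || continue            # skip blank lines
        bcftools index -f "$vcf" || echo "indexing failed: $vcf" >&2
    done < "$list"
}
```

Usage would be `index_from_list vcf.list`, followed by the `bcftools merge ... --file-list vcf.list` command from above.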


