bcftools merge of 9000+ VCF files
11 months ago
Vanish007 ▴ 40

Hi all,

I have more than 9000 VCF files that I'm trying to merge using bcftools merge. Each file sits in its own folder, so I have one parent folder containing 9000+ subfolders, each holding a single vcf.gz file.

I tried the following command, taken from this tutorial:

bcftools merge ~/path/to/folders/*.vcf.gz -Oz -o Merged.vcf.gz

However, bcftools does not seem to recognize my command, since it simply prints its usage message:

About:   Merge multiple VCF/BCF files from non-overlapping sample sets to create one multi-sample file.
     Note that only records from different files can be merged, never from the same file. For
     "vertical" merge take a look at "bcftools norm" instead.
Usage:   bcftools merge [options] <A.vcf.gz> <B.vcf.gz> [...]

Options:
    --force-samples                resolve duplicate sample names
    --print-header                 print only the merged header and exit
    --use-header <file>            use the provided header
-0  --missing-to-ref               assume genotypes at missing sites are 0/0
-f, --apply-filters <list>         require at least one of the listed FILTER strings (e.g. "PASS,.")
-F, --filter-logic <x|+>           remove filters if some input is PASS ("x"), or apply all filters ("+") [+]
-g, --gvcf <-|ref.fa>              merge gVCF blocks, INFO/END tag is expected. Implies -i QS:sum,MinDP:min,I16:sum,IDV:max,IMF:max
-i, --info-rules <tag:method,..>   rules for merging INFO fields (method is one of sum,avg,min,max,join) or "-" to turn off the default [DP:sum,DP4:sum]
-l, --file-list <file>             read file names from the file
-m, --merge <string>               allow multiallelic records for <snps|indels|both|all|none|id>, see man page for details [both]
    --no-version                   do not append version and command line to the header
-o, --output <file>                write output to a file [standard output]
-O, --output-type <b|u|z|v>        'b' compressed BCF; 'u' uncompressed BCF; 'z' compressed VCF; 'v' uncompressed VCF [v]
-r, --regions <region>             restrict to comma-separated list of regions
-R, --regions-file <file>          restrict to regions listed in a file
    --threads <int>                number of extra output compression threads [0]

Any idea on what I'm doing wrong? Thanks!

vcf.gz merge bcftools • 1.6k views
11 months ago

Are all your VCF files indexed?

Also, retype your command from scratch in case there is an invisible character somewhere.

A safer way:

$ find /path/to/folders/ -type f -name "*.vcf.gz" > vcf.list
$ bcftools merge -O z -o merged.vcf.gz --file-list  vcf.list
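
Since the merge needs an index for every input, it can help to scan vcf.list for files with no index beside them before merging. The sketch below assumes the indexes sit next to the data files as `<file>.csi` or `<file>.tbi`; the function name `list_unindexed` is made up for illustration. It only checks that an index file exists, not that it is readable or up to date.

```shell
# Print every path in a list file that has no .csi or .tbi index next to it.
# Usage: list_unindexed vcf.list
list_unindexed() {
    local list="$1" f
    # IFS= and -r keep each line exactly as written in the list.
    while IFS= read -r f; do
        # Skip empty lines in the list.
        [ -z "$f" ] && continue
        if [ ! -e "$f.csi" ] && [ ! -e "$f.tbi" ]; then
            printf '%s\n' "$f"
        fi
    done < "$list"
}
```

For example, `list_unindexed vcf.list > missing_index.list` collects the paths that still need `bcftools index`.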

If you get a "too many open files" error, merge the VCFs in groups of 100 and then merge the results: group1.vcf.gz group2.vcf.gz ... groupN.vcf.gz
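
The grouped merge could be scripted roughly as follows. This is a minimal sketch, not a tested pipeline: the function name `merge_in_batches`, the temporary directory layout, and the default batch size of 100 are assumptions for illustration. It uses only the `bcftools merge` options shown in the usage message above plus `bcftools index -f`.

```shell
# Merge the VCFs named in a list file in batches, then merge the batch outputs.
# Usage: merge_in_batches <list> <output.vcf.gz> [batch_size]
merge_in_batches() {
    local list="$1" out="$2" size="${3:-100}"
    local tmp chunk n=0
    tmp="$(mktemp -d)" || return 1

    # Split the list into files holding at most $size paths each.
    split -l "$size" "$list" "$tmp/chunk."

    for chunk in "$tmp"/chunk.*; do
        n=$((n + 1))
        # Each intermediate file is compressed and indexed so it can be merged again.
        bcftools merge -O z -o "$tmp/group$n.vcf.gz" --file-list "$chunk" || return 1
        bcftools index -f "$tmp/group$n.vcf.gz" || return 1
        printf '%s\n' "$tmp/group$n.vcf.gz" >> "$tmp/groups.list"
    done

    # Final pass: merge group1.vcf.gz ... groupN.vcf.gz into the requested output.
    bcftools merge -O z -o "$out" --file-list "$tmp/groups.list" || return 1
    rm -rf "$tmp"
}
```

A call such as `merge_in_batches vcf.list merged.vcf.gz 100` would then replace the single merge command. One caveat: if the last batch (or the final pass) ends up with a single file, some bcftools versions may refuse to merge it, so pick a batch size that avoids that or handle the leftover file separately.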


Thanks Pierre, that does sound like a nice option. I rechecked a subset of the files, and bcftools reported that they were all already indexed. Re-running with your method produced the following:

Failed to open /path/to/folders/file3.vep.vcf.gz: could not load index 

Think it would be worth re-indexing these files?


When I try to force-overwrite the index with the following code:

for FILE in /path/to/folders/*/*.vcf.gz; do
bcftools index -f -o $FILE
done

I get this error for each file:

index: "-" is in a format that cannot be usefully indexed

I also tried re-compressing the vcf before re-indexing:

for FILE in /path/to/folders/*/*.vcf.gz; do
bcftools view -Oz -o $FILE
done

Which results in the error for each of the files:

Failed to open -: unknown file type

In case it makes a difference: I'm trying to use bcftools to merge separate .vcf.gz files that have already been annotated with Ensembl's VEP.


Do you have spaces in any of your paths?

You could try:

cat vcf.list | while read V; do bcftools index -f "${V}" ; done
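
The loop above reads vcf.list one line at a time into V and runs `bcftools index -f` on each path. A likely reason the earlier `for` loops failed (an interpretation, not something confirmed in this thread) is that `-o` takes the next argument as the output name, so `$FILE` was consumed as the output and bcftools fell back to reading standard input, which it reports as "-". Below is a slightly more defensive sketch of the same loop; the function name `index_from_list` is made up for illustration.

```shell
# Index every VCF named in a list file, one path per line.
# Usage: index_from_list vcf.list
index_from_list() {
    local list="$1" v
    # IFS= and -r keep spaces and backslashes in paths intact.
    while IFS= read -r v; do
        [ -z "$v" ] && continue
        # The file is passed as the positional argument; no -o is needed
        # because bcftools index writes the index next to the input by default.
        bcftools index -f "$v" || echo "indexing failed: $v" >&2
    done < "$list"
}
```

Quoting `"$v"` is what keeps paths with spaces or other unusual characters in one piece, and the message on stderr makes it easier to spot which file needs attention.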

Thanks Pierre, that seemed to do the trick! I don't have any spaces in the file names but there are dashes in them.

So it looks like you concatenate the list and pipe that into V and run bcftools index, correct?

Cool piece of code, thank you for all your help!
