Merging multiple .bed/.bgen files from UK Biobank using PLINK
9 months ago
Agamemnon ▴ 40

Hi All,

I am having a problem merging the per-chromosome UK Biobank files. I ran the following command:

plink2 \
--bfile /path/to/file/ukb_imp_chr1 \
--pmerge-list /path/to/file/merge.list \
--maf 0.01 \
--hwe 1e-6 \
--make-pgen \
--out /path/to/file/ukb_imp_allchr

I also tried

plink2 \
--bfile /path/to/file/ukb_imp_chr1 \
--pmerge-list /path/to/file/merge.list \
--maf 0.01 \
--hwe 1e-6 \
--make-bed \
--out /path/to/file/ukb_imp_allchr

The merge.list has the following content from chromosome 2 onwards.

/path/to/file/ukb_imp_chr2
/path/to/file/ukb_imp_chr3
/path/to/file/ukb_imp_chr4
...
/path/to/file/ukb_imp_chr22
/path/to/file/ukb_imp_chr23
/path/to/file/ukb_imp_chr24
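
A list like this does not have to be typed by hand. As a minimal sketch, assuming the ukb_imp_chrN prefix pattern shown above and a list written to the current directory (the variable names prefix_dir and merge_list are illustrative, not part of the original command), a short loop can generate it:

```shell
# Hypothetical helper: write one fileset prefix per line for chromosomes 2..24.
# The directory and the ukb_imp_chrN naming are taken from the post; adjust as needed.
prefix_dir="/path/to/file"
merge_list="merge.list"

: > "$merge_list"    # create the list, or empty it if it already exists
for chr in $(seq 2 24); do
    printf '%s/ukb_imp_chr%s\n' "$prefix_dir" "$chr" >> "$merge_list"
done

wc -l < "$merge_list"    # prints how many filesets were listed
```

Chromosome 1 is left out because it is passed as the main fileset on the command line.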

However, once I run the command, I do not get a merged .bed file. I only get a .psam file with the following result:

Using up to 24 threads (change this with --threads).
--pmerge-list: 24 filesets specified (including main fileset).
--pmerge-list: 487409 samples and 1 phenotype present.
--pmerge-list: Merged .psam written to
/path/to/file/ukb_imp_allchr-merge.psam .

Is there something wrong with the command?

In both cases (--make-pgen or --make-bed), I get only the .psam file and nothing else.

Also, is it possible to export to bgen/pgen to reduce file size? The individual per-chromosome .bed files range from 200 to 920 GB.

9 months ago

Unfortunately, I believe merging of pgen files is not yet fully implemented in PLINK 2. As a workaround, I converted my data to VCF and merged using bcftools concat, as follows:

for f in *.vcf; do bcftools concat $f -o ukb_imp_merge.vcf; done

I then converted vcf files back to PLINK using:

./plink2 --vcf ukb_imp_merge.vcf --out binary


The alternative is to make .pgen files first and then use --pmerge-list. As an intermediate format, are VCF or pgen files smaller?


Hi, I have since noticed that you're trying to merge bed files. You can try the --merge-list option (with PLINK 1.9) first, then convert the result to a PLINK 2 pgen:

./plink --bfile /path/to/file/ukb_imp_chr1 --merge-list /path/to/file/merge.list --make-bed --out merged_chrs

./plink2 --bfile merged_chrs --make-pgen --out filename

Or, if you want to merge pgen files, convert all of your individual chromosome data to pgen first, then try merging:

./plink2 --pfile /path/to/file/ukb_imp_chr1 --pmerge-list /path/to/file/merge.list --out merged_chrs

Pgen files are similar to bed files, except that they can also store dosage information for the SNPs (so in a way they are more similar to VCF files).

Let me know if the pgen merge works after conversion. I couldn't get it to work for my data, hence the workaround using bcftools. Even so, I believe it is easier to keep track of ref/alt alleles when merging in VCF/BCF format (especially if this is post-imputation). It is also generally faster.
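
The per-chromosome conversion to pgen could be scripted along these lines. This is only a sketch: the ukb_imp_chrN prefixes come from the question, while the 1..22 range, the run helper and the DRY_RUN switch are assumptions added here so the commands can be inspected before anything is executed.

```shell
# Sketch: one plink2 conversion per chromosome, .bed/.bim/.fam -> .pgen/.pvar/.psam.
# DRY_RUN=1 (the default here) only prints each command; set DRY_RUN=0 to run them.
prefix_dir="/path/to/file"
DRY_RUN="${DRY_RUN:-1}"

# Hypothetical helper: print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

for chr in $(seq 1 22); do
    run plink2 --bfile "$prefix_dir/ukb_imp_chr$chr" \
        --make-pgen \
        --out "$prefix_dir/ukb_imp_chr$chr"
done
```

Because the output prefix matches the input prefix, the same merge.list can then be used with --pmerge-list.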

9 months ago
Agamemnon ▴ 40

I was able to convert the files to pgen but then I get the following output when trying to merge.

The biallelic variants with ID 'rs6657544' at position 1:1186665 in
/path/to/file/merged_bgen/ukb_imp_chr1.pvar appear to be
the components of a 'split' multiallelic variant; if so, it must be 'joined'
(with e.g. "bcftools norm -m") before a correct merge can occur. If you are
SURE that your data does not contain any same-position same-ID variant groups
that should be joined, you can suppress this error with
--multiallelics-already-joined.

Can bcftools be used directly on .pvar files? The following command didn't work for me:

bcftools norm -m /path/to/file/merged_bgen/ukb_imp_chr1.pvar
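
Before deciding how to handle the error, it may help to see how many variant groups are affected. The sketch below is an illustration rather than anything PLINK provides: it assumes a .pvar with the usual #CHROM POS ID ... column order and header lines starting with '#', and the function name list_split_candidates is made up for this example. It lists every chromosome/position/ID combination that occurs more than once, together with its count.

```shell
# Hypothetical helper: report same-position, same-ID variant groups in a .pvar file.
# Assumes columns 1-3 are CHROM, POS and ID, and that header lines start with '#'.
list_split_candidates() {
    awk '!/^#/ { n[$1 "\t" $2 "\t" $3]++ }
         END  { for (k in n) if (n[k] > 1) print k "\t" n[k] }' "$1"
}

# Example call (path from the error message above):
#   list_split_candidates /path/to/file/merged_bgen/ukb_imp_chr1.pvar
```

The output order is not sorted, since awk iterates over its array in no particular order; piping through sort would fix that if needed.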


I don't think so; bcftools needs your data in VCF or BCF format (e.g. bgzipped .vcf.gz). I would do the following:

  • Convert your files from PLINK to VCF: plink --bfile chr1 --real-ref-alleles --recode vcf --out chr1
  • Sort and compress your files: for i in {1..22}; do bcftools sort chr$i.vcf -Oz -o chr$i.vcf.gz; done
  • Merge (apologies for my previous answer's loop, which I couldn't get to work; I ended up typing out all the chromosomes, so any suggestions for doing this are welcome): bcftools concat chr1.vcf.gz chr2.vcf.gz chr3.vcf.gz etc. -Oz -o allchromosomes.vcf.gz
  • Convert back to PLINK :) PLINK 2 takes VCF files, including compressed .vcf.gz, so unzipping first should not be needed.

More details on bcftools concat are here: https://samtools.github.io/bcftools/bcftools.html#concat


I am actually recoding to BCF, but if that fails I will try VCF instead. I am using the following command:

    for i in $(seq 1 22)
    do
    /path/to/file/plink2 \
    --bgen /path/to/file/ukb_imp_chr$i.bgen ref-first \
    --sample /path/to/file/ukb_imp_chr$i.sample \
    --mind 0.01 \
    --geno 0.01 \
    --maf 0.01 \
    --hwe 1e-6 \
    --export bcf \
    --out /path/to/file/ukb_imp_chr$i
    done

As for the for loop failing with bcftools concat on multiple chromosomes, I think the logic is wrong: it runs bcftools on each chromosome file separately instead of on all of them together. Building the file list first, using two variables (files and i), may do the trick.

See if this works for you?

files=$(for i in $(seq 1 22);do echo /path/to/file/chr$i.vcf.gz;done)
bcftools concat $files -Oz -o allchromosomes.vcf.gz
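
A slightly more defensive variant of the same idea, offered as a sketch only: it collects the files in chromosome order, reports any that are missing, and calls bcftools concat only when all 22 are present. The vcf_dir and missing variables are illustrative names, and the chrN.vcf.gz naming is taken from the commands above.

```shell
# Sketch: build the input list in chromosome order and check that every file exists.
vcf_dir="${vcf_dir:-/path/to/file}"
missing=0

set --    # reset the positional parameters; they are used here as the file list
for i in $(seq 1 22); do
    f="$vcf_dir/chr$i.vcf.gz"
    if [ -f "$f" ]; then
        set -- "$@" "$f"    # append, which keeps the order 1..22
    else
        echo "missing: $f" >&2
        missing=$((missing + 1))
    fi
done

# Concatenate only when nothing is missing, so a partial merge is never written.
if [ "$missing" -eq 0 ]; then
    bcftools concat "$@" -Oz -o allchromosomes.vcf.gz
fi
```

Passing the files as "$@" rather than an unquoted string also keeps paths containing spaces intact.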