How do you see the genotypes for one variant in bgen files in Python?
2
1
Entering edit mode
9 months ago
hla_help ▴ 10

I have some bgen and correpsonding samples files and would like to end up with a pandas dataframe of one column sample IDs and another columns with genotypes for one variant for each sample. Does anyone know any tools that can accomplish this?

I have looked at bgen-reader but it doesn't seem to have this usability.

Thank you.

python pandas bgen • 630 views
ADD COMMENT
0
Entering edit mode
9 months ago
carlk ▴ 40

Hi, I'm one of the authors of bgen-reader and would be happy to try to help.

Please take a look at this documentation page and see if it is useful. It covers using sample files.

In the meantime, I'll see if I can figure it out. (I haven't looked at bgen-reader in a while.) - you can also send me an email if you like.

Carl

Carl Kadie carlk@msn.com

ADD COMMENT
0
Entering edit mode
9 months ago
carlk ▴ 40

A nice feature of bgen-reader is that it will create an index for the bgen file. This means that while the first 'open' might take 10 minutes, every 'open after that is instantaneous. Also, reads from just one SNP then take less than 1/10 of a second, even on giant files.

Here is what we came up with:

import numpy as np
import pandas as pd

from bgen_reader import example_filepath, open_bgen

# bgen-reader docs: https://bgen-reader.readthedocs.io/en/latest/numpyapi.html

# Use this sample small file or download the large sample file
# example_filepath("haplotypes.bgen") # little sample file
# snp = "RS3"
# Download from https://www.dropbox.com/sh/vuzozn39vsw8zcl/AAAZT2aRMB3V8kdz6CzlmdY-a?dl=0
file = r"c:\deldir\merged_487400x4840000.bgen" # large file
snp = "sid_5_120000"

# The first time you open the file, it will be indexed and a .metadata2.mmm file will be created.
# This will take about as long as a file copy. About 10 minutes for the big file on my very fast SSD.
# Subsequent openings will be instantaneous.
bgen = open_bgen(file, verbose=False)

variant_index = bgen.rsids==snp
p = bgen.read((variant_index)) # read a single variant
print(p.shape) # => (487400, 1, 3)
print(bgen.allele_ids[variant_index]) # for example ['A,C']
# probabilities for AA,AC,CC
df1 = pd.DataFrame({'sample':bgen.samples,'0':p[:,0,0],'1':p[:,0,1],'2':p[:,0,2]})
del bgen # close the file
print(df1.head())
  • Carl

https://github.com/CarlKCarlK

ADD COMMENT

Login before adding your answer.

Traffic: 1690 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6