Question

Gene based counts in scikit-allel

1

Entering edit mode

5.6 years ago

danaUM ▴ 10

Is it possible to use scikit-allel to generate per sample genotype counts per gene (from gff)?

I've loaded a gff file:

geneset = allel.FeatureTable.from_gff3('~/geneset.gtf', attributes = ["gene_id","gene_name"])

Converted to pandas table following tutorial:

def geneset_to_pandas(geneset):
    """Life is a bit easier when a geneset is a pandas DataFrame."""
    items = []
    for n in geneset.dtype.names:
        v = geneset[n]
        # convert bytes columns to unicode (which pandas then converts to object)
        if v.dtype.kind == 'S':
            v = v.astype('U')
        items.append((n, v))
    return pandas.DataFrame.from_dict(dict(items))

geneset = geneset_to_pandas(geneset)

Loaded vcf file and converted to genotype array:

callset = allel.read_vcf(called.vcf.gz)
gt = allel.GenotypeArray(callset['calldata/GT'])

Started iterating through geneset to create variables for use with query (of vcf file), but am having trouble aggregating the genotype counts by the positions in the geneset:

for index,row in geneset_Homo_sapiens_GRCh37_75.iterrows():
    end = row['end']
    start = row['start']
    chrom = row['seqid']
    gene_name = row['gene_name']
    for i,(x,y) in enumerate(zip(callset['variants/CHROM'],callset['variants/POS'])):
        if (var_chrom == chrom and var_pos >= start and var_pos <= end):
            print var_chrom, var_pos, gene_name, gt[i]

python scikit-allel vcf • 3.1k views

ADD COMMENT • link updated 5.6 years ago by alimanfoo ▴ 20 • written 5.6 years ago by danaUM ▴ 10

0

Entering edit mode

Add some more detail to your question, including what you've actually tried on your own.

ADD REPLY • link 5.6 years ago by Ram 43k

0

Entering edit mode

Moved the content of the comment back into the original question section.

ADD REPLY • link 5.6 years ago by danaUM ▴ 10

0

Entering edit mode

Please edit your question and add this in there.

ADD REPLY • link 5.6 years ago by Ram 43k

score 2 · Accepted Answer · 2018-08-28

Here's a gist with a worked example using human data. A few extracts...

To load data from a GFF3 file into a pandas DataFrame you can now use the scikit-allel gff3_to_dataframe() function, e.g.:

geneset = allel.gff3_to_dataframe('GRCh38_latest_genomic.gff.gz', attributes=['Name'])
# select only gene records
genes = geneset[geneset['type'] == 'gene']
# select only genes on Chromosome 22
genes_chr22 = genes[genes.seqid == 'NC_000022.11']

To extract genotypes for a given sample and a given gene, it's necessary to obtain the variant start and stop indices corresponding to the first and last variants within the gene. If the data are for a single chromosome, this can be done using a SortedIndex on the variant positions.

If you have previously parsed the VCF out into Zarr format grouped by chromosome as described here, assuming you are working with a single chromosome (e.g., Chromosome 22), this will set up a SortedIndex for all variants in the chromosome:

zarr_path = 'ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.zarr'
import zarr
callset = zarr.open(zarr_path, mode='r')
chrom = '22'
pos = allel.SortedIndex(callset[chrom]['variants/POS'])

This will load genotypes for a given sample over all variants in the chromosome:

# pick an arbitrary sample to work with
sample_idx = 42  # N.B., this is the 43rd sample, zero-based indexing
# load genotypes for the sample
gv = allel.GenotypeVector(callset[chrom]['calldata/GT'][:, sample_idx])

This will then iterate over genes and compute genotype counts for each gene:

# setup some arrays to hold per-gene genotype counts for our sample of interest
import numpy as np
n_genes = len(genes_chr22)
n_hom_ref = np.zeros(n_genes, dtype=int)
n_het = np.zeros(n_genes, dtype=int)
n_hom_alt = np.zeros(n_genes, dtype=int)
n_variants = np.zeros(n_genes, dtype=int)

# iterate over genes
for i, (_, gene) in enumerate(genes_chr22.iterrows()):
    try:
        # locate data for this gene - this maps genomic coordinates onto variant start and stop indices
        loc_gene = pos.locate_range(gene.start, gene.end)
    except KeyError:
        # no data for the gene, leave counts as zero
        pass
    else:
        # extract genotypes for the gene
        gv_gene = gv[loc_gene]
        # compute genotype counts
        n_hom_ref[i] = gv_gene.count_hom_ref()
        n_het[i] = gv_gene.count_het()
        n_hom_alt[i] = gv_gene.count_hom_alt()
        # also store number of variants in the gene
        n_variants[i] = loc_gene.stop - loc_gene.start