Gene based counts in scikit-allel
1
1
Entering edit mode
5.6 years ago
danaUM ▴ 10

Is it possible to use scikit-allel to generate per sample genotype counts per gene (from gff)?

I've loaded a gff file:

geneset = allel.FeatureTable.from_gff3('~/geneset.gtf', attributes = ["gene_id","gene_name"])

Converted to pandas table following tutorial:

def geneset_to_pandas(geneset):
    """Life is a bit easier when a geneset is a pandas DataFrame."""
    items = []
    for n in geneset.dtype.names:
        v = geneset[n]
        # convert bytes columns to unicode (which pandas then converts to object)
        if v.dtype.kind == 'S':
            v = v.astype('U')
        items.append((n, v))
    return pandas.DataFrame.from_dict(dict(items))

geneset = geneset_to_pandas(geneset)

Loaded vcf file and converted to genotype array:

callset = allel.read_vcf(called.vcf.gz)
gt = allel.GenotypeArray(callset['calldata/GT'])

Started iterating through geneset to create variables for use with query (of vcf file), but am having trouble aggregating the genotype counts by the positions in the geneset:

for index,row in geneset_Homo_sapiens_GRCh37_75.iterrows():
    end = row['end']
    start = row['start']
    chrom = row['seqid']
    gene_name = row['gene_name']
    for i,(x,y) in enumerate(zip(callset['variants/CHROM'],callset['variants/POS'])):
        if (var_chrom == chrom and var_pos >= start and var_pos <= end):
            print var_chrom, var_pos, gene_name, gt[i]
python scikit-allel vcf • 3.1k views
ADD COMMENT
0
Entering edit mode

Add some more detail to your question, including what you've actually tried on your own.

ADD REPLY
0
Entering edit mode

Moved the content of the comment back into the original question section.

ADD REPLY
0
Entering edit mode

Please edit your question and add this in there.

ADD REPLY
2
Entering edit mode
5.6 years ago
alimanfoo ▴ 20

Here's a gist with a worked example using human data. A few extracts...

To load data from a GFF3 file into a pandas DataFrame you can now use the scikit-allel gff3_to_dataframe() function, e.g.:

geneset = allel.gff3_to_dataframe('GRCh38_latest_genomic.gff.gz', attributes=['Name'])
# select only gene records
genes = geneset[geneset['type'] == 'gene']
# select only genes on Chromosome 22
genes_chr22 = genes[genes.seqid == 'NC_000022.11']

To extract genotypes for a given sample and a given gene, it's necessary to obtain the variant start and stop indices corresponding to the first and last variants within the gene. If the data are for a single chromosome, this can be done using a SortedIndex on the variant positions.

If you have previously parsed the VCF out into Zarr format grouped by chromosome as described here, assuming you are working with a single chromosome (e.g., Chromosome 22), this will set up a SortedIndex for all variants in the chromosome:

zarr_path = 'ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.zarr'
import zarr
callset = zarr.open(zarr_path, mode='r')
chrom = '22'
pos = allel.SortedIndex(callset[chrom]['variants/POS'])

This will load genotypes for a given sample over all variants in the chromosome:

# pick an arbitrary sample to work with
sample_idx = 42  # N.B., this is the 43rd sample, zero-based indexing
# load genotypes for the sample
gv = allel.GenotypeVector(callset[chrom]['calldata/GT'][:, sample_idx])

This will then iterate over genes and compute genotype counts for each gene:

# setup some arrays to hold per-gene genotype counts for our sample of interest
import numpy as np
n_genes = len(genes_chr22)
n_hom_ref = np.zeros(n_genes, dtype=int)
n_het = np.zeros(n_genes, dtype=int)
n_hom_alt = np.zeros(n_genes, dtype=int)
n_variants = np.zeros(n_genes, dtype=int)

# iterate over genes
for i, (_, gene) in enumerate(genes_chr22.iterrows()):
    try:
        # locate data for this gene - this maps genomic coordinates onto variant start and stop indices
        loc_gene = pos.locate_range(gene.start, gene.end)
    except KeyError:
        # no data for the gene, leave counts as zero
        pass
    else:
        # extract genotypes for the gene
        gv_gene = gv[loc_gene]
        # compute genotype counts
        n_hom_ref[i] = gv_gene.count_hom_ref()
        n_het[i] = gv_gene.count_het()
        n_hom_alt[i] = gv_gene.count_hom_alt()
        # also store number of variants in the gene
        n_variants[i] = loc_gene.stop - loc_gene.start
ADD COMMENT

Login before adding your answer.

Traffic: 2394 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6