Is it possible to use scikit-allel to generate per sample genotype counts per gene (from gff)?
I've loaded a gff file:
geneset = allel.FeatureTable.from_gff3('~/geneset.gtf', attributes = ["gene_id","gene_name"])
Converted to pandas table following tutorial:
def geneset_to_pandas(geneset):
"""Life is a bit easier when a geneset is a pandas DataFrame."""
items = []
for n in geneset.dtype.names:
v = geneset[n]
# convert bytes columns to unicode (which pandas then converts to object)
if v.dtype.kind == 'S':
v = v.astype('U')
items.append((n, v))
return pandas.DataFrame.from_dict(dict(items))
geneset = geneset_to_pandas(geneset)
Loaded vcf file and converted to genotype array:
callset = allel.read_vcf(called.vcf.gz)
gt = allel.GenotypeArray(callset['calldata/GT'])
Started iterating through geneset to create variables for use with query (of vcf file), but am having trouble aggregating the genotype counts by the positions in the geneset:
for index,row in geneset_Homo_sapiens_GRCh37_75.iterrows():
end = row['end']
start = row['start']
chrom = row['seqid']
gene_name = row['gene_name']
for i,(x,y) in enumerate(zip(callset['variants/CHROM'],callset['variants/POS'])):
if (var_chrom == chrom and var_pos >= start and var_pos <= end):
print var_chrom, var_pos, gene_name, gt[i]
Add some more detail to your question, including what you've actually tried on your own.
Moved the content of the comment back into the original question section.
Please edit your question and add this in there.