How can I retrieve the sequence of a gene and the start and end genomic position using Python
1
0
Entering edit mode
7.1 years ago
doncheng1 • 0

I have a list of genes, I would like to get the sequences of these genes as well as the start and end positions of the genes on the genome using python, I have tried using Biopython but could only get the sequences for the mRNA transcripts of the genes.

Thanks

Sequence Genome gene Position • 5.6k views
1
Entering edit mode

There's a biomart python module, you might just use it since all of this is available from biomart.

N.B., I have never used that module and have no clue how well it works, thus why this is a comment rather than an answer.

0
Entering edit mode

I would suggest you to include the header of such a list if genes. This will facilitate the answer

0
Entering edit mode

Hi Antonio,

The headers for the gene list are just the Entrez Gene ID and gene symbol,

I have figured out how to get the sequence for the gene but now I need to get the start and end locations on the genome for these sequences

1
Entering edit mode
7.1 years ago
samuelmiver ▴ 440

The fastest way I have found is extracting those features directly from a genbank file containing the genes of interest:

#!/usr/bin/env python

from Bio import SeqIO

def list_of_genes():
"""
Return genes list from file
"""

genes = []
with open("./your_gene_list", 'r') as fi:
for line in fi:
line = line.strip().split()
genes.append(line[0])

return genes

def extract_genes():
"""
Given a genbank file it extract the coordinates of genes:
name \t sequence \t start \t end
"""

genes = list_of_genes()

fo = open('./test.txt', 'w')

for rec in SeqIO.parse("./your_gbk_file.gbk", "genbank"):
if rec.features:
for feature in rec.features:
if feature.type == "gene":
if feature.qualifiers["locus_tag"][0] in genes:
seq = feature.extract(rec.seq)
fo.write(feature.qualifiers["locus_tag"][0]+'\t'str(seq)+'\t'+str(feature.location.start.position)+'\t'+str(feature.location.end.position)+'\n')

if __name__ == '__main__':
extract_genes()