How can I retrieve the sequence of a gene and the start and end genomic position using Python
1
0
Entering edit mode
7.1 years ago
doncheng1 • 0

I have a list of genes, I would like to get the sequences of these genes as well as the start and end positions of the genes on the genome using python, I have tried using Biopython but could only get the sequences for the mRNA transcripts of the genes.

Thanks

Sequence Genome gene Position • 5.6k views
ADD COMMENT
1
Entering edit mode

There's a biomart python module, you might just use it since all of this is available from biomart.

N.B., I have never used that module and have no clue how well it works, thus why this is a comment rather than an answer.

ADD REPLY
0
Entering edit mode

I would suggest you to include the header of such a list if genes. This will facilitate the answer

ADD REPLY
0
Entering edit mode

Hi Antonio,

The headers for the gene list are just the Entrez Gene ID and gene symbol,

I have figured out how to get the sequence for the gene but now I need to get the start and end locations on the genome for these sequences

ADD REPLY
1
Entering edit mode
7.1 years ago
samuelmiver ▴ 440

The fastest way I have found is extracting those features directly from a genbank file containing the genes of interest:

#!/usr/bin/env python

from Bio import SeqIO

def list_of_genes():
    """
    Return genes list from file
    """

    genes = []
    with open("./your_gene_list", 'r') as fi:
         for line in fi:
             line = line.strip().split()
             genes.append(line[0])

    return genes

def extract_genes():
    """
    Given a genbank file it extract the coordinates of genes:
    name \t sequence \t start \t end
    """

    genes = list_of_genes()

    fo = open('./test.txt', 'w')

    for rec in SeqIO.parse("./your_gbk_file.gbk", "genbank"):
        if rec.features:
            for feature in rec.features:
                if feature.type == "gene":
                    if feature.qualifiers["locus_tag"][0] in genes:
                        seq = feature.extract(rec.seq)
                        fo.write(feature.qualifiers["locus_tag"][0]+'\t'str(seq)+'\t'+str(feature.location.start.position)+'\t'+str(feature.location.end.position)+'\n')

if __name__ == '__main__':
    extract_genes()
ADD COMMENT

Login before adding your answer.

Traffic: 1957 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6