Question

Find The Gene Length Of A Big List Of Genes

1

Entering edit mode

10.1 years ago

Floris Brenk ★ 1.0k

Hi all,

I have a list of hunderds of (human) gene transcripts and I would like to know what their average length is. This is an expression list and of the identified transcripts I found that this list of genes is influenced by the RIN value of the starting material. So the reason I would like to have the gene length is so I can investigate whether gene length is more prone or less prone for RNA degradation.

gene length • 8.4k views

ADD COMMENT • link updated 10.1 years ago by Jarretinha 3.4k • written 10.1 years ago by Floris Brenk ★ 1.0k

1

Entering edit mode

What format do you have your data in?

ADD REPLY • link 10.1 years ago by Michael 54k

0

Entering edit mode

Bed file format of the transcription start site

ADD REPLY • link 10.1 years ago by Floris Brenk ★ 1.0k

0

Entering edit mode

So you don't have end and exon coordinates? Are these novel transcripts or known transcripts?

ADD REPLY • link 10.1 years ago by Michael 54k

0

Entering edit mode

No only the TSS and these are known genes only to start with yes

ADD REPLY • link 10.1 years ago by Floris Brenk ★ 1.0k

1

Entering edit mode

Would it then suffice to use the lengths of already annotated transcripts? Then you could download the transcript annotations from ucsc or ensembl as gff by using the transcript ids. If you do not have the transcript ids but only positions you could search all transcripts for overlap with or closest transcripts to your TSS.

ADD REPLY • link 10.1 years ago by Michael 54k

1

Entering edit mode

Do you have gene IDs or Symbols? Do you need to know the length of the gene or the (average) length of known transcripts for that gene?

ADD REPLY • link 10.1 years ago by Jarretinha 3.4k

0

Entering edit mode

He sorry for late reply just saw it now. I have the gene offical symbols and would like to know the average lenght of known transcripts of the gene yes.

ADD REPLY • link 10.1 years ago by Floris Brenk ★ 1.0k

score 2 · Answer 1 · 2014-03-06

Well, answer is a little bit involved and uses biopython. First, we need to convert gene symbols do Entrez Gene IDs. I do this by brute force mapping. Get the gene_info from NCBI. Convert the file to some form of delimiter separated values file (I use tab separated values with header).

import csv

# Creates the mapping
with open('./data/Homo_sapiens.gene_info.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    mapping = dict()
    for row in reader:
        syn = row['Synonyms']
        syn = syn.split('|')
        tag = [row['GeneID'], row['Symbol']]
        for s in syn:
         mapping[s] = tag
         mapping[row['Symbol']] = tag

with open('your_genes.txt') as f, open('mapped.tsv', 'w') as w:
    header = ['GeneID', 'Symbol']
    writer = csv.writer(w, header, delimiter='\t')
    writer.writerow(header)
    for line in f:
        gene_symbol = line.strip()
        if gene_symbol in mapping.keys():
            writer.writerow(mapping[gene_symbol])
        else:
            print 'Not found: ' + gene_symbol

It's important to stress that GeneCards and NCBI not always share the same gene symbol/synonyms. Now, we can proceed to the fun part:

import csv
from Bio import Entrez, SeqIO

def fetch_mRNA_lengths(GeneID):
    """Get all mRNAs for a given GeneID from Entrez Gene and prints its accessions, GIs and lengths"""
    try:
        handle = Entrez.efetch(db='gene', id=GeneID, retmode='xml')
        record = Entrez.read(handle)[0]
        # The first Entrezgene_locus is for the GRCh38 Primary Assembly
        locus = record['Entrezgene_locus'][0]
        products = locus['Gene-commentary_products']
        for product in products:
            # 3 is for mRNA
            if product['Gene-commentary_type'] == '3':
                accn = product['Gene-commentary_accession']
                gi = product['Gene-commentary_seqs'][0]['Seq-loc_whole']['Seq-id']['Seq-id_gi']
                handle = Entrez.efetch(db='nucleotide', id=gi, rettype='gb')
                record = SeqIO.read(handle, 'gb')
                print GeneID, accn, gi, len(record)
    except:
        print 'Not found: ' + GeneID

Entrez.email = "insert.your@email.here"

with open('mapped.tsv') as f:
    reader = csv.DictReader(f, delimiter='\t')
    for row in reader:
        fetch_mRNA_lengths(str(row['GeneID']))

Have fun!!!