Question: Gene symbol to mRNA sequence (refseq.rna)
0
gravatar for rrcutler
4.1 years ago by
rrcutler110
United States
rrcutler110 wrote:

I have designed code using modules from mygene.info for gene annotaion query (converting gene symbol to refseq.rna) and Biopython for refseq.rna reference to an mRNA fasta file. I have designed this specifically to find mRNA-seq specifically for the model organism Xenopus Laevis. 

While this works, it does not successfully find all the gene symbols that are input. For example using the input genes:

notch1, sox2, myf5, myod1, prmt5, wnt1, pax3, zic1, snai2, twist1, foxd3, chrd, bmp4, otx2, nog, lef1, dll1, tubb2b, neurog1, act3

I only can get a return of 11 mRNA sequences in the output fasta file. While all these genes are on Xenbase, I am confused as to why the program is unable to find them.

#Ronald Cutler - Saha Lab 150617
"""A program that takes gene symbol inputs in the Gene List.txt and
 then outputs respective mRNA sequences in the mRNA.fasta"""

import mygene
from Bio import Entrez

#initiate lists and mygene object
mg = mygene.MyGeneInfo()
ids = []
mRNA = []
not_found = []

#read gene symbols from inputfile to ids list
#change this directory if needed
inputfile = open('/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping/Gene List.txt', 'r')
for line in inputfile:
    ids.append(line.strip('\n'))
count1 = len(ids)

print('Gene sequence list:', ids)
print()
print('Numebr of input genes=', count1)
print() 
#executing mygene object
out = mg.querymany(ids, species='8355', scope = 'symbol, xenbase', fields='refseq.rna', size= 1)

#extracting mRNA NCBI references from dictionary
for dic in out:
    try:
        mRNA.append(dic[u'refseq.rna'])
    except KeyError:
        not_found.append(dic[u'query'])

#printing out symbols that could not be found
if len(not_found) > 0:
    for symbols in not_found:
        symbols.strip("u")

    print('These genes could not be found in the database', not_found)
print(mRNA)
mRNA = " ".join(mRNA)
count2 = len(mRNA.split())

#checking if all input genes were given respective mRNA NCBI reference
if count1 != count2:
    print("Warning, either an mRNA sequence was not found or there is a" \
    "duplicate entry of a gene ID. Please Check")
    
#Searching and outputing respective mRNA sequences
Entrez.email = "rrcutler@email.wm.edu"
handle =Entrez.efetch(db="nucleotide", id=mRNA, rettype="fasta", retmode="text")
result = handle.read()
handle.close()

#writing output file
#change this directory if needed
outputfile = open("/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping//mRNA.fasta", "w")
outputfile.write(result)

print("Number of output mRNA sequences=", count2)

inputfile.close()
outputfile.close()

rna-seq biopython xenbase mrna ncbi • 1.7k views
ADD COMMENTlink modified 4.0 years ago by Biostar ♦♦ 20 • written 4.1 years ago by rrcutler110

Specifically, the genes not found were: myod1', pax3', snai2', twist1', foxd3', bmp4', otx2', dll1', neurog1

ADD REPLYlink written 4.1 years ago by rrcutler110

is it possible you should have used 'pax3-a'?

ADD REPLYlink written 4.1 years ago by russhh4.6k

After trying that I received a result for that respective gene! Perhaps it's just a problem with input nomenclature then..

ADD REPLYlink written 4.1 years ago by rrcutler110

'Input nomenclature' makes it sound like a computer error rather than a human error.

You have to use the appropriate gene symbols.

ADD REPLYlink written 4.1 years ago by russhh4.6k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1668 users visited in the last hour