Gene symbol to mRNA sequence (refseq.rna)
0
0
Entering edit mode
8.8 years ago
rrcutler ▴ 120

I have designed code using modules from mygene.info for gene annotation query (converting gene symbol to refseq.rna) and Biopython for refseq.rna reference to an mRNA fasta file. I have designed this specifically to find mRNA-seq specifically for the model organism Xenopus Laevis.

While this works, it does not successfully find all the gene symbols that are input. For example using the input genes:

notch1, sox2, myf5, myod1, prmt5, wnt1, pax3, zic1, snai2, twist1, foxd3, chrd, bmp4, otx2, nog, lef1, dll1, tubb2b, neurog1, act3

I only can get a return of 11 mRNA sequences in the output fasta file. While all these genes are on Xenbase, I am confused as to why the program is unable to find them.

#Ronald Cutler - Saha Lab 150617
"""A program that takes gene symbol inputs in the Gene List.txt and
 then outputs respective mRNA sequences in the mRNA.fasta"""

import mygene
from Bio import Entrez

#initiate lists and mygene object
mg = mygene.MyGeneInfo()
ids = []
mRNA = []
not_found = []

#read gene symbols from inputfile to ids list
#change this directory if needed
inputfile = open('/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping/Gene List.txt', 'r')
for line in inputfile:
    ids.append(line.strip('\n'))
count1 = len(ids)

print('Gene sequence list:', ids)
print()
print('Numebr of input genes=', count1)
print() 
#executing mygene object
out = mg.querymany(ids, species='8355', scope = 'symbol, xenbase', fields='refseq.rna', size= 1)

#extracting mRNA NCBI references from dictionary
for dic in out:
    try:
        mRNA.append(dic[u'refseq.rna'])
    except KeyError:
        not_found.append(dic[u'query'])

#printing out symbols that could not be found
if len(not_found) > 0:
    for symbols in not_found:
        symbols.strip("u")`

    print('These genes could not be found in the database', not_found)
print(mRNA)
mRNA = " ".join(mRNA)
count2 = len(mRNA.split())

#checking if all input genes were given respective mRNA NCBI reference
if count1 != count2:
    print("Warning, either an mRNA sequence was not found or there is a" \
    "duplicate entry of a gene ID. Please Check")

#Searching and outputing respective mRNA sequences
Entrez.email = "rrcutler@email.wm.edu"
handle =Entrez.efetch(db="nucleotide", id=mRNA, rettype="fasta", retmode="text")
result = handle.read()
handle.close()

#writing output file
#change this directory if needed
outputfile = open("/Users/confocal/Desktop/Ronald Cutler/Targeted RNA Expression Programs/Gene symbol to mRNA-seq mapping//mRNA.fasta", "w")
outputfile.write(result)

print("Number of output mRNA sequences=", count2)

inputfile.close()
outputfile.close()
mRNA Xenbase Biopython NCBI RNA-Seq • 2.9k views
ADD COMMENT
0
Entering edit mode

Specifically, the genes not found were: myod1', pax3', snai2', twist1', foxd3', bmp4', otx2', dll1', neurog1

ADD REPLY
0
Entering edit mode

is it possible you should have used 'pax3-a'?

ADD REPLY
0
Entering edit mode

After trying that I received a result for that respective gene! Perhaps it's just a problem with input nomenclature then..

ADD REPLY
0
Entering edit mode

'Input nomenclature' makes it sound like a computer error rather than a human error.

You have to use the appropriate gene symbols.

ADD REPLY

Login before adding your answer.

Traffic: 2373 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6