How to convert a long list of GenBank accession numbers for mRNA variants into a list of their cDNA ORF sequences?
6.0 years ago

I am a novice with very little background in bioinformatics.

I have a list of GenBank accession numbers for mRNA variants that I want to convert into a list of ORF sequences.

These are just a few of the sequences I need to deal with: NM_175063.5 NM_005789.3 NM_005789.3 NM_024516.3 NM_032144.2 NM_001911.2

If I put any one of these accession numbers into NCBI's ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/), the longest ORF it reports is the one I need. But I have to do this for a very large number of sequences, so doing it one by one is tedious. Does anyone know how to automate this? I don't know any tools such as R or Python yet, but a friend suggested Python for this kind of work (this is really the first bioinformatics task I need to tackle), so an answer using Python would be ideal. I can use PowerShell to run commands in a Linux environment.


You can use Entrez Direct to retrieve the sequences for your GenBank accession numbers.

Afterwards, you can work through this Biopython exercise on finding ORFs: https://munch-lab.org/2013/11/19/finding-open-reading-frames/
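If it helps to see the idea before working through that exercise, here is a minimal pure-Python sketch of a longest-ORF search. It is not ORFfinder's implementation and it is not taken from the linked exercise. It assumes the standard genetic code, ATG as the only start codon, and that an ORF runs from the first ATG in a frame to the next in-frame stop codon (stop included). The names `longest_orf` and `reverse_complement` are made up for this illustration. Since the inputs are mRNA, it only scans the forward strand unless you ask for both.

```python
# Stop codons of the standard genetic code (an assumption; other codes differ).
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(seq):
    """Return the reverse complement of a DNA string (A, C, G, T only)."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def longest_orf(seq, both_strands=False):
    """Return the longest ATG...stop ORF (stop codon included), or "" if none.

    Scans the three forward reading frames, and the three reverse-strand
    frames as well when both_strands is True.
    """
    seq = seq.upper().replace("U", "T")
    strands = [seq]
    if both_strands:
        strands.append(reverse_complement(seq))

    best = ""
    for strand in strands:
        for frame in range(3):
            start = None  # position of the first ATG since the last stop codon
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None
    return best
```

For example, `longest_orf("CCATGAAATTTGGGTAACC")` picks out the ATG-to-TAA stretch in the third reading frame. On real transcripts it is worth comparing a few results against ORFfinder by hand, since its default settings (minimum length, start codon choice) may not match these assumptions.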


To retrieve all the sequences for your accession list with Entrez Direct (the command-line interface to the NCBI E-utilities), you can do the following:

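# Write the accessions from the original post to list.txt, one per line
echo 'NM_175063.5 NM_005789.3 NM_024516.3 NM_032144.2 NM_001911.2' | tr ' ' '\n' > list.txt
# Post the accessions to the nuccore database, then fetch them as FASTA
cat list.txt \
| epost -db nuccore \
| efetch -format fasta \
> result.fasta
# Show each header line and the first sequence line that follows it
grep -A 1 '^>' result.fasta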

>NM_005789.3 Homo sapiens proteasome activator subunit 3 (PSME3), transcript variant 1, mRNA
GGAGTGCAGCGGCTGAGAAGGTCCCTTCGGTGAAGGCGAGTTCCGGGACAACAGAGAGGGCCGCACCGTT...
>NM_175063.5 Homo sapiens ER membrane protein complex subunit 10 (EMC10), transcript variant 1, mRNA
TGTTCCTCCCGGCGTGCTCCGCGGCTCTTGGCTCACAGCCGTCCCTTCGCTGGTGGGAAGAAGCCGAGAT...
>NM_024516.3 Homo sapiens PAXIP1 associated glutamate rich protein 1 (PAGR1), mRNA
GGCGCCGTGTCCGGGTGTGGAGAGGGGCGTCGTGGAAGCGAGAAGAGTGGCCCGTCCCTCTCCTCCCCCT...
>NM_001911.2 Homo sapiens cathepsin G (CTSG), mRNA
GCACAGCAGCAACTGACTGGGCAGCCTTTCAGGAAAGATGCAGCCACTCCTGCTTCTGCTGGCCTTTCTC...
>NM_032144.2 Homo sapiens RAB6C, member RAS oncogene family (RAB6C), mRNA
GCGCACTCAGCAGGTTGGGCTGCGGCGGCGGCGGCTGGGGAAGCCGAAGCGCCGCGCGTGAGAGATCCCG...

If you want to use this from Python, you can simply "import os", then put the command above in a string.
For example, if you name this string "cmd", you can run the Entrez Direct pipeline from Python like this:

cmd_result = os.popen(cmd).read()

Then use whichever Python package you prefer to call ORFs on these sequences.
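As a rough sketch of that Python step, the snippet below swaps `os.popen` for `subprocess.run`, which raises an error if the command fails, and splits the returned FASTA text into one sequence per accession. It assumes Entrez Direct is installed and `efetch` is on your PATH, and it passes the accessions as a comma-separated `-id` argument rather than going through `epost`. The function names `fetch_fasta` and `parse_fasta` are invented for this example.

```python
import subprocess


def fetch_fasta(accessions):
    """Fetch nucleotide records as FASTA text with Entrez Direct's efetch.

    Assumes `efetch` is installed and on PATH; needs network access.
    """
    result = subprocess.run(
        ["efetch", "-db", "nuccore", "-id", ",".join(accessions), "-format", "fasta"],
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError if efetch exits with an error
    )
    return result.stdout


def parse_fasta(text):
    """Parse multi-FASTA text into a dict of {record id: sequence string}.

    The record id is the first whitespace-delimited word after '>'.
    """
    records = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {name: "".join(parts) for name, parts in records.items()}
```

Usage would look something like `sequences = parse_fasta(fetch_fasta(["NM_175063.5", "NM_005789.3"]))`, after which each value in `sequences` can be handed to whatever ORF-calling code you choose. For a very long accession list it is probably safer to fetch in batches than to put every ID on one command line.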
