Question

How can I convert a long list of GenBank accession numbers for mRNA variants into a list of their cDNA ORF sequences?

6.0 years ago

chanwoo1143 • 0

I am a novice with little background in bioinformatics.

I have a list of GenBank accession numbers for mRNA variants, which I want to convert to a list of ORF sequences.

These are just a few of the sequences I need to deal with: NM_175063.5 NM_005789.3 NM_005789.3 NM_024516.3 NM_032144.2 NM_001911.2

If I put any one of these accession numbers into NCBI's ORFfinder (https://www.ncbi.nlm.nih.gov/orffinder/), the longest ORF that comes up is the one I need. But I have to do this for a great many sequences, so going one by one is very tedious. Does anyone know how to automate this? I don't know any tools such as R or Python, but a friend suggested Python for this kind of work (this is the very first bioinformatics step I need to tackle), so an answer using Python would be preferable. I can use PowerShell to run commands in a Linux environment.

sequence gene • 1.5k views


You can use Entrez Direct to retrieve the sequences for your GenBank accession numbers.

Afterward, you can work through this Biopython exercise to find ORFs: https://munch-lab.org/2013/11/19/finding-open-reading-frames/

6.0 years ago by Bastien Hervé 5.3k

To retrieve all the sequences corresponding to your accession list with Eutils, you can do the following:

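# Write the accessions from the original post to a file
echo 'NM_175063.5 NM_005789.3 NM_024516.3 NM_032144.2 NM_001911.2' > list.txt
# Put one accession per line (GNU sed syntax)
sed -i 's/ /\n/g' list.txt
# Post the accessions to the nuccore database, then fetch the records as FASTA
cat list.txt \
| epost -db nuccore \
| efetch -format fasta \
> result.fasta
# Show each header followed by the first line of its sequence
grep -A 1 '^>' result.fasta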

>NM_005789.3 Homo sapiens proteasome activator subunit 3 (PSME3), transcript variant 1, mRNA
GGAGTGCAGCGGCTGAGAAGGTCCCTTCGGTGAAGGCGAGTTCCGGGACAACAGAGAGGGCCGCACCGTT...
>NM_175063.5 Homo sapiens ER membrane protein complex subunit 10 (EMC10), transcript variant 1, mRNA
TGTTCCTCCCGGCGTGCTCCGCGGCTCTTGGCTCACAGCCGTCCCTTCGCTGGTGGGAAGAAGCCGAGAT...
>NM_024516.3 Homo sapiens PAXIP1 associated glutamate rich protein 1 (PAGR1), mRNA
GGCGCCGTGTCCGGGTGTGGAGAGGGGCGTCGTGGAAGCGAGAAGAGTGGCCCGTCCCTCTCCTCCCCCT...
>NM_001911.2 Homo sapiens cathepsin G (CTSG), mRNA
GCACAGCAGCAACTGACTGGGCAGCCTTTCAGGAAAGATGCAGCCACTCCTGCTTCTGCTGGCCTTTCTC...
>NM_032144.2 Homo sapiens RAB6C, member RAS oncogene family (RAB6C), mRNA
GCGCACTCAGCAGGTTGGGCTGCGGCGGCGGCGGCTGGGGAAGCCGAAGCGCCGCGCGTGAGAGATCCCG...

If you want to use this inside Python, you can simply "import os", then write the command above in a string.
For example, if you named this string "cmd", you can call Eutils like this inside Python:

cmd_result = os.popen(cmd).read()
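The text captured in cmd_result is plain FASTA, so it helps to split it into one sequence per accession before doing anything else. Below is a minimal sketch of that step using only the standard library. The function name parse_fasta and the toy records are made up for illustration; in practice you would pass cmd_result instead of the example string.

```python
def parse_fasta(text):
    """Split FASTA-formatted text into a dict of {accession: sequence}."""
    records = {}
    accession = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # The accession is the first whitespace-delimited token of the header
            accession = line[1:].split()[0]
            records[accession] = []
        elif accession is not None:
            # Sequence lines are collected and joined once at the end
            records[accession].append(line.upper())
    return {acc: "".join(chunks) for acc, chunks in records.items()}


# Made-up stand-in for the text returned by os.popen(cmd).read()
example = ">NM_000000.1 made-up record\nACGT\nacgt\n>NM_000000.2 another\nGGCC\n"
sequences = parse_fasta(example)
```

Here sequences maps each accession to its full sequence as a single uppercase string, which is a convenient shape for looping over many records.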

Then use whichever Python package you prefer to call ORFs on these sequences.
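For the ORF-calling step, here is a rough pure-Python sketch of picking the longest ORF from one mRNA sequence. It rests on assumptions that may not match ORFfinder's settings: only the forward strand is scanned, an ORF starts at the first ATG after the previous stop and ends at the next in-frame TAA, TAG or TGA, and ORFs that run off the end without a stop codon are ignored. The name longest_orf and the toy sequence are hypothetical.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def longest_orf(seq):
    """Return the longest ATG...stop ORF on the forward strand, or "" if none."""
    seq = seq.upper()
    best = ""
    for frame in range(3):
        start = None  # position of the first ATG seen since the last stop
        for i in range(frame, len(seq) - 2, 3):
            codon = seq[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                orf = seq[start:i + 3]  # include the stop codon
                if len(orf) > len(best):
                    best = orf
                start = None  # look for the next ORF in this frame
    return best


# Toy sequence (not a real transcript) containing two ORFs in different frames
toy = "CCATGAAATTTGGGTAACCATGCCCTGA"
print(longest_orf(toy))
```

To process a whole list, you could call longest_orf on each sequence fetched above and compare a few results against ORFfinder by hand before trusting the rest.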

6.0 years ago by erwan.scaon ▴ 940