Question: Given a list of RefSeq Homo sapiens mRNA transcripts, what is the simplest way to get the corresponding sequences from other related species?
9 months ago by
Joel Wallenius10 wrote:


I have a list of RefSeq accession numbers for mRNA transcripts, all from Homo sapiens. For each transcript, I would like to get the corresponding transcript (and its sequence) from a number of related species, such as mouse, rat, cow, and monkey.

I can't imagine that there isn't already a database somewhere for this, but I've spent hours googling and found only resources that do ALMOST what I want, like InParanoid, OrthoDB, and Ensembl.

Surely many others before me have wished to do something like this? What am I missing?

Thanks in advance and merry weekend!



Using NCBI eutils (the example uses BRCA2). The following is complicated (perhaps unnecessarily so), but it will get you the sequence. Unfortunately, it retrieves the sequences of individual exons, each with just a generic FASTA header.

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | efetch -format docsum | \
xtract -pattern HomoloGeneData -element GeneID | \
xargs -n 1 sh -c 'efetch -db gene -id "$0" -format docsum' | \
xtract -pattern LocationHistType -element ChrAccVer -element ChrStart -element ChrStop | \
xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'
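
One detail worth checking before trusting the coordinates passed to the last step: as far as I know, the ChrStart and ChrStop values in Gene document summaries are 0-based, and ChrStart is larger than ChrStop for genes on the minus strand. Below is a minimal sketch of a helper that normalizes the three tab-separated columns printed by the xtract step. The function name `normalize_coords` is hypothetical (it is not part of EDirect), and both coordinate conventions are assumptions to verify against your own output.

```shell
# Hypothetical helper, not part of EDirect.
# Input (stdin):  accession <TAB> ChrStart <TAB> ChrStop
# Output:         accession <TAB> from <TAB> to <TAB> plus|minus
# Assumptions: input coordinates are 0-based, and ChrStart > ChrStop
# marks a gene on the minus strand.
normalize_coords() {
  awk 'BEGIN { FS = OFS = "\t" }
       NF >= 3 {
         start = $2 + 1                 # 0-based -> 1-based
         stop  = $3 + 1
         if (start <= stop) print $1, start, stop, "plus"
         else               print $1, stop, start, "minus"
       }'
}
```

It would sit between the second xtract call and the final xargs call, which would then need to read four values per record instead of three. How the strand column is passed on depends on the efetch options available in your EDirect version, so check `efetch -help` before relying on it.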

I will think about alternate ways but someone else may come through in the meantime with an answer.

written 9 months ago by genomax60k

Here is another version. Same limitations.

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | elink -target gene | efetch -format docsum |\
xtract -pattern DocumentSummary -element CommonName -element ChrAccVer -element ChrStart -element ChrStop

modified 9 months ago • written 9 months ago by genomax60k

I am completely unfamiliar with these programs. Like you say it seems very complicated. I was thinking that a question such as mine would have been asked many times before. Why would the answer be so complex, then? :/

written 9 months ago by Joel Wallenius10

There are several sources of genome alignments across multiple species. UCSC has pairwise and multiple sequence alignments available for many genomes. Probably more useful are the HomoloGene alignments for proteins (BRCA2 example).

written 9 months ago by genomax60k

Sadly, I can't rely on the protein alignments, though I know they are better suited for homology in most cases. I must go by the transcripts, since what I'm interested in is the exact conservation of a base-pair sequence motif.

I'm starting to suspect that I shouldn't worry too much about finding the true orthologues, and instead just take the first transcript listed at NCBI for each gene... tedious though!
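
For the motif check itself, once the transcripts are in a multi-FASTA file, something like the following might be enough. This is only a rough sketch: the function name `motif_hits` is made up for illustration, it looks for an exact, case-insensitive match on the given strand only, and it assumes plain FASTA input without carriage returns.

```shell
# Hypothetical helper: for each record in a multi-FASTA stream on stdin,
# print the header and "yes" or "no" depending on whether the motif
# given as $1 occurs exactly (case-insensitively) in the sequence.
# Wrapped sequence lines are joined first, so a motif that spans a line
# break is still found. No reverse-complement search is done.
motif_hits() {
  awk -v motif="$1" '
    function report() {
      if (name != "")
        print name "\t" (index(toupper(seq), toupper(motif)) ? "yes" : "no")
    }
    /^>/ { report(); name = substr($0, 2); seq = ""; next }
         { seq = seq $0 }
    END  { report() }'
}
```

Usage would be along the lines of `motif_hits GTTT < transcripts.fasta`, where both the motif and the file name are placeholders.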

written 9 months ago by Joel Wallenius10