Given a list of RefSeq Homo sapiens mRNA transcripts, what is the simplest way to get the corresponding sequences from other related species?
Hello!

I have a list of RefSeq accession numbers of mRNA transcripts, all from Homo sapiens, and for each transcript I would like to get the corresponding transcript (and its sequence) from a number of related species such as mouse, rat, cow, monkey, etc.

I can't imagine that there isn't already a database somewhere for this, but I've spent hours googling and found only ALMOST what I want, like InParanoid, OrthoDB, Ensembl...

Surely many others before me have wished to do something like this? What am I missing?

Thanks in advance and merry weekend!

Joel

Using NCBI eutils (the example uses BRCA2). The following is complicated (perhaps unnecessarily), but it will get you the sequence. Unfortunately it retrieves the sequences of individual exons (with just a generic FASTA header).

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | efetch -format docsum | \
xtract -pattern HomoloGeneData -element GeneID | \
xargs -n 1 sh -c 'efetch -db gene -id "$0" -format docsum' | \
xtract -pattern LocationHistType -element ChrAccVer -element ChrStart -element ChrStop | \
xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'


I will think about alternate ways but someone else may come through in the meantime with an answer.

Here is another version. Same limitations.

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | elink -target gene | efetch -format docsum |\
xtract -pattern DocumentSummary -element CommonName -element ChrAccVer -element ChrStart -element ChrStop

I am completely unfamiliar with these programs. As you say, it seems very complicated. I would have thought that a question like mine had been asked many times before. Why would the answer be so complex, then? :/

There are several sources of alignments across multiple genomes. UCSC has pairwise and multiple sequence alignments available for many genomes. Probably more useful are the HomoloGene alignments for proteins (BRCA2 example).

Sadly I can't rely on the protein alignments though I know they are better suited for homology in most cases. I must go by the transcripts since what I'm interested in is the exact conservation of a base-pair sequence motif.
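
Once the transcript sequences are in a FASTA file, checking exact conservation of a motif needs no alignment. Below is a minimal sketch, not a tested pipeline: `motif_hits` is a hypothetical helper name (it is not part of EDirect), and it assumes a plain nucleotide motif with no ambiguity codes and FASTA on standard input. For each record it prints the header and the 1-based position of the first exact match, or 0 if the motif is absent.

```shell
# motif_hits: hypothetical helper, not part of EDirect.
# Usage: motif_hits MOTIF < sequences.fasta
# Assumes MOTIF is a literal string (no IUPAC ambiguity codes, no regex).
motif_hits() {
  awk -v motif="$1" '
    # Print header and 1-based position of the first exact match (0 = none).
    function report() {
      if (header != "")
        printf "%s\t%d\n", header, index(toupper(seq), toupper(motif))
    }
    # A new FASTA record: flush the previous one, then start collecting.
    /^>/ { report(); header = substr($0, 2); seq = ""; next }
    # Sequence lines may be wrapped, so join them before searching.
    { seq = seq $0 }
    END { report() }
  '
}
```

For example, `motif_hits ACGT < orthologs.fasta` (with `orthologs.fasta` standing in for whatever file holds the retrieved sequences) lists one line per species transcript. Only the given strand is searched; the reverse complement would need a separate pass.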

I'm starting to suspect that I shouldn't worry too much about finding the true orthologues, and instead just take the first transcript listed at NCBI for each gene... tedious though!