Given a list of RefSeq Homo sapiens mRNA transcripts, what is the simplest way to get the corresponding sequences from other related species?
6.0 years ago

Hello!

I have a list of RefSeq accession numbers of mRNA transcripts, all from Homo sapiens, and would like, for each transcript, to get the corresponding transcript (and its sequence) from a number of related species such as mouse, rat, cow, and monkey.

I can't imagine that there isn't already a database somewhere for this, but I've spent hours googling and found only ALMOST what I want, like InParanoid, OrthoDB, Ensembl...

Surely many others before me have wished to do something like this? What am I missing?

Thanks in advance and merry weekend!

Joel

alignment transcriptome orthology • 1.5k views

You can use NCBI E-utilities via the EDirect command-line tools (the example uses BRCA2). The following is complicated (perhaps unnecessarily), but it will get you the sequence. Unfortunately, it retrieves the sequences of individual exons, each with only a generic FASTA header.

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | efetch -format docsum | \
xtract -pattern HomoloGeneData -element GeneID | \
xargs -n 1 sh -c 'efetch -db gene -id "$0" -format docsum' | \
xtract -pattern LocationHistType -element ChrAccVer -element ChrStart -element ChrStop | \
xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -format fasta'

I will think about alternate ways but someone else may come through in the meantime with an answer.


Here is another version. Same limitations.

esearch -db nuccore -query "NM_000059.3" | \
elink -target homologene | elink -target gene | efetch -format docsum | \
xtract -pattern DocumentSummary -element CommonName -element ChrAccVer -element ChrStart -element ChrStop
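
Since the question starts from a list of accessions, the same pipeline can be wrapped in a loop. This is only a sketch under assumptions: the function name `orthologs_for_list` and the input file `accessions.txt` (one RefSeq accession per line) are hypothetical, and the EDirect commands are the ones shown above, unchanged. The `< /dev/null` on `esearch` is a precaution so the command does not consume the loop's input.

```shell
# Hypothetical helper: run the elink pipeline once per accession in a file.
# $1 = text file with one RefSeq accession per line (e.g. accessions.txt).
orthologs_for_list() {
  while IFS= read -r acc; do
    # Skip empty lines in the input list.
    [ -z "$acc" ] && continue
    # Same pipeline as above; stdin is redirected so esearch cannot
    # swallow the remaining lines of the accession list.
    esearch -db nuccore -query "$acc" < /dev/null |
      elink -target homologene | elink -target gene | efetch -format docsum |
      xtract -pattern DocumentSummary -element CommonName -element ChrAccVer -element ChrStart -element ChrStop |
      # Prefix each output row with the query accession it came from.
      awk -v acc="$acc" '{ print acc "\t" $0 }'
  done < "$1"
}
```

Calling `orthologs_for_list accessions.txt > orthologs.tsv` would then give one tab-separated row per species and query accession. The same limitations apply.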

I am completely unfamiliar with these programs. Like you say, it seems very complicated. I would have thought a question like mine had been asked many times before. Why would the answer be so complex, then? :/


There are multiple sources for genome alignments across multiple genomes. UCSC has pairwise and multiple sequence alignments available for many genomes. Probably more useful are the HomoloGene alignments for proteins (BRCA2 example).


Sadly, I can't rely on the protein alignments, though I know they are better suited for homology in most cases. I must go by the transcripts, since what I'm interested in is the exact conservation of a base-pair sequence motif.

I'm starting to suspect that I shouldn't worry too much about finding the true orthologues, and instead just take the first transcript listed at NCBI for each gene... tedious though!
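
Once the transcripts are collected in one FASTA file, the exact-motif check itself could be sketched as below. This assumes a multi-FASTA file with one record per species; the function name `motif_hits` and the file name `transcripts.fa` are hypothetical. It looks for a literal, case-insensitive match only, with no mismatches or reverse complement.

```shell
# Hypothetical helper: report, per FASTA record, whether a motif occurs exactly.
# $1 = motif as a literal DNA string, $2 = multi-FASTA file.
motif_hits() {
  awk -v motif="$1" '
    # Print the record name and yes/no for the sequence collected so far.
    function report() {
      if (name != "")
        print name "\t" (index(toupper(seq), toupper(motif)) > 0 ? "yes" : "no")
    }
    # A header line closes the previous record and starts a new one.
    /^>/ { report(); name = substr($0, 2); seq = ""; next }
    # Sequence lines are joined so a motif spanning a line break is still found.
    { seq = seq $0 }
    END { report() }
  ' "$2"
}
```

For example, `motif_hits GATTACA transcripts.fa` would print one line per record with `yes` or `no` after the header text.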
