Question

Extract nucleotide sequence from a RefSeq Transcript ID

0

21 months ago

LauferVA 4.2k

Hello,

Suppose I want to extract a nucleotide sequence from a specific transcript isoform of EGFR. I could do something fairly manual: navigate to https://www.ncbi.nlm.nih.gov/nuccore/NM_001346941.2, scroll down, count the nucleotides, then cut and paste.

However, I feel there must be (probably many) programmatic ways to extract, for example, the 1101st to 1217th nucleotides of this transcript.

I looked around and found things like biomartr::is.genome.available() but this appears to be for higher level downloading, like getting all the transcripts by organism.

I must be missing something. Is there a tool out there that, if given something like download_refseq_nt_sequence(NM_001346941.2, '1101', '1217'), will return the actual sequence?

It could be R, Python, bash, or a web tool; I can use any of them.

Thank you very much.

nucleotide refseq transcript sequence entrez • 1.6k views

ADD COMMENT • link 21 months ago by LauferVA 4.2k

score 2 · Accepted Answer · 2022-08-07

2

21 months ago

GenoMax 142k

Using Entrez Direct (EDirect). Example for getting nucleotides 1101 to 1120:

$ efetch -db nuccore -id NM_001346941.2 -seq_start 1101 -seq_stop 1120 -format fasta
>NM_001346941.2:1101-1120 Homo sapiens epidermal growth factor receptor (EGFR), transcript variant EGFRvIII, mRNA
GGAGTTTGTGGAGAACTCTG

ADD COMMENT • link 21 months ago by GenoMax 142k
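
The question asked for something shaped like download_refseq_nt_sequence(accession, start, stop). As a minimal sketch, the efetch call from the answer above could be wrapped in a shell function of that name. The function name and its argument checks are illustrative assumptions, not part of EDirect; only the efetch flags are taken from the answer. It assumes EDirect is installed and on the PATH.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper (name borrowed from the question). It only forwards
# its arguments to EDirect's efetch, using the flags shown in the answer.
download_refseq_nt_sequence() {
    local acc="$1" start="$2" stop="$3"

    if [ -z "$acc" ]; then
        echo "usage: download_refseq_nt_sequence ACCESSION START STOP" >&2
        return 2
    fi

    # Coordinates are 1-based and inclusive, so both must be positive integers.
    case "$start" in ''|*[!0-9]*|0*) echo "error: bad START '$start'" >&2; return 2 ;; esac
    case "$stop"  in ''|*[!0-9]*|0*) echo "error: bad STOP '$stop'" >&2; return 2 ;; esac

    if [ "$start" -gt "$stop" ]; then
        echo "error: START ($start) is greater than STOP ($stop)" >&2
        return 2
    fi

    efetch -db nuccore -id "$acc" -seq_start "$start" -seq_stop "$stop" -format fasta
}
```

With this defined, `download_refseq_nt_sequence NM_001346941.2 1101 1217` would request the range mentioned in the question, returned as FASTA.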

0

Thank you! This was so helpful. Makes me think we convenient browser based tools etc. I really appreciate you.

ADD REPLY • link 21 months ago by LauferVA 4.2k

0

Suppose I wish to start from an amino acid position instead, but still pull nucleotides (or vice versa).

Is there any way to grab variant positions neatly?

ADD REPLY • link 21 months ago by LauferVA 4.2k

0

Looks like I may want something like this (piping through efetch): esearch -db gene -query "BRCA2 [GENE] AND human [ORGN]" | efetch -format docsum |

ADD REPLY • link 21 months ago by LauferVA 4.2k

0

Can you provide an example?

ADD REPLY • link 21 months ago by GenoMax 142k

0

Hey Geno! Sure. Thanks so much for following up.

Suppose what I have is:

NM_001346941             p.N550H

.. but what I want is:

NM_001346941             c.1648A>C

or better yet NM_001346941 (some # of NTs before)C(some # of NTs after) e.g.

NM_001346941             taaCggt

or even rsID:

NM_001346941             rs18448194

ADD REPLY • link 21 months ago by LauferVA 4.2k
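
The protein-to-coding-position step in the example above is plain arithmetic: codon n of the coding sequence covers c.(3n-2) through c.(3n), so residue 550 covers c.1648 to c.1650, which agrees with c.1648A>C. A minimal sketch of that conversion follows. The function name and its optional second argument are hypothetical. The second argument is the 1-based transcript position where the CDS begins; it defaults to 1 and would have to be read from the CDS feature of the actual record before the result could be used as transcript coordinates.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the first and last nucleotide positions of the
# codon that encodes amino acid number AA_POS.
#   $1  AA_POS     1-based residue number (e.g. 550 for p.N550H)
#   $2  CDS_START  assumed 1-based transcript position of the first CDS base;
#                  the default of 1 yields c. (coding) coordinates
codon_cds_range() {
    local aa_pos="$1" cds_start="${2:-1}"

    case "$aa_pos"    in ''|*[!0-9]*|0*) echo "error: bad AA_POS '$aa_pos'" >&2; return 2 ;; esac
    case "$cds_start" in ''|*[!0-9]*|0*) echo "error: bad CDS_START '$cds_start'" >&2; return 2 ;; esac

    # Codon n spans c.(3n-2)..c.(3n); shift by the CDS offset if one is given.
    local first=$(( 3 * aa_pos - 2 + cds_start - 1 ))
    echo "$first $(( first + 2 ))"
}
```

Here `codon_cds_range 550` prints `1648 1650`. Once the CDS start of the transcript is known, the shifted pair could be passed to efetch as -seq_start and -seq_stop to pull the codon itself.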

1

How about this (output truncated for space)? The first column is the rsID.

$ esearch -db nuccore -query NM_001346941 | elink -target gene | elink -target snp | esummary | xtract -pattern DocumentSummary -element SNP_ID,DOCSUM
1491558880      HGVS=NC_000007.14:g.55157614_55157619dup,NC_000007.13:g.55225307_55225312dup,NG_007726.3:g.143583_143588dup|SEQ=[-/AAGAAA]|LEN=9|GENE=EGFR:1956
5884400 HGVS=NC_000007.14:g.55123886TG[5],NC_000007.14:g.55123886TG[6],NC_000007.14:g.55123886TG[8],NC_000007.13:g.55191579TG[5],NC_000007.13:g.55191579TG[6],NC_000007.13:g.55191579TG[8],NG_007726.3:g.109855TG[5],NG_007726.3:g.109855TG[6],NG_007726.3:g.109855TG[8]|SEQ=[TGTG/-/TG/TGTGTG]|LEN=14|GENE=EGFR:1956
34058394        HGVS=NC_000007.14:g.55116538CA[6],NC_000007.14:g.55116538CA[7],NC_000007.14:g.55116538CA[8],NC_000007.14:g.55116538CA[9],NC_000007.14:g.55116538CA[11],NC_000007.14:g.55116538CA[12],NC_000007.14:g.55116538CA[13],NC_000007.14:g.55116538CA[14],NC_000007.14:g.55116538CA[15],NC_000007.13:g.55184231CA[6],NC_000007.13:g.55184231CA[7],NC_000007.13:g.55184231CA[8],NC_000007.13:g.55184231CA[9],NC_000007.13:g.55184231CA[11],NC_000007.13:g.55184231CA[12],NC_000007.13:g.55184231CA[13],NC_000007.13:g.55184231CA[14],NC_000007.13:g.55184231CA[15],NG_007726.3:g.102507CA[6],NG_007726.3:g.102507CA[7],NG_007726.3:g.102507CA[8],NG_007726.3:g.102507CA[9],NG_007726.3:g.102507CA[11],NG_007726.3:g.102507CA[12],NG_007726.3:g.102507CA[13],NG_007726.3:g.102507CA[14],NG_007726.3:g.102507CA[15]|SEQ=[CACACACA/-/CA/CACA/CACACA/CACACACACA/CACACACACACA/CACACACACACACA/CACACACACACACACA/CACACACACACACACACA]|LEN=21|GENE=EGFR:1956
1491516373      HGVS=NC_000007.14:g.55157425del,NC_000007.14:g.55157425dup,NC_000007.13:g.55225118del,NC_000007.13:g.55225118dup,NG_007726.3:g.143394del,NG_007726.3:g.143394dup|SEQ=[G/-/GG]|LEN=6|GENE=EGFR:1956
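
The DOCSUM column above packs several fields into one string separated by `|`. As a rough sketch, the output could be reduced to one line per variant holding just the rsID and the allele string. The function name is hypothetical, and it assumes the two xtract columns are tab-separated and that the field layout matches the lines shown here (HGVS=...|SEQ=...|LEN=...|GENE=...).

```shell
#!/usr/bin/env bash
# Hypothetical filter: read "SNP_ID<TAB>DOCSUM" lines on stdin and print
# "rs<SNP_ID><TAB><alleles>", where <alleles> is the value of the SEQ= field.
docsum_alleles() {
    awk -F'\t' '
        {
            seq = ""
            # DOCSUM fields are separated by "|"; find the one starting with SEQ=
            n = split($2, parts, "[|]")
            for (i = 1; i <= n; i++)
                if (parts[i] ~ /^SEQ=/)
                    seq = substr(parts[i], 5)   # drop the leading "SEQ="
            print "rs" $1 "\t" seq
        }'
}
```

It would be appended to the pipeline above as a final `| docsum_alleles` stage. Lines without a SEQ= field come out with an empty second column rather than being dropped.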