RefSeq accession IDs
3.0 years ago
chaochao ▴ 20

Hi,

I am using Entrez.esearch in Python to find the genomic sequence of a gene (APRT):

from Bio import Entrez
handle=Entrez.esearch(db="nucleotide", term="APRT [Gene] AND Homo sapien [Organism] AND RefSeq[Filter]")
record=Entrez.read(handle)

In doing so, I received multiple results, including records with accession IDs starting with NM, NC, NG, KR, CM, AY, etc. Since I am looking for the genomic sequence, I think the best ones would be those starting with NG, but there are still three NG entries: NG_008667, NG_008013 and NG_028266.

Can someone explain the differences between these entries and whether I should add another filter? I looked up these accession IDs but I cannot tell how they differ or which one is best.

Thank you!

RefSeq Gene • 953 views

NG_008667 is the GALNS gene.
NG_008013 is the APRT gene.
NG_028266 is the CDT1 gene.

3.0 years ago
GenoMax 142k

If you want the sequence of the genomic region, you can get it with EntrezDirect (the gene is on the minus strand, so the command requests the reverse complement; sequence truncated for space):

$ esearch -db gene -query "APRT [gene] AND human [orgn]" | efetch -format tabular | awk -F "\t" '{OFS="\t"}($1 == "9606"){print $12,$13,$14}' | xargs -n 3 sh -c 'efetch -db nuccore -id "$0" -seq_start "$1" -seq_stop "$2" -revcomp -format fasta'
>NC_000016.10:c88811928-88809339 Homo sapiens chromosome 16, GRCh38.p13 Primary Assembly
GGGCTGCCGCTGGCTCTTCGCACGCGGCCATGGCCGACTCCGAGCTGCAGCTGGTTGAGCAGCGGATCCG
CAGCTTCCCCGACTTCCCCACCCCAGGCGTGGTATTCAGGTGCACGCACAGGCCGCCCTCGTGGCGCCCC
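
Since the question uses Python, here is a minimal sketch of the same sub-range request expressed as an E-utilities efetch URL. It only builds the URL and does not contact NCBI. The function name `build_region_url` is made up for illustration; the accession and coordinates are taken from the FASTA header above, and `strand=2` is the E-utilities way of asking for the minus strand, which should correspond to `-revcomp` in the command above.

```python
from urllib.parse import urlencode

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_region_url(accession, start, stop, minus_strand=False):
    """Return an efetch URL for a sub-range of a nuccore record as FASTA.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # efetch expects seq_start <= seq_stop; orientation is set via `strand`.
    low, high = sorted((int(start), int(stop)))
    params = {
        "db": "nuccore",
        "id": accession,
        "seq_start": low,
        "seq_stop": high,
        "strand": 2 if minus_strand else 1,  # 1 = plus strand, 2 = minus strand
        "rettype": "fasta",
        "retmode": "text",
    }
    return EFETCH_URL + "?" + urlencode(params)


# Coordinates copied from the header NC_000016.10:c88811928-88809339
url = build_region_url("NC_000016.10", 88811928, 88809339, minus_strand=True)
print(url)
```

The resulting URL can be opened with any HTTP client; with Biopython, the same parameters can be passed as keyword arguments to `Entrez.efetch`.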

If you want just the RefSeqGene entry (sequence truncated for space):

$ esearch -db gene -query "APRT [gene] AND human [orgn]" | elink -db gene -target nuccore -name gene_nuccore_refseqgene | efetch -format fasta
>NG_008013.1 Homo sapiens adenine phosphoribosyltransferase (APRT), RefSeqGene on chromosome 16
GGGCCGTCGCTCACCTGTTTACACGGGCTGGGCGTGGCTGCCCACAGCCCCTGGATCTGCCGCGCAGGAT
TCGGGAAGAAGGCCCCTCGGCAGCTGCAGACTTCAGCCTGGGCTCCTGCTGTGCGGGCGAAAAGGCCCAG
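
For a Python workflow, the gene-to-nuccore link step can be sketched as an E-utilities elink URL. This is only an illustration: `build_gene_link_url` is a hypothetical helper, the gene ID is a placeholder that you would replace with the ID returned by a search of the gene database, and nothing here contacts NCBI. The link name is the one used in the command above.

```python
from urllib.parse import urlencode

ELINK_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/elink.fcgi"


def build_gene_link_url(gene_id, linkname="gene_nuccore_refseqgene"):
    """Return an elink URL from a Gene record to linked nuccore records.

    Hypothetical helper: the name and signature are illustrative only.
    """
    params = {
        "dbfrom": "gene",      # source database
        "db": "nuccore",       # target database
        "id": str(gene_id),    # Gene ID obtained from a prior gene search
        "linkname": linkname,  # restricts the links to one link type
    }
    return ELINK_URL + "?" + urlencode(params)


# Placeholder value, NOT a real Gene ID; substitute the ID from your own search.
PLACEHOLDER_GENE_ID = "0"
link_url = build_gene_link_url(PLACEHOLDER_GENE_ID)
print(link_url)
```

Passing `linkname="gene_nuccore_refseqrna"` instead would request the transcript links rather than the RefSeqGene record.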

If you want the RefSeq transcript sequences (sequences truncated for space):

$ esearch -db gene -query "APRT [gene] AND human [orgn]" | elink -db gene -target nuccore -name gene_nuccore_refseqrna | efetch -format fasta
>NM_001030018.2 Homo sapiens adenine phosphoribosyltransferase (APRT), transcript variant 2, mRNA
GGGCTGCCGCTGGCTCTTCGCACGCGGCCATGGCCGACTCCGAGCTGCAGCTGGTTGAGCAGCGGATCCG
CAGCTTCCCCGACTTCCCCACCCCAGGCGTGGTATTCAGGGACATCTCGCCCGTCCTGAAGGACCCCGCC

>NM_000485.3 Homo sapiens adenine phosphoribosyltransferase (APRT), transcript variant 1, mRNA
GGGCTGCCGCTGGCTCTTCGCACGCGGCCATGGCCGACTCCGAGCTGCAGCTGGTTGAGCAGCGGATCCG
CAGCTTCCCCGACTTCCCCACCCCAGGCGTGGTATTCAGGGACATCTCGCCCGTCCTGAAGGACCCCGCC
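
The three commands above return NC_, NG_ and NM_ records respectively. If you end up with a mixed list of accessions in Python, a small sketch like the following can sort them by RefSeq prefix. The function names are made up for illustration and the prefix table is deliberately incomplete. Note that this only separates record types; it does not check which gene a record belongs to.

```python
# Partial table of RefSeq accession prefixes (not exhaustive).
REFSEQ_PREFIXES = {
    "NC_": "complete genomic molecule, e.g. a chromosome",
    "NG_": "incomplete genomic region (used by RefSeqGene records)",
    "NM_": "mRNA transcript",
    "NR_": "non-coding RNA",
    "NP_": "protein",
}


def describe_accession(accession):
    """Return a short description for a RefSeq accession, based on its prefix."""
    prefix = accession[:3].upper()
    return REFSEQ_PREFIXES.get(prefix, "prefix not in this table")


def keep_prefix(accessions, prefix):
    """Return only the accessions that start with the given prefix."""
    return [acc for acc in accessions if acc.upper().startswith(prefix.upper())]


# Accessions taken from the FASTA headers in this answer.
hits = ["NC_000016.10", "NG_008013.1", "NM_001030018.2", "NM_000485.3"]
genomic = keep_prefix(hits, "NG_")
print(genomic)
for acc in hits:
    print(acc, "->", describe_accession(acc))
```

Accessions without an underscore, such as the KR, CM and AY entries mentioned in the question, are GenBank records rather than RefSeq records and fall through to the default description here.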

Thank you! Do you know why the results include CDT1 and GALNS when using Entrez.esearch? Or is there any way I can eliminate those?
