How do I get the RefSeq IDs from a list of gene IDs?
4.0 years ago
tom5 • 0

I hope you're well. I have a list of Entrez Gene IDs (such as "426813" and "395451") and want to find the corresponding RefSeq protein IDs. My goal is to output a txt file with two columns: one for the original Entrez Gene ID and one for the corresponding RefSeq ID.

RNA-Seq • 931 views
4.0 years ago
vkkodali_ncbi ★ 3.7k

You can download the gene2refseq file, which contains these mappings, from the NCBI FTP site and parse it: https://ftp.ncbi.nlm.nih.gov/gene/DATA/gene2refseq.gz
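As a rough sketch of the parsing step, something like the following should work once gene2refseq.gz has been downloaded. It assumes the column layout given in the file's header line (GeneID in column 2, protein accession.version in column 6, "-" for missing values) and a plain-text list with one Gene ID per line. The function name gene2refseq_proteins is made up for this illustration.

```shell
# Hypothetical helper (not part of any NCBI tool): print "GeneID<TAB>protein accession"
# for every Gene ID listed in $1, using a local copy of gene2refseq.gz given as $2.
# Assumes GeneID is column 2 and protein accession.version is column 6.
gene2refseq_proteins() {
    ids_file=$1
    mapping_gz=$2
    gzip -dc "$mapping_gz" | awk -F'\t' -v OFS='\t' '
        NR == FNR { wanted[$1]; next }   # first input: the Gene ID list
        /^#/      { next }               # skip the header line of gene2refseq
        # keep rows for wanted genes that have a protein, once per gene/protein pair
        ($2 in wanted) && $6 != "-" && !seen[$2, $6]++ { print $2, $6 }
    ' "$ids_file" -
}

# Example call (file names are placeholders):
#   gene2refseq_proteins gene_id_list.txt gene2refseq.gz > gene2proteins_long.tsv
```

The de-duplication is there because the same gene/protein pair can appear on several rows of gene2refseq. The file covers all taxa and is large, so a single pass like this is usually preferable to grepping once per ID.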

If you just have a few gene IDs to work with, you can use Entrez Direct as follows:

# Read one Gene ID per line; write "GeneID<TAB>comma-separated protein accessions".
while read -r gid ; do
    printf '%s\t' "$gid"
    elink -db gene -id "$gid" -target protein -name gene_protein_refseq \
        | efetch -format acc \
        | paste -s -d ','
done < gene_id_list.txt > gene2proteins.tsv

Contents of gene2proteins.tsv:

396320  NP_990694.1,XP_015144082.1
395771  NP_990262.1,XP_025000385.1,XP_015133186.1,XP_015133180.1,XP_015133175.1

To get one protein accession per row, split the second column on commas:

awk 'BEGIN{FS="\t";OFS="\t"}{a=split($2,x,","); for (i=1;i<=a;++i) {print $1,x[i]}}' gene2proteins.tsv

Output:

396320  NP_990694.1
396320  XP_015144082.1
395771  NP_990262.1
395771  XP_025000385.1
395771  XP_015133186.1
395771  XP_015133180.1
395771  XP_015133175.1