How To Get Intron Positions Of A RefSeq Protein With Python
13.7 years ago
Dror ▴ 280

I have a list of RefSeq IDs and I want to get the intron positions relative to the protein sequence. Does anyone have a Python script that grabs the introns from the genomic reference of a RefSeq gene and reports their positions in the protein?

intron refseq genomics entrez python • 4.4k views

I might have a solution for this. Can you provide some of your RefSeq IDs to test it on?

13.7 years ago

UCSC has already computed this table: see refGene.txt.gz and refGene.sql in the hg19 database directory (http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/).

The table contains the exon start and end positions as comma-separated lists; you then "just have to" reconstruct the protein sequence from the reference sequences.

curl -s  "http://hgdownload.cse.ucsc.edu/goldenPath/hg19/database/refGene.txt.gz" | gunzip -c | head

971 NR_024227   chr19   -   50595745    50595866    50595866    50595866    1   50595745,   50595866,   0   SNAR-A6 unk unk -1,
971 NR_024227   chr19   -   50601082    50601203    50601203    50601203    1   50601082,   50601203,   0   SNAR-A6 unk unk -1,
629 NM_001014809    chr4    -   5822491 5894785 5823486 5894696 14  5822491,5827220,5830215,5837641,5838491,5841248,5843034,5844819,5851118,5853134,5857869,5862752,5868394,5894315,    5823578,5827386,5830395,5837812,5838633,5841405,5843155,5844888,5851199,5853196,5858034,5862937,5868483,5894785,    0   CRMP1   cmpl    cmpl    1,0,0,0,2,1,0,0,0,1,1,2,0,0,
808 NM_001029883    chr2    -   29284557    29297127    29287734    29297127    2   29284557,29293459,  29287933,29297127,  0   C2orf71 cmpl    cmpl    2,0,
705 NM_024329   chr1    +   15736390    15756839    15736467    15755220    4   15736390,15752366,15753645,15755088,    15736775,15752514,15753780,15756839,    0   EFHD2   cmpl    cmpl    0,2,0,0,
768 NM_024328   chr14   +   24025197    24028786    24025966    24028049    2   24025197,24027903,  24026513,24028786,  0   THTPA   cmpl    cmpl    0,1,
1379    NM_024326   chr10   +   104179570   104182893   104180886   104182750   4   104179570,104181110,104181543,104182560,    104180939,104181264,104182049,104182893,    0   FBXL15  cmpl    cmpl    0,2,0,2,
826 NM_138275   chr6    +   31691160    31692850    31691160    31692850    4   31691160,31691415,31692541,31692746,    31691221,31691763,31692621,31692850,    0   C6orf25 cmpl    incmpl  0,1,1,0,
609 NM_138275   chr6_cox_hap2   +   3200777 3202467 3200777 3202467 4   3200777,3201032,3202158,3202363,    3200838,3201380,3202238,3202467,    0   C6orf25 cmpl    incmpl  0,1,1,0,
607 NM_138275   chr6_dbb_hap3   +   2976730 2978420 2976730 2978420 4   2976730,2976985,2978111,2978316,    2976791,2977333,2978191,2978420,    0   C6orf25 cmpl    incmpl  0,1,1,0,
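
To go from a row like the ones above to intron positions in the protein, something along these lines might work. It is only a sketch: it assumes the usual refGene column order (bin, name, chrom, strand, txStart, txEnd, cdsStart, cdsEnd, exonCount, exonStarts, exonEnds, ...) with 0-based, half-open coordinates, and the function name `intron_positions_in_protein` is made up for illustration. Each exon is clipped to the coding region, the coding pieces are walked in transcript order, and every intron is reported as (number of complete codons before it, phase), where phase is the number of bases of a split codon that lie before the intron.

```python
def intron_positions_in_protein(refgene_line):
    """Return (accession, [(codons_before_intron, phase), ...]) for one refGene row.

    Assumes the standard refGene column order and 0-based half-open coordinates.
    """
    fields = refgene_line.split()
    name, strand = fields[1], fields[3]
    cds_start, cds_end = int(fields[6]), int(fields[7])
    exon_starts = [int(x) for x in fields[9].rstrip(",").split(",")]
    exon_ends = [int(x) for x in fields[10].rstrip(",").split(",")]

    # Keep only the coding part of each exon (drops UTR-only exons).
    coding_lengths = []
    for start, end in zip(exon_starts, exon_ends):
        start, end = max(start, cds_start), min(end, cds_end)
        if start < end:
            coding_lengths.append(end - start)

    # Exons are stored in genomic order; flip them for minus-strand genes.
    if strand == "-":
        coding_lengths.reverse()

    positions = []
    coding_bases = 0
    for length in coding_lengths[:-1]:  # an intron follows every coding exon but the last
        coding_bases += length
        positions.append((coding_bases // 3, coding_bases % 3))
    return name, positions


# The EFHD2 row from the output above, whitespace-separated.
EXAMPLE_LINE = (
    "705 NM_024329 chr1 + 15736390 15756839 15736467 15755220 4 "
    "15736390,15752366,15753645,15755088, "
    "15736775,15752514,15753780,15756839, 0 EFHD2 cmpl cmpl 0,2,0,0,"
)

print(intron_positions_in_protein(EXAMPLE_LINE))
# -> ('NM_024329', [(102, 2), (152, 0), (197, 0)])
```

The phases (2, 0, 0) agree with the exonFrames column of that row (0,2,0,0), which is a quick sanity check. Non-coding NR_ entries, where cdsStart equals cdsEnd, come back with an empty list. Introns that fall entirely inside a UTR are ignored, since they have no position in the protein.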

Yes, but this contains only a fraction of the RefSeq data. I need to be able to do this for any organism in RefSeq, such as cnidarians and Trichoplax.
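
For organisms without a UCSC table, one possible route is the CDS feature of the GenBank-format genomic record, whose location string (for example `join(1..308,1001..1148)` or `complement(join(...))`) lists the coding segments directly. The sketch below only covers the arithmetic. It assumes 1-based inclusive coordinates, a gene on a single strand of a single sequence, and segments written as `start..end`; fetching the record and picking out the CDS for a given protein ID is left out. The function name `intron_positions_from_cds_location` is hypothetical.

```python
import re


def intron_positions_from_cds_location(location):
    """Return [(codons_before_intron, phase), ...] for a GenBank CDS location string.

    Assumes 1-based inclusive 'start..end' segments, all on one strand of one sequence.
    """
    # Pull out every start..end pair, tolerating the '<' and '>' partial markers.
    segments = [(int(a), int(b)) for a, b in re.findall(r"<?(\d+)\.\.>?(\d+)", location)]

    # Put segments in genomic order, then flip for minus-strand features.
    segments.sort()
    if "complement" in location:
        segments.reverse()

    positions = []
    coding_bases = 0
    for start, end in segments[:-1]:  # an intron follows every segment but the last
        coding_bases += end - start + 1
        positions.append((coding_bases // 3, coding_bases % 3))
    return positions


print(intron_positions_from_cds_location("join(1..308,1001..1148)"))
# -> [(102, 2)]
print(intron_positions_from_cds_location("complement(join(1..199,1001..4668))"))
# -> [(1222, 2)]
```

Trans-spliced genes, features that cross the origin of a circular sequence, and single-base segments are not handled here and would need extra care.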
