I'm trying to retrieve the 3'UTR sequence of a given RefSeq id from the UCSC genome browser.
From its main web interface, this task can be easily accomplished selecting the "refGene" table and "RefSeq" track. However I'm trying to automate the retrieval process, so that the final user will not interact with the UCSC's web interface directly, but rather launch a script to obtain the corresponding 3'UTR sequence of some provided RefSeq identifier.
The UCSC genome browser web interface is therefore off the table.
Looking around I found the possibility to interact with the UCSC via MySQL queries, but since I'm only interested in the 3'UTR sequence, and since I cannot retrieve a genomic sequence of DNA only with a SQL query, I'm relying on the DAS server, as pointed out in this answer.
I therefore obtained an example 3'UTR sequence from the UCSC web interface, and tested if I could retrieve the same sequence through the DAS server:
when retrieved from the UCSC web interface, the RefSeq id "NM_005504" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header
>hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0 strand=- repeatMasking=none
the range "chr12:24962958-24970941" from the previous header can be retrieved from UCSC's hg19 table through the MySQL query
select g.chrom, g.txStart, g.cdsStart from refGene g, knownToRefSeq r where g.name = 'NM_005504' AND r.value = g.name;
I now reuse this information to query the DAS server, and test if I can retrieve the same 3'UTR sequence:
and I obtain a 7985 nt long sequence.
The two retrieval methods return indeed a sequence of the same length, but the two do not coincide.
What am I missing?