retrieving the 3' UTR sequence from the UCSC genome browser via DAS server
1
0
Entering edit mode
4.8 years ago
bagnacan • 0

I'm trying to retrieve the 3'UTR sequence of a given RefSeq id from the UCSC genome browser.
From its main web interface, this task can be easily accomplished selecting the "refGene" table and "RefSeq" track. However I'm trying to automate the retrieval process, so that the final user will not interact with the UCSC's web interface directly, but rather launch a script to obtain the corresponding 3'UTR sequence of some provided RefSeq identifier.
The UCSC genome browser web interface is therefore off the table.

Looking around I found the possibility to interact with the UCSC via MySQL queries, but since I'm only interested in the 3'UTR sequence, and since I cannot retrieve a genomic sequence of DNA only with a SQL query, I'm relying on the DAS server, as pointed out in this answer.

I therefore obtained an example 3'UTR sequence from the UCSC web interface, and tested if I could retrieve the same sequence through the DAS server:

• when retrieved from the UCSC web interface, the RefSeq id "NM_005504" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header

>hg19_refGene_NM_005504 range=chr12:24962958-24970941 5'pad=0 3'pad=0 strand=- repeatMasking=none

• the range "chr12:24962958-24970941" from the previous header can be retrieved from UCSC's hg19 table through the MySQL query

select g.chrom, g.txStart, g.cdsStart from refGene g, knownToRefSeq r where g.name = 'NM_005504' AND r.value = g.name;

• I now reuse this information to query the DAS server, and test if I can retrieve the same 3'UTR sequence:

http://genome.ucsc.edu/cgi-bin/das/hg19/dna?segment=chr12:24962957,24970941

and I obtain a 7985 nt long sequence.

The two retrieval methods return indeed a sequence of the same length, but the two do not coincide.
What am I missing?

ucsc mySQL • 2.9k views
0
Entering edit mode

You should probably consider the fact that DAS server is using +1 index for the first base, as mentioned in the link you posted.

0
Entering edit mode

Thanks. You're right, I forgot to mention I tried to shift the 7985 nt long window, but the issue remains: considering the DAS server indexing from 1 instead of 0, I modified the range that I retrieved through the MySQL query from 24962958-24970941 to 24962959-24970942, but the sequences still do not coincide

0
Entering edit mode

when retrieved from the UCSC web interface, the RefSeq id "NM_005104" is associated with a 7985 nt long 3'UTR FASTA sequence carrying the header

If I search hg19 build with NM_005104 at UCSC I am getting a completely different region on chr 6. But if I use NM_005504 then that refers to chr12. Looks like there are two separate accession #.

0
Entering edit mode

Yes, sorry, it was a typo in my explanation: the accession number whose sequence I was trying to retrieve was the NM_005504, not NM_005104. I edited my post to remove the typo. Thank you!

1
Entering edit mode
4.7 years ago
genecats.ucsc ▴ 580

The issue here is that the transcript you are interested in lies on the negative strand, and so the Table Browser returns a reverse complemented sequence compared to what you are seeing from DAS, which is a simple genomic query. If you reverse complement one of the sequences you will find that they are the same.

ChrisL from the UCSC Genome Browser

0
Entering edit mode

Thank you, this solved the issue