cruzDB coordinates vs UCSC coordinates
2
0
Entering edit mode
5.1 years ago
nkinney06 ▴ 90

I am confused by a subtle difference in the cruzdb coordinates and UCSC coordinates. For those unfamiliar cruzdb is a python package for accessing the UCSC databases. Suppose I am interested in exon 3/7 for the gene TAF3. This exon extends from base pair 7963920-7965742 on chromosome 10. You can be sure of this by zooming in on the two ends of the exon in the UCSC genome browser:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr10%3A7963919-7963922&hgsid=651863061_L21GEzydvXVo8gitnxDdfRcTc4hi

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr10%3A7965740-7965743&hgsid=651863061_L21GEzydvXVo8gitnxDdfRcTc4hi

However; cruzdb appears to agree with the start of the exon but is off by one base at the stop:

import cruzdb
from cruzdb import Genome

g = cruzdb.Genome('hg38')

# case 1 only picks up the cds which appears to be the exon range according to cruzdb
chromosome = 'chr10'
start = 7963920
stop = 7965741
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 2 also pick up the intron on the left
chromosome = 'chr10'
start = 7963919
stop = 7965741
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 3 also picks up the intron on the right but this is the exon range according to UCSC
chromosome = 'chr10'
start = 7963920
stop = 7965742
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

I post this at the risk of looking really dumb, but I cant risk annotating hundreds of regions and be off by one base. Thanks!

cruzdb UCSC genome coordinates python • 1.4k views
ADD COMMENT
0
Entering edit mode

Update: for anyone that uses cruzdb you may want to look into this unsolved issue. I received this comment from the developer:

probably, these comparisons:
https://github.com/brentp/cruzdb/blob/master/cruzdb/models.py#L81 need
to use >= instead of >. If you verify that, I'll make the change or
accept a PR.
ADD REPLY
1
Entering edit mode
5.0 years ago
max ▴ 50

UCSC coordinates are 0-based and half-open internally (like cruzDB) but 1-based and closed in the html user interface.

http://genomewiki.ucsc.edu/index.php/Coordinate_Transforms

ADD COMMENT
0
Entering edit mode
5.1 years ago
nkinney06 ▴ 90

I'm surprised this question has received little attention as I am quite convinced there is an inconsistency in the cruzdb coordinate system; running the following code should be convincing enough.

import cruzdb
from cruzdb import Genome

g = cruzdb.Genome('hg38')

#####################################################################
# first an example of the error on the negative strand
#####################################################################

# case 1: this query says the only features are intron 
chromosome = 'chr9'
start = 35812101
stop = 35812102
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 2: this query says the features are intron and cds
# therefor we can conclude that 35812103 is the first base in the cds
chromosome = 'chr9'
start = 35812102
stop = 35812103
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 3: this query says the features are intron and cds which does
# not make since because we concluded that 35812103 was in the cds
chromosome = 'chr9'
start = 35812103
stop = 35812104
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

#####################################################################
# an example of the error on the positive strand
#####################################################################

# case 1: both bases in the cds
chromosome = 'chr7'
start = 128074502
stop = 128074503
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 2: cds and intron - conclude 128074501 is intron
chromosome = 'chr7'
start = 128074501
stop = 128074502
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features

# case 3:  cds and intron - inconsistent with the above
chromosome = 'chr7'
start = 128074500
stop = 128074501
gene = g.bin_query('refGene',chromosome,start,stop).all()
features = [x.features(start,stop) for x in gene]
print features
ADD COMMENT

Login before adding your answer.

Traffic: 1206 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6