how to get cruzbd (cds_sequence) dna translation
2
1
Entering edit mode
5.1 years ago
nkinney06 ▴ 90

I have a genomic region of interest and I am trying to retrieve the gene (if its in one) and amino acid sequence (if its in a translated region). Suppose my region of interest is:

chromosome = "chr19" start = 55014736 stop = 55014753

If I put this into USCS genome browser by hand I see it's in gene GP6 and the translation is LCVSLC. I can use cruzdb and retrieve the correct gene but the translation of cds_sequence does not contain the LCVSLC motif. The program I tried is posted below and you can test that the following command produces no output:

python myCruzdbProgram.py | awk '/LCVSLC/'


How do I simply get the translation programatically for my region of interest?

import cruzdb
from cruzdb import Genome

def translate_dna(sequence):
gencode = { 'ATA':'I',
'ATC':'I', 'ATT':'I', 'ATG':'M', 'ACA':'T', 'ACC':'T', 'ACG':'T',
'ACT':'T', 'AAC':'N', 'AAT':'N', 'AAA':'K', 'AAG':'K', 'AGC':'S',
'AGT':'S', 'AGA':'R', 'AGG':'R', 'CTA':'L', 'CTC':'L', 'CTG':'L',
'CTT':'L', 'CCA':'P', 'CCC':'P', 'CCG':'P', 'CCT':'P', 'CAC':'H',
'CAT':'H', 'CAA':'Q', 'CAG':'Q', 'CGA':'R', 'CGC':'R', 'CGG':'R',
'CGT':'R', 'GTA':'V', 'GTC':'V', 'GTG':'V', 'GTT':'V', 'GCA':'A',
'GCC':'A', 'GCG':'A', 'GCT':'A', 'GAC':'D', 'GAT':'D', 'GAA':'E',
'GAG':'E', 'GGA':'G', 'GGC':'G', 'GGG':'G', 'GGT':'G', 'TCA':'S',
'TCC':'S', 'TCG':'S', 'TCT':'S', 'TTC':'F', 'TTT':'F', 'TTA':'L',
'TTG':'L', 'TAC':'Y', 'TAT':'Y', 'TAA':'\_', 'TAG':'\_', 'TGC':'C',
'TGT':'C', 'TGA':'_', 'TGG':'W', }
proteinseq = ''
for n in range(0,len(sequence),3):
if gencode.has_key(sequence[n:n+3]) == True:
proteinseq += gencode[sequence[n:n+3]]
return proteinseq

g = cruzdb.Genome('hg38')

# this is my region of interest
chromosome = "chr19"
start = 55014736
stop = 55014753

# retrieve annotation of the region with cruzdb
gene = g.bin_query('refGene',chromosome,start,stop).all()

# print the gene name ( it does find the correvt gene )
print gene[0].name2

# get the translated sequence
sequence = "".join(gene[0].cds_sequence)
print sequence

# translation of my 18 base pair region should be LCVSLC but it is not
print translate_dna(sequence.upper())

python cruzdb • 1.5k views
1
Entering edit mode
5.1 years ago
genecats.ucsc ▴ 580

The first issue here is the coordinates you are interested in are on the negative strand, so you'll want to reverse complement the dna you obtain before trying to translate it.

Secondly, you can do all of this via UCSC tools like so:

\$ twoBitToFa -seq=chr19 -start=55014735 -end=55014753 http://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/hg38.2bit stdout  |  faRc stdin stdout | faTrans stdin stdout
>RC_chr19:55014735-55014753
CLSVCL


The twoBitToFa, faTrans and faRc commands are all available from our directory of utilities, in the directory appropriate for your operating system:

If you are interested in querying the refGene table, it may be best to just download it and operate on it on locally your system so as not to violate our Conditions of Use:
http://genome.ucsc.edu/goldenPath/help/mysql.html

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

• General questions: genome@soe.ucsc.edu
• Questions involving private data: genome-www@soe.ucsc.edu
• Questions involving mirror sites: genome-mirror@ose.ucsc.edu

ChrisL from the UCSC Genome Browser

0
Entering edit mode

thank you! I have done both; but, I have a followup question. If we take another look at the region I used as an example:

chromosome = "chr19"
start = 55014736
stop = 55014753


and compare to the entry in the refGene.txt file from UCSC we see that this region is between the transcription start and end coordinates but not between the cds start and end coordinates.

field       entry_in_refGene.txt
name        NM_016363
chrom       chr19
strand      -
txStart     55013706
txEnd       55038264
cdsStart    55014920
cdsEnd      55038236
exonCount   8
...


I conclude that my region is transcribed but not translated. However, If I look up my region on the UCSC genome browser:

https://genome.ucsc.edu/cgi-bin/hgTracks?db=hg38&lastVirtModeType=default&lastVirtModeExtraState=&virtModeType=default&virtMode=0&nonVirtPosition=&position=chr19%3A55014736-55014753


I see a translation of my region under "multiz alignments of 100 vertebrates". I suppose the multiz alignment does not necessarily mean the region is translated? Should I stick to the CDS coordinates in the refGene.txt file?

0
Entering edit mode

For that particular transcript, you are in the UTR: https://genome.ucsc.edu/cgi-bin/hgTracks?hgS_doOtherUser=submit&hgS_otherUserName=chmalee&hgS_otherUserSessionName=hg38_utrGp6

The position you are interested in is the blue highlight. The reason you see translation in the Multiz 100 way track is because there is transcript that has a coding region there.

Also note that the codon translation you see in the 100-way alignment is according to the gene tracks as described in the "Base Level" section of the description page: https://genome.ucsc.edu/cgi-bin/hgTrackUi?db=hg38&c=chr19&g=cons100way#TRACK_HTML

And so you will see different codon translations where there are differences between refGene and the knownGene tables.

If you have further questions about UCSC data or tools feel free to send your question to one of the below mailing lists:

• General questions: genome@soe.ucsc.edu
• Questions involving private data: genome-www@soe.ucsc.edu
• Questions involving mirror sites: genome-mirror@ose.ucsc.edu

ChrisL from the UCSC Genome Browser

1
Entering edit mode
5.1 years ago
genecats.ucsc ▴ 580