Question: Getting mRNA coding sequence for a UCSC accession
jacobsen.jeremy40 wrote:

I cannot figure out how to pull out the coding sequence from knownGeneMrna.

I know that the sequences in knownGeneMrna contain UTRs, so I am taking CDS_start - tx_start from knownGene to find where the CDS starts relative to the beginning of the knownGeneMrna sequence. The columns in knownGene are:

{accession},{chrom},{strand},{tx_start},{tx_end},{CDS_start},{CDS_end} etc.

The problem is that some transcripts are shorter than the offset!  For instance uc010nyq.2.  Why is this and what am I doing wrong?  I have found other related posts but none that address this point.

Thanks,
Jeremy

hg19 rna-seq ucsc • 1.5k views
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by jacobsen.jeremy40

"The problem is that some transcripts are shorter than the offset!" How did you find this?

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e 'select count(*) from knownGene where cdsStart<txStart'
+----------+
| count(*) |
+----------+
|        0 |
+----------+

" For instance uc010nyq.2"

 

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg38 -e 'select * from knownGene where name="uc010nyq.2"\G'
*************************** 1. row ***************************
      name: uc010nyq.2
     chrom: chr1
    strand: +
   txStart: 1616309
     txEnd: 1630610
  cdsStart: 1623410
    cdsEnd: 1630530
exonCount: 19
exonStarts: 1616309,1623388,1623773,1624794,1624990,1625285,1625545,1626649,1626836,1627073,1627295,1627672,1628018,1628272,1628488,1629132,1629384,1629638,1630291,
  exonEnds: 1616614,1623699,1623945,1624901,1625185,1625428,1625653,1626754,1626999,1627207,1627444,1627829,1628179,1628399,1628722,1629311,1629566,1629704,1630610,
proteinID: Q96AX9
   alignID: uc010nyq.2

Where is the problem?

 

 

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by Pierre Lindenbaum131k

The problem is that I'm misinterpreting something; probably to do with the contents of knownGeneMrna.

 

uc010nyq.2    chr1    +    1551689    1565990    1558790    1565910

1558790-1551689 = 7101

I take this to mean (erroneously I'm sure) that if I pull the sequence from knownGeneMrna then the coding sequence will begin 7101 bps from the start of the transcript.

The tx length for uc010nyq.2 from knownGeneMrna is 3317, so this cannot be the right assumption.

 

 

ADD REPLYlink modified 6.0 years ago • written 6.0 years ago by jacobsen.jeremy40

Thanks Pierre.  I could do it this way if I were pulling the coding sequences directly from the hg19 chromosome fasta files.  This seems silly though because they are already assembled in knownGeneMrna. The problem is that the UTRs are included in these sequences and I want to remove the UTRs so as to be left with the coding sequence.

Below is the unabridged sequence for uc010nyq.2 taken from knownGeneMrna.  The section in yellow is the coding sequence.  The UTR offset to the start of the coding sequence is 327 characters.  Where is this offset annotated or is there another file that only has the CDS?

 

cgcaattacgggcccggcgctggcggctcctgcgcgctcagaccccagggagcccatccgggcaggcggcggccctgagtgtcgcggccgtgggcccgagtggacctggagccggcgggcagccccgggggcagacaggcgaccgagccgcgggtcgaggtgctaactgtgcatcttggcatctcccctcggccacagggttggaagcccagcgaggctagaggccagtcccaaagtttccaggcatcagggctgcagcccaggagcctcaaggcggcccggcgggcgactggacggccggacaggtcccgagcagcccggcccaccatggacccctctgcccacaggtcccgagcagccccgcccaacatggacccagacccccaggcgggcgtgcaggtgggcatgcgggtggtgcgcggcgtggactggaagtggggccagcaggacggcggcgagggcggcgtgggcacggtggtggagcttggccgccacggcagcccctcgacacccgaccgcacagtggtcgtgcagtgggaccagggcacgcgcaccaactaccgcgccggctaccagggcgcgcacgacctgctgctgtacgacaacgcccagatcggcgtccggcaccccaacatcatctgtgactgctgcaagaagcacgggctgcgggggatgcgctggaagtgccgtgtgtgcctggactacgacctctgcacgcagtgctacatgcacaacaagcatgagctcgcccacgccttcgaccgctacgagaccgctcactcgcgccctgtcacactgagtccccgccagggcctcccgaggatcccactaaggggcatcttccagggagcgaaggtggtgcgaggccccgactgggagtggggctcacaggatggaggggaagggaaaccgggccgtgtggtggacatccgtggctgggatgtggagacaggccggagtgtggccagcgtgacgtgggctgatggtaccaccaatgtgtaccgtgtgggccacaagggcaaggtggacctcaagtgtgtgggcgaggcagcgggcggcttctactacaaggaccacctcccaaggctcggcaagccggcggagctgcagcgcagggtgagtgctgacagccagcccttccagcacggggacaaggtcaagtgtctgctggacactgatgtcctgcgggagatgcaggaaggccacggcggctggaaccccaggatggcggagtttatcggacagacgggcaccgtgcatcgtatcacggaccgcggggacgtgcgcgtgcagttcaaccacgagacgcgctggaccttccaccccggggcgctcaccaagcaccactccttctgggtgggcgacgtggtccgggtcatcggcgaccttgacacagtgaagcggctgcaggctgggcatggcgagtggacggacgacatggcccctgccctgggccgcgtcgggaaggtggtgaaagtgtttggagacgggaacctgcgtgtagcagtcgctggtcagcggtggaccttcagcccctcctgcctggtggcctaccggcccgaggaggatgccaacctggacgtggccgagcgcgcccgggagaacaaaagctcactgagcgtggccctggacaagcttcgggcccagaagagtgacccagagcacccgggaaggctggtggtggaggtggcgctgggtaacgcagcccgggctctggacctgctgcggaggcgcccagagcaggtggacaccaagaaccaaggcaggaccgctctgcaagtggctgcctacctgggccaggtggagttgatacggctgctgctacaagccagggcgggcgtggacctgccggacgacgagggcaacacggcactgcactacgcggccctggggaaccagcccgaggccaccagggtgctcctgagtgctgggtgccgggcggacgccatcaacagcacccagagcacagcactgcacgtggccgtgcagaggggcttcct
ggaggtggtgcgggccctgtgtgagcgcggctgtgacgtcaacctgcccgacgcccactcggacacgcccctgcactccgccatctcggcgggcactggagccagcggcattgtcgaggtcctcacggaggtgccaaacatcgatgttaccgccaccaacagccagggtttcaccctgctgcaccatgcctccctcaagggtcacgcgctagctgtgagaaagattctggctcgggcgcggcagctggtggacgccaagaaggaggacggcttcacggcgctgcatctggctgccctcaacaaccaccgcgaggtggcccagatcctcatccgggagggccgctgtgacgtgaacgtgcgcaaccggaagctgcagtccccgctgcatctcgccgtgcaacaggcccacgtggggctggtgccgctactggtggacgctgggtgcagtgtcaacgccgaggacgaggagggggacacagccctgcacgtggcgctgcagcgtcatcagctgctgcccctggtggctgatggggccgggggggacccagggcccttgcagctgctgtccaggctacaggcctcgggcctccccggcagcgcggagctgacggtgggcgcggcggtcgcctgcttcctggcgctggagggcgccgacgtgagctacaccaaccaccgcggtcggagcccgctggacctggccgccgagggtcgcgtgctcaaggcccttcagggctgcgcccagcgcttccgggagcggcaggcgggcgggggcgcggccccgggccccaggcaaacgctcgggacccccaacaccgtgacgaacctgcacgtgggcgccgcgccggggcccgaggccgctgagtgcctggtgtgctccgagctggcgctgctggtgctgttctcgccgtgccagcaccgcaccgtgtgtgaggagtgcgcgcgcaggatgaagaagtgcatcaggtgccaggtggtcgtcagcaagaaactgcgcccagacggctctgaggtggcgagcgccgcccccgcccccggcccgccgcgccagctggtggaggagctgcagagccgctaccggcagatggaggaacgcatcacctgccccatctgcatcgacagccacatccgcctcgtgttccagtgcggccacggcgcatgcgccccctgcggctccgcgctcagcgcctgccccatctgccgccagcccatccgcgaccgcatccagatcttcgtgtgagccgcgccgtccgccgcgcccgagctgccttcgcgtgcccccgccctgtgttttataaaaagaaagattctcggacgttg

ADD REPLYlink written 6.0 years ago by jacobsen.jeremy40

please don't post your comments as a new answer.

ADD REPLYlink written 6.0 years ago by Pierre Lindenbaum131k
Pierre Lindenbaum131k wrote:

You cannot just subtract txStart from cdsStart, because the two positions are located on two distinct exons and the genomic distance between them includes an intron that is not part of the mRNA:

 

exon1: 1616309-1616614

intron1: 1616614-1623388

exon2: 1623388-1623699

 

 txStart: 1616309
 cdsStart: 1623410
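
To make this concrete, here is a minimal Python sketch that counts only the exonic bases lying before cdsStart, so it works for any number of introns. The function names `parse_coords` and `cds_offset_in_mrna` are made up for illustration. It assumes the knownGeneMrna sequence is exactly the concatenation of the knownGene exons in transcript orientation (reverse-complemented on the minus strand), which may not hold for every record.

```python
def parse_coords(field):
    """Parse a UCSC comma-separated coordinate list such as '10,20,30,'."""
    return [int(x) for x in field.split(",") if x]


def cds_offset_in_mrna(strand, cds_start, cds_end, exon_starts, exon_ends):
    """Return (offset, length) of the CDS inside the spliced mRNA.

    Coordinates are 0-based, half-open genomic positions as stored in
    knownGene. The offset is 0-based from the 5' end of the transcript.
    Assumption: the mRNA is the plain concatenation of the exons.
    """
    exons = sorted(zip(exon_starts, exon_ends))
    # Exonic bases located before cdsStart on the genome (introns are skipped).
    before = sum(max(0, min(end, cds_start) - start) for start, end in exons)
    # Exonic bases inside [cdsStart, cdsEnd).
    cds_len = sum(
        max(0, min(end, cds_end) - max(start, cds_start)) for start, end in exons
    )
    if strand == "-":
        # The mRNA reads from txEnd backwards, so the 5' UTR is what follows
        # cdsEnd on the genome.
        total = sum(end - start for start, end in exons)
        before = total - before - cds_len
    return before, cds_len


# Values copied from the knownGene row for uc010nyq.2 shown in the thread.
exon_starts = parse_coords(
    "1616309,1623388,1623773,1624794,1624990,1625285,1625545,1626649,1626836,"
    "1627073,1627295,1627672,1628018,1628272,1628488,1629132,1629384,1629638,"
    "1630291,"
)
exon_ends = parse_coords(
    "1616614,1623699,1623945,1624901,1625185,1625428,1625653,1626754,1626999,"
    "1627207,1627444,1627829,1628179,1628399,1628722,1629311,1629566,1629704,"
    "1630610,"
)

offset, cds_len = cds_offset_in_mrna("+", 1623410, 1630530, exon_starts, exon_ends)
print(offset, cds_len)  # 327 2910
```

The offset is 305 bases from exon 1 plus 22 bases from the start of exon 2 up to cdsStart, and the coding sequence would then be `mrna[offset:offset + cds_len]`. For a non-coding transcript (cdsStart equal to cdsEnd) the returned length is 0.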
ADD COMMENTlink modified 6.0 years ago • written 6.0 years ago by Pierre Lindenbaum131k

Got it Pierre.  Thanks! 

cdsStart-txStart-intron1

 

ADD REPLYlink written 6.0 years ago by jacobsen.jeremy40