Question: Gene symbols from CAGE peak coordinates
0
gravatar for jiab
13 months ago by
jiab20
jiab20 wrote:

Hi,

The coordinates I have look like this:

  • chr10:100206067:100206107:+
  • chr10:101491968:101492076:+
  • chr10:100195025:100195029:-
  • chr10:100204220:100204230:-

To find the Gene Symbol, I use the UCSC Genomes Database. I use the following python script:

from cruzdb import Genome

hg19 = Genome("hg19")

# using data from the first example:
set([x.name2 for x in hg19.bin_query("refGene", "chr10", 100206067, 100206107)])

My questions are:

  1. In the first example, the return contains two Gene Symbols:

    {'HPS1', 'LOC101927278'}

    What is the meaning of the second item in the list? Why do I get two items instead of just one?

  2. In the second example, I also get two items:

    {'COX15', 'CUTC'}

    In this case, both items look like the type of result I'm interested in. How can I reconcile two Gene Symbols for the same row in the dataset? Why does this happen in the first place?

  3. The third and fourth example both give the same result:

    {'HPS1'}

    If I want to get an idea about the expression level of this gene, is it sensible to aggregate (i.e sum) the entries?

  4. I am not using the information about the strand in my query to refGene. Is this okay?

Thanks

cage annotation • 439 views
ADD COMMENTlink modified 13 months ago by Charles Plessy2.5k • written 13 months ago by jiab20
2
gravatar for Pierre Lindenbaum
13 months ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum108k wrote:

What is the meaning of the second item in the list? Why do I get two items instead of just one?

there are multiple transcript at this position, associated with different gene names.

Homo sapiens uncharacterized LOC101927278 (LOC101927278), transcript variant 4, long non-coding RNA

Homo sapiens HPS1, biogenesis of lysosomal organelles complex 3 subunit 1 (HPS1), transcript variant 7, mRNA

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<100206067 or txStart>100206107)'
+-------+-----------+-----------+--------------+--------------+
| chrom | txStart   | txEnd     | name         | name2        |
+-------+-----------+-----------+--------------+--------------+
| chr10 | 100175954 | 100206720 | NM_001322477 | HPS1         |
| chr10 | 100175954 | 100206720 | NM_001322483 | HPS1         |
| chr10 | 100175954 | 100206720 | NM_001322485 | HPS1         |
| chr10 | 100188902 | 100206720 | NM_001322492 | HPS1         |
| chr10 | 100188902 | 100206720 | NM_001322491 | HPS1         |
| chr10 | 100175954 | 100206720 | NM_001322489 | HPS1         |
| chr10 | 100188902 | 100206720 | NM_182639    | HPS1         |
| chr10 | 100206077 | 100213562 | NR_134454    | LOC101927278 |
| chr10 | 100206077 | 100213562 | NR_134453    | LOC101927278 |
| chr10 | 100206077 | 100213562 | NR_134452    | LOC101927278 |
| chr10 | 100206077 | 100213562 | NR_134451    | LOC101927278 |
| chr10 | 100175954 | 100206720 | NM_001322482 | HPS1         |
| chr10 | 100175954 | 100206720 | NM_000195    | HPS1         |
| chr10 | 100175954 | 100206720 | NM_001322487 | HPS1         |
(...)
+-------+-----------+-----------+--------------+--------------+

In the second example, I also get two items:

again, more than one transcript...

~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<101492076 or txStart>101491968)'
+-------+-----------+-----------+--------------+-------+
| chrom | txStart   | txEnd     | name         | name2 |
+-------+-----------+-----------+--------------+-------+
| chr10 | 101491957 | 101515894 | NM_015960    | CUTC  |
| chr10 | 101470624 | 101492423 | NM_001320975 | COX15 |
| chr10 | 101470624 | 101492423 | NM_001320976 | COX15 |
(....)
+-------+-----------+-----------+--------------+-------+

The third and fourth example both give the same result:

$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<100195025 or txStart>100195029)'
+-------+-----------+-----------+--------------+-------+
| chrom | txStart   | txEnd     | name         | name2 |
+-------+-----------+-----------+--------------+-------+
| chr10 | 100175954 | 100206720 | NM_001322477 | HPS1  |
| chr10 | 100175954 | 100206720 | NM_001322483 | HPS1  |
(...)
| chr10 | 100175954 | 100206720 | NM_001322480 | HPS1  |
+-------+-----------+-----------+--------------+-------+

I am not using the information about the strand in my query to refGene. Is this okay?

how can we know ? we don't know what you're trying to do

ADD COMMENTlink written 13 months ago by Pierre Lindenbaum108k

Thanks, learned several new things from this. I'll add a couple more bits:

  • LOC101927278 has cdsStart == cdsEnd (these are additional columns available in the refGene table), which by itself implies it's not coding for a protein
  • CUTC and COX15 are actually on different strands (there is a strand column in refGene), although two transcript models on the same strand can also overlap
ADD REPLYlink written 13 months ago by jiab20
2
gravatar for Charles Plessy
13 months ago by
Charles Plessy2.5k
Japan
Charles Plessy2.5k wrote:

The coordinates I have look like this:

chr10:100206067:100206107:+
chr10:101491968:101492076:+
  

It looks like you are using the FANTOM5 CAGE peaks. Note that their annotation is available from the FANTOM5 website. (http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/00_readme.txt). If you are interested in having the coordinates and the gene symbol in a single file, I have shared on GitHub a few commands to make such a file.

If I want to get an idea about the expression level of this gene, is it sensible to aggregate (i.e sum) the entries?

I recommend to use the average of the expression levels expressed in TPM (tags per million), to that the resulting numbers are on a more intuitive scale. Also, if you are interested in a few genes you can just browse their page on the FANTOM5 SSTAR (Semantic catalog of Samples, Transcription initiation And Regulators).

I am not using the information about the strand in my query to refGene. Is this okay?

The CAGE data is directional, therefore you should use the strand information.

ADD COMMENTlink written 13 months ago by Charles Plessy2.5k

Thanks for your answer. If I have a situation where multiple tags assigned to one gene model are expressed in the same sample, would you say it makes sense to sum their expression levels?

ADD REPLYlink written 13 months ago by jiab20

CAGE is designed to analyse expression levels of peaks of transcription start sites representing alternative promoters of genes. Thus, if you would reduce the data to gene level by summing all tags assigned to the same gene symbol, you would lose potentially valuable information. This said, I sometimes do this, for instance for the purpose of comparing the CAGE data with RNA-seq data, or when the CAGE data is sparse (like with a pilot sequencing on MiSeq).

ADD REPLYlink written 13 months ago by Charles Plessy2.5k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1052 users visited in the last hour