Gene symbols from CAGE peak coordinates
2
0
Entering edit mode
4.6 years ago
jiab ▴ 60

Hi,

The coordinates I have look like this:

• chr10:100206067:100206107:+
• chr10:101491968:101492076:+
• chr10:100195025:100195029:-
• chr10:100204220:100204230:-

To find the Gene Symbol, I use the UCSC Genomes Database. I use the following python script:

from cruzdb import Genome

hg19 = Genome("hg19")

# using data from the first example:
set([x.name2 for x in hg19.bin_query("refGene", "chr10", 100206067, 100206107)])


My questions are:

1. In the first example, the return contains two Gene Symbols:

{'HPS1', 'LOC101927278'}

What is the meaning of the second item in the list? Why do I get two items instead of just one?

2. In the second example, I also get two items:

{'COX15', 'CUTC'}

In this case, both items look like the type of result I'm interested in. How can I reconcile two Gene Symbols for the same row in the dataset? Why does this happen in the first place?

3. The third and fourth example both give the same result:

{'HPS1'}

If I want to get an idea about the expression level of this gene, is it sensible to aggregate (i.e sum) the entries?

4. I am not using the information about the strand in my query to refGene. Is this okay?

Thanks

annotation CAGE • 1.0k views
2
Entering edit mode
4.6 years ago

What is the meaning of the second item in the list? Why do I get two items instead of just one?

there are multiple transcript at this position, associated with different gene names.

Homo sapiens uncharacterized LOC101927278 (LOC101927278), transcript variant 4, long non-coding RNA

Homo sapiens HPS1, biogenesis of lysosomal organelles complex 3 subunit 1 (HPS1), transcript variant 7, mRNA

$mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<100206067 or txStart>100206107)' +-------+-----------+-----------+--------------+--------------+ | chrom | txStart | txEnd | name | name2 | +-------+-----------+-----------+--------------+--------------+ | chr10 | 100175954 | 100206720 | NM_001322477 | HPS1 | | chr10 | 100175954 | 100206720 | NM_001322483 | HPS1 | | chr10 | 100175954 | 100206720 | NM_001322485 | HPS1 | | chr10 | 100188902 | 100206720 | NM_001322492 | HPS1 | | chr10 | 100188902 | 100206720 | NM_001322491 | HPS1 | | chr10 | 100175954 | 100206720 | NM_001322489 | HPS1 | | chr10 | 100188902 | 100206720 | NM_182639 | HPS1 | | chr10 | 100206077 | 100213562 | NR_134454 | LOC101927278 | | chr10 | 100206077 | 100213562 | NR_134453 | LOC101927278 | | chr10 | 100206077 | 100213562 | NR_134452 | LOC101927278 | | chr10 | 100206077 | 100213562 | NR_134451 | LOC101927278 | | chr10 | 100175954 | 100206720 | NM_001322482 | HPS1 | | chr10 | 100175954 | 100206720 | NM_000195 | HPS1 | | chr10 | 100175954 | 100206720 | NM_001322487 | HPS1 | (...) +-------+-----------+-----------+--------------+--------------+  In the second example, I also get two items: again, more than one transcript... ~$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<101492076 or txStart>101491968)'
+-------+-----------+-----------+--------------+-------+
| chrom | txStart   | txEnd     | name         | name2 |
+-------+-----------+-----------+--------------+-------+
| chr10 | 101491957 | 101515894 | NM_015960    | CUTC  |
| chr10 | 101470624 | 101492423 | NM_001320975 | COX15 |
| chr10 | 101470624 | 101492423 | NM_001320976 | COX15 |
(....)
+-------+-----------+-----------+--------------+-------+


The third and fourth example both give the same result:

\$ mysql --user=genome --host=genome-mysql.soe.ucsc.edu -A -D hg19 -e 'select distinct chrom,txStart,txEnd,name,name2 from refGene where chrom="chr10" and not (txEnd<100195025 or txStart>100195029)'
+-------+-----------+-----------+--------------+-------+
| chrom | txStart   | txEnd     | name         | name2 |
+-------+-----------+-----------+--------------+-------+
| chr10 | 100175954 | 100206720 | NM_001322477 | HPS1  |
| chr10 | 100175954 | 100206720 | NM_001322483 | HPS1  |
(...)
| chr10 | 100175954 | 100206720 | NM_001322480 | HPS1  |
+-------+-----------+-----------+--------------+-------+


I am not using the information about the strand in my query to refGene. Is this okay?

how can we know ? we don't know what you're trying to do

0
Entering edit mode

Thanks, learned several new things from this. I'll add a couple more bits:

• LOC101927278 has cdsStart == cdsEnd (these are additional columns available in the refGene table), which by itself implies it's not coding for a protein
• CUTC and COX15 are actually on different strands (there is a strand column in refGene), although two transcript models on the same strand can also overlap
2
Entering edit mode
4.6 years ago
Charles Plessy ★ 2.7k

The coordinates I have look like this:

chr10:100206067:100206107:+
chr10:101491968:101492076:+


It looks like you are using the FANTOM5 CAGE peaks. Note that their annotation is available from the FANTOM5 website. (http://fantom.gsc.riken.jp/5/datafiles/latest/extra/CAGE_peaks/00_readme.txt). If you are interested in having the coordinates and the gene symbol in a single file, I have shared on GitHub a few commands to make such a file.

If I want to get an idea about the expression level of this gene, is it sensible to aggregate (i.e sum) the entries?

I recommend to use the average of the expression levels expressed in TPM (tags per million), to that the resulting numbers are on a more intuitive scale. Also, if you are interested in a few genes you can just browse their page on the FANTOM5 SSTAR (Semantic catalog of Samples, Transcription initiation And Regulators).

I am not using the information about the strand in my query to refGene. Is this okay?

The CAGE data is directional, therefore you should use the strand information.

0
Entering edit mode

Thanks for your answer. If I have a situation where multiple tags assigned to one gene model are expressed in the same sample, would you say it makes sense to sum their expression levels?

0
Entering edit mode

CAGE is designed to analyse expression levels of peaks of transcription start sites representing alternative promoters of genes. Thus, if you would reduce the data to gene level by summing all tags assigned to the same gene symbol, you would lose potentially valuable information. This said, I sometimes do this, for instance for the purpose of comparing the CAGE data with RNA-seq data, or when the CAGE data is sparse (like with a pilot sequencing on MiSeq).