Question

How to obtain TSS from list of given genes

1

Entering edit mode

8.4 years ago

tanni93 ▴ 50

Hi all,

I have a list of genes (see format below) and I would like to obtain the TSS for all 19 of these genes. How can I do this?

ENSMUSG00000029847
ENSMUSG00000085236
ENSMUSG00000063364
ENSMUSG00000085247
ENSMUSG00000072893
ENSMUSG00000018800
ENSMUSG00000020865
ENSMUSG00000023832
ENSMUSG00000045730
ENSMUSG00000007827
ENSMUSG00000021950
ENSMUSG00000071847
ENSMUSG00000025154
ENSMUSG00000047446
ENSMUSG00000026628
ENSMUSG00000029673

RNA-Seq r • 8.6k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.4 years ago by tanni93 ▴ 50

0

Entering edit mode

could you provide the format, please

ADD REPLY • link 8.4 years ago by russhh 5.7k

0

Entering edit mode

I am using this site: https://genome.ucsc.edu/cgi-bin/hgTables

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by tanni93 ▴ 50

Ram · Answer 1 · 2015-11-24

1

Entering edit mode

8.4 years ago

Giovanni M Dall'Olio 28k

There are many ways to do this. Apart from biomaRt, you can use the Mus musculus TxDb object:

> source("https://bioconductor.org/biocLite.R")
> biocLite('TxDb.Mmusculus.UCSC.mm10.ensGene')
> mygenes = c("ENSMUSG00000029847",
"ENSMUSG00000085236", "ENSMUSG00000063364","ENSMUSG00000085247",
"ENSMUSG00000072893","ENSMUSG00000018800","ENSMUSG00000020865",
"ENSMUSG00000023832","ENSMUSG00000045730","ENSMUSG00000007827",
"ENSMUSG00000021950","ENSMUSG00000071847","ENSMUSG00000025154",
"ENSMUSG00000047446","ENSMUSG00000026628","ENSMUSG00000029673")

> mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)
GRanges object with 62 ranges and 3 metadata columns:
       seqnames                 ranges strand   |     tx_id            tx_name            gene_id
          <Rle>              <IRanges>  <Rle>   | <integer>        <character>    <CharacterList>
   [1]     chr1 [191170296, 191183340]      -   |      4925 ENSMUST00000027941 ENSMUSG00000026628
   [2]     chr1 [191171425, 191183108]      -   |      4926 ENSMUST00000131854 ENSMUSG00000026628
   [3]     chr2 [135169573, 135215616]      -   |     13731 ENSMUST00000138303 ENSMUSG00000085247
   [4]     chr2 [150310935, 150362765]      -   |     13893 ENSMUST00000051153 ENSMUSG00000063364
   [5]     chr2 [150310937, 150362733]      -   |     13894 ENSMUST00000124945 ENSMUSG00000063364
   ...      ...                    ...    ... ...       ...                ...                ...
  [58]    chr18   [62177817, 62179959]      -   |     86501 ENSMUST00000053640 ENSMUSG00000045730
  [59]    chr19   [41766588, 41802047]      -   |     88750 ENSMUST00000026150 ENSMUSG00000025154
  [60]    chr19   [41766591, 41802084]      -   |     88751 ENSMUST00000163265 ENSMUSG00000025154
  [61]    chr19   [41769800, 41781336]      -   |     88752 ENSMUST00000176266 ENSMUSG00000025154
  [62]    chr19   [41769994, 41802047]      -   |     88753 ENSMUST00000177495 ENSMUSG00000025154

This will create a GenomicRanges object called mygenes.transcripts, from which you can access the coordinates of all transcripts of each gene. If you want to avoid complications and prefer to have just one coordinate per gene, use genes() instead of transcripts()

To get the TSS, just resize the object to one base:

> mygenes.tss = resize(mygenes.transcripts, width=1, fix='start')

ADD COMMENT • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

I ran into some errors and had to specify the package namespaces and load the GenomicFeatures library:

> source("https://bioconductor.org/biocLite.R")
...
> biocLite('TxDb.Mmusculus.UCSC.mm10.ensGene')
...
> library("GenomicFeatures")
...
> mygenes = c("ENSMUSG00000029847",
"ENSMUSG00000085236", "ENSMUSG00000063364","ENSMUSG00000085247",
"ENSMUSG00000072893","ENSMUSG00000018800","ENSMUSG00000020865",
"ENSMUSG00000023832","ENSMUSG00000045730","ENSMUSG00000007827",
"ENSMUSG00000021950","ENSMUSG00000071847","ENSMUSG00000025154",
"ENSMUSG00000047446","ENSMUSG00000026628","ENSMUSG00000029673")
> mygenes.transcripts = subset(GenomicFeatures::transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)
GRanges object with 62 ranges and 3 metadata columns:
       seqnames                 ranges strand   |     tx_id            tx_name
          <Rle>              <IRanges>  <Rle>   | <integer>        <character>
   [1]     chr1 [191170296, 191183340]      -   |      4925 ENSMUST00000027941
   [2]     chr1 [191171425, 191183108]      -   |      4926 ENSMUST00000131854
   [3]     chr2 [135169573, 135215616]      -   |     13731 ENSMUST00000138303
   [4]     chr2 [150310935, 150362765]      -   |     13893 ENSMUST00000051153
   [5]     chr2 [150310937, 150362733]      -   |     13894 ENSMUST00000124945
   ...      ...                    ...    ... ...       ...                ...
  [58]    chr18   [62177817, 62179959]      -   |     86501 ENSMUST00000053640
  [59]    chr19   [41766588, 41802047]      -   |     88750 ENSMUST00000026150
  [60]    chr19   [41766591, 41802084]      -   |     88751 ENSMUST00000163265
  [61]    chr19   [41769800, 41781336]      -   |     88752 ENSMUST00000176266
  [62]    chr19   [41769994, 41802047]      -   |     88753 ENSMUST00000177495
                  gene_id
          <CharacterList>
   [1] ENSMUSG00000026628
   [2] ENSMUSG00000026628
   [3] ENSMUSG00000085247
   [4] ENSMUSG00000063364
   [5] ENSMUSG00000063364
   ...                ...
  [58] ENSMUSG00000045730
  [59] ENSMUSG00000025154
  [60] ENSMUSG00000025154
  [61] ENSMUSG00000025154
  [62] ENSMUSG00000025154
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

It looks like the fix setting in resize uses the correct start position for the interval's strand. Note that these resized coordinates are 1-based, if the OP plans to ultimately export BED data.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Alex Reynolds 35k

0

Entering edit mode

Thank you for the correction. Indeed many Bioconductor libraries have problems of namespaces. In particular loading dplyr after AnnotationDbi causes a lot of naming conflicts, e.g. on the select function.

ADD REPLY • link 8.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

Hi Giovanni,

Thank you for the help!

A few questions: (I am relevatively new to R).

for mygenes.transcripts, I had copy and pasted

mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)

but error keeps saying

Error in subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'subset': Error in transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'transcripts': Error: object 'TxDb.Mmusculus.UCSC.mm10.ensGene' not found

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by tanni93 ▴ 50

0

Entering edit mode

Hi Giovanni,

Thank you for the help!

A few questions: (I am relevatively new to R).

for mygenes.transcripts, I had copy and pasted

mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)

but error keeps saying

Error in subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'subset': Error in transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'transcripts': Error: object 'TxDb.Mmusculus.UCSC.mm10.ensGene' not found

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by tanni93 ▴ 50

0

Entering edit mode

Also, on a separate note: how do I format mygenes if I have about 150 genes? Would I have to manually type in these ensemble IDs or is there a shortcut?

Thank you in advance,
Tanni

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by tanni93 ▴ 50

0

Entering edit mode

Save your ensembl IDs in a file, one per line, then read it with read.csv('filename', header=FALSE)

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

You need to install the Mus musculus TxDb package in your system. The easiest way is to run biocLite('Mus.musculus') which will also install other mouse-related libraries.

ADD REPLY • link updated 4.4 years ago by Ram 43k • written 8.4 years ago by Giovanni M Dall'Olio 28k

0

Entering edit mode

When doing this, aren't you ignoring the strand? If the strand is '-', we should take the end as the start, right?

ADD REPLY • link 7.0 years ago by Roman Hillje ▴ 90