How to obtain TSS from list of given genes
1
1
Entering edit mode
8.4 years ago
tanni93 ▴ 50

Hi all,

I have a list of genes (see format below) and I would like to obtain the TSS for all 19 of these genes. How can I do this?

ENSMUSG00000029847
ENSMUSG00000085236
ENSMUSG00000063364
ENSMUSG00000085247
ENSMUSG00000072893
ENSMUSG00000018800
ENSMUSG00000020865
ENSMUSG00000023832
ENSMUSG00000045730
ENSMUSG00000007827
ENSMUSG00000021950
ENSMUSG00000071847
ENSMUSG00000025154
ENSMUSG00000047446
ENSMUSG00000026628
ENSMUSG00000029673
RNA-Seq r • 8.6k views
ADD COMMENT
0
Entering edit mode

could you provide the format, please

ADD REPLY
0
Entering edit mode
ADD REPLY
1
Entering edit mode
8.4 years ago

There are many ways to do this. Apart from biomaRt, you can use the Mus musculus TxDb object:

> source("https://bioconductor.org/biocLite.R")
> biocLite('TxDb.Mmusculus.UCSC.mm10.ensGene')
> mygenes = c("ENSMUSG00000029847",
"ENSMUSG00000085236", "ENSMUSG00000063364","ENSMUSG00000085247",
"ENSMUSG00000072893","ENSMUSG00000018800","ENSMUSG00000020865",
"ENSMUSG00000023832","ENSMUSG00000045730","ENSMUSG00000007827",
"ENSMUSG00000021950","ENSMUSG00000071847","ENSMUSG00000025154",
"ENSMUSG00000047446","ENSMUSG00000026628","ENSMUSG00000029673")

> mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)
GRanges object with 62 ranges and 3 metadata columns:
       seqnames                 ranges strand   |     tx_id            tx_name            gene_id
          <Rle>              <IRanges>  <Rle>   | <integer>        <character>    <CharacterList>
   [1]     chr1 [191170296, 191183340]      -   |      4925 ENSMUST00000027941 ENSMUSG00000026628
   [2]     chr1 [191171425, 191183108]      -   |      4926 ENSMUST00000131854 ENSMUSG00000026628
   [3]     chr2 [135169573, 135215616]      -   |     13731 ENSMUST00000138303 ENSMUSG00000085247
   [4]     chr2 [150310935, 150362765]      -   |     13893 ENSMUST00000051153 ENSMUSG00000063364
   [5]     chr2 [150310937, 150362733]      -   |     13894 ENSMUST00000124945 ENSMUSG00000063364
   ...      ...                    ...    ... ...       ...                ...                ...
  [58]    chr18   [62177817, 62179959]      -   |     86501 ENSMUST00000053640 ENSMUSG00000045730
  [59]    chr19   [41766588, 41802047]      -   |     88750 ENSMUST00000026150 ENSMUSG00000025154
  [60]    chr19   [41766591, 41802084]      -   |     88751 ENSMUST00000163265 ENSMUSG00000025154
  [61]    chr19   [41769800, 41781336]      -   |     88752 ENSMUST00000176266 ENSMUSG00000025154
  [62]    chr19   [41769994, 41802047]      -   |     88753 ENSMUST00000177495 ENSMUSG00000025154

This will create a GenomicRanges object called mygenes.transcripts, from which you can access the coordinates of all transcripts of each gene. If you want to avoid complications and prefer to have just one coordinate per gene, use genes() instead of transcripts()

To get the TSS, just resize the object to one base:

> mygenes.tss = resize(mygenes.transcripts, width=1, fix='start')
ADD COMMENT
0
Entering edit mode

I ran into some errors and had to specify the package namespaces and load the GenomicFeatures library:

> source("https://bioconductor.org/biocLite.R")
...
> biocLite('TxDb.Mmusculus.UCSC.mm10.ensGene')
...
> library("GenomicFeatures")
...
> mygenes = c("ENSMUSG00000029847",
"ENSMUSG00000085236", "ENSMUSG00000063364","ENSMUSG00000085247",
"ENSMUSG00000072893","ENSMUSG00000018800","ENSMUSG00000020865",
"ENSMUSG00000023832","ENSMUSG00000045730","ENSMUSG00000007827",
"ENSMUSG00000021950","ENSMUSG00000071847","ENSMUSG00000025154",
"ENSMUSG00000047446","ENSMUSG00000026628","ENSMUSG00000029673")
> mygenes.transcripts = subset(GenomicFeatures::transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene::TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)
GRanges object with 62 ranges and 3 metadata columns:
       seqnames                 ranges strand   |     tx_id            tx_name
          <Rle>              <IRanges>  <Rle>   | <integer>        <character>
   [1]     chr1 [191170296, 191183340]      -   |      4925 ENSMUST00000027941
   [2]     chr1 [191171425, 191183108]      -   |      4926 ENSMUST00000131854
   [3]     chr2 [135169573, 135215616]      -   |     13731 ENSMUST00000138303
   [4]     chr2 [150310935, 150362765]      -   |     13893 ENSMUST00000051153
   [5]     chr2 [150310937, 150362733]      -   |     13894 ENSMUST00000124945
   ...      ...                    ...    ... ...       ...                ...
  [58]    chr18   [62177817, 62179959]      -   |     86501 ENSMUST00000053640
  [59]    chr19   [41766588, 41802047]      -   |     88750 ENSMUST00000026150
  [60]    chr19   [41766591, 41802084]      -   |     88751 ENSMUST00000163265
  [61]    chr19   [41769800, 41781336]      -   |     88752 ENSMUST00000176266
  [62]    chr19   [41769994, 41802047]      -   |     88753 ENSMUST00000177495
                  gene_id
          <CharacterList>
   [1] ENSMUSG00000026628
   [2] ENSMUSG00000026628
   [3] ENSMUSG00000085247
   [4] ENSMUSG00000063364
   [5] ENSMUSG00000063364
   ...                ...
  [58] ENSMUSG00000045730
  [59] ENSMUSG00000025154
  [60] ENSMUSG00000025154
  [61] ENSMUSG00000025154
  [62] ENSMUSG00000025154
  -------
  seqinfo: 66 sequences (1 circular) from mm10 genome

It looks like the fix setting in resize uses the correct start position for the interval's strand. Note that these resized coordinates are 1-based, if the OP plans to ultimately export BED data.

ADD REPLY
0
Entering edit mode

Thank you for the correction. Indeed many Bioconductor libraries have problems of namespaces. In particular loading dplyr after AnnotationDbi causes a lot of naming conflicts, e.g. on the select function.

ADD REPLY
0
Entering edit mode

Hi Giovanni,

Thank you for the help!

A few questions: (I am relevatively new to R).

for mygenes.transcripts, I had copy and pasted

mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)

but error keeps saying

Error in subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'subset': Error in transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'transcripts': Error: object 'TxDb.Mmusculus.UCSC.mm10.ensGene' not found
ADD REPLY
0
Entering edit mode

Hi Giovanni,

Thank you for the help!

A few questions: (I am relevatively new to R).

for mygenes.transcripts, I had copy and pasted

mygenes.transcripts = subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns=c("tx_id", "tx_name","gene_id")), gene_id %in% mygenes)

but error keeps saying

Error in subset(transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'subset': Error in transcripts(TxDb.Mmusculus.UCSC.mm10.ensGene, columns = c("tx_id",  : 
  error in evaluating the argument 'x' in selecting a method for function 'transcripts': Error: object 'TxDb.Mmusculus.UCSC.mm10.ensGene' not found
ADD REPLY
0
Entering edit mode

Also, on a separate note: how do I format mygenes if I have about 150 genes? Would I have to manually type in these ensemble IDs or is there a shortcut?

Thank you in advance,
Tanni

ADD REPLY
0
Entering edit mode

Save your ensembl IDs in a file, one per line, then read it with read.csv('filename', header=FALSE)

ADD REPLY
0
Entering edit mode

You need to install the Mus musculus TxDb package in your system. The easiest way is to run biocLite('Mus.musculus') which will also install other mouse-related libraries.

ADD REPLY
0
Entering edit mode

When doing this, aren't you ignoring the strand? If the strand is '-', we should take the end as the start, right?

ADD REPLY

Login before adding your answer.

Traffic: 1517 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6