Question: How To Get Gene Regions +3Kb Upstream
4
gravatar for Stephen
4.5 years ago by
Stephen2.5k
Charlottesville Virginia
Stephen2.5k wrote:

Simple question - I need to create a GTF file to use in HTSeq-count that contains gene regions plus 3kb upstream. (Background: doing a MeDIP-seq experiment, want to look for differential methylation in genic and 3kb promoter regions using count based method like edgeR/DESeq).

I was planning on making one myself from the UCSC hg19 refFlat table. The refFlat table has gene coordinates, but I need to extend this 3kb upstream to capture promoter regions.

Column 3 contains the strand (+/-) and columns 4-5 contain the transcription start (txStart) and end (txEnd) positions.

If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct? If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?

E.g. the DENND1B gene is on the - strand at chr1:197,473,879-197,744,623. However, looking at it in the browser, it's transcribed "right to left", so I presume I would want to add 3000 to the txEnd number, 197,744,623, even though this is really where transcription starts.

Am I thinking about this correctly?

gtf ucsc • 3.6k views
ADD COMMENTlink modified 13 months ago by Louis120 • written 4.5 years ago by Stephen2.5k
11
gravatar for Pierre Lindenbaum
4.5 years ago by
France/Nantes/Institut du Thorax - INSERM UMR1087
Pierre Lindenbaum91k wrote:

"If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?"

yes.

the position are always ordered from the 5' of the chromosome to the 3' and named xxStart and xxEnd (with xxStart<=xxEnd) whatever the value of "strand".

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select count(*) from refFlat where txStart>txEnd or cdsStart>cdsEnd'
+----------+
| count(*) |
+----------+
|        0 |
+----------+

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select txStart,txEnd from refFlat where strand="-" limit 4'
+-----------+-----------+
| txStart   | txEnd     |
+-----------+-----------+
|  62929370 |  62937380 |
|     34610 |     36081 |
| 153270337 | 153283194 |
| 113235158 | 114449242 |
+-----------+-----------+

So you'll only have to :

newStart=(start<3000?0:start-3000);
newEnd=end+3000;

EDIT: I was too fast. My pseudo-code extends the sequence on both side. If you want to extend upstream sequence :

newStart=(strand=='+'?start<3000?0:start-3000:start);
newEnd=(strand=='+'?end:end+3000);

to fetch the (positive strand) sequence; See

How to get the sequence of a genomic region from UCSC?

and

Extract sequence from the genome?

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Pierre Lindenbaum91k

@Pierre, what your example shows is that the answer to his question is "yes". Since txStart is always <= txEnd. You have to add 3000 to txEnd when strand == -.

ADD REPLYlink written 4.5 years ago by brentp21k
1

oppps!!! . fixed. Thank you http://xkcd.com/745/

enter image description here

ADD REPLYlink modified 4.5 years ago • written 4.5 years ago by Pierre Lindenbaum91k

a bit alarming that you got 4 up-votes with an incorrect response. also, your last code-block with newStart and newEnd should take strand into account.

ADD REPLYlink written 4.5 years ago by brentp21k
7

Lindenbaum's Cat

ADD REPLYlink written 4.5 years ago by Istvan Albert ♦♦ 70k
5

Yes, but, to be fair, Lindenbaum's cat writes java code in its spare time.

ADD REPLYlink written 4.5 years ago by brentp21k

I love solutions that involve mysql queries - by the way this is not unlike a paper reviewing process, does anyone actually redo the analysis to make sure that every single command was right or just look at it and say ok that looks pretty sweet stuff

ADD REPLYlink written 4.5 years ago by Istvan Albert ♦♦ 70k

No he was asking 3 questions (3 question marks). And on the first he said: "If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct?"

This is wrong.

For a + gene adding to the TSS goes into the gene, not upstream of the gene. This is only correct for - genes.

ADD REPLYlink written 4.5 years ago by Ido Tamir4.7k
2
gravatar for brentp
4.5 years ago by
brentp21k
Salt Lake City, UT
brentp21k wrote:

UCSC always stores the data such that start <= end. So, yes, the actual transcription start is

  • txStart if strand == "+"

and

  • txEnd if strand == '-'.
ADD COMMENTlink written 4.5 years ago by brentp21k
1
gravatar for Ido Tamir
4.5 years ago by
Ido Tamir4.7k
Austria
Ido Tamir4.7k wrote:

Not completely if I followed your reasoning.

   for upstream region only:
   dir    column4               column5 
   +      txstart - 3000        txstart  
   -      txend                 txend + 3000  

   for gene incl upstream region:
   dir    column4               column5 
   +      txstart - 3000        txend  
   -      tstart                txend + 3000

by "adding 3000 to txStart, but is only for genes on the + strand" you get 3kb into the gene. You have to subtract 3000 to get upstream of the transcription site for + genes.

Can one do tables with this editor? html?

ADD COMMENTlink modified 4.5 years ago • written 4.5 years ago by Ido Tamir4.7k
1
gravatar for Louis
13 months ago by
Louis120
United Kingdom
Louis120 wrote:

A solution in R, to answer the original question :

library('GenomicRanges')
library('TxDb.Hsapiens.UCSC.hg19.knownGene')

txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
genes <- as.data.frame(transcripts(txdb_hg19))
colnames(genes)[1:3] <- c("chrom","txStart","txEnd")
tss <- genes$txStart
n.str <- genes$strand == '-'
tss[n.str] <- genes$txEnd[n.str]
tss.us <- tss - 3000
tss.us[n.str] <- tss[n.str] + 3000
tes <- genes$txEnd
tes[n.str] <- genes$txStart[n.str]
genes$txStart <- tss
genes$txUs <- tss.us
genes$txEnd <- tes

# For querying against other genomic ranges make a GRanges object for the
# upstream regions (requires upstream and TSS coordinates in ascending order)

asc.coords <- transform(genes[c("txStart","txUs")],
                        min = ifelse(n.str, txStart, txUs),
                        max = ifelse(n.str, txUs, txStart))
min.coords <- asc.coords$min
max.coords <- asc.coords$max

tss.us.regions <- GRanges(seqnames = Rle(genes$chrom),
                          strand = Rle(genes$strand),
                          ranges = IRanges(min.coords, max.coords),
                          TxID = genes$tx_id) # [example metadata column]
ADD COMMENTlink modified 13 months ago • written 13 months ago by Louis120
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1519 users visited in the last hour