How To Get Gene Regions +3Kb Upstream
4
5
Entering edit mode
11.8 years ago
Stephen 2.8k

Simple question - I need to create a GTF file to use in HTSeq-count that contains gene regions plus 3kb upstream. (Background: doing a MeDIP-seq experiment, want to look for differential methylation in genic and 3kb promoter regions using count based method like edgeR/DESeq).

I was planning on making one myself from the UCSC hg19 refFlat table. The refFlat table has gene coordinates, but I need to extend this 3kb upstream to capture promoter regions.

Column 3 contains the strand (+/-) and columns 4-5 contain the transcription start (txStart) and end (txEnd) positions.

If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct? If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?

E.g. the DENND1B gene is on the - strand at chr1:197,473,879-197,744,623. However, looking at it in the browser, it's transcribed "right to left", so I presume I would want to add 3000 to the txEnd number, 197,744,623, even though this is really where transcription starts.

Am I thinking about this correctly?

ucsc gtf • 10.0k views
ADD COMMENT
12
Entering edit mode
11.8 years ago

"If I want 3kb upstream of the TSS for genes on the - strand, should I add this 3000 to txEnd?"

yes.

the position are always ordered from the 5' of the chromosome to the 3' and named xxStart and xxEnd (with xxStart<=xxEnd) whatever the value of "strand".

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select count(*) from refFlat where txStart>txEnd or cdsStart>cdsEnd'
+----------+
| count(*) |
+----------+
|        0 |
+----------+

$ mysql --user=genome --host=genome-mysql.cse.ucsc.edu -A -D hg19 -e 'select txStart,txEnd from refFlat where strand="-" limit 4'
+-----------+-----------+
| txStart   | txEnd     |
+-----------+-----------+
|  62929370 |  62937380 |
|     34610 |     36081 |
| 153270337 | 153283194 |
| 113235158 | 114449242 |
+-----------+-----------+

So you'll only have to :

newStart=(start<3000?0:start-3000);
newEnd=end+3000;

EDIT: I was too fast. My pseudo-code extends the sequence on both side. If you want to extend upstream sequence :

newStart=(strand=='+'?start<3000?0:start-3000:start);
newEnd=(strand=='+'?end:end+3000);

to fetch the (positive strand) sequence; See

How to get the sequence of a genomic region from UCSC?

and

Extract sequence from the genome?

ADD COMMENT
1
Entering edit mode

@Pierre, what your example shows is that the answer to his question is "yes". Since txStart is always <= txEnd. You have to add 3000 to txEnd when strand == -.

ADD REPLY
1
Entering edit mode

oppps!!! . fixed. Thank you http://xkcd.com/745/

ADD REPLY
0
Entering edit mode

a bit alarming that you got 4 up-votes with an incorrect response. also, your last code-block with newStart and newEnd should take strand into account.

ADD REPLY
10
Entering edit mode

Lindenbaum's Cat

ADD REPLY
6
Entering edit mode

Yes, but, to be fair, Lindenbaum's cat writes java code in its spare time.

ADD REPLY
0
Entering edit mode

I love solutions that involve mysql queries - by the way this is not unlike a paper reviewing process, does anyone actually redo the analysis to make sure that every single command was right or just look at it and say ok that looks pretty sweet stuff

ADD REPLY
0
Entering edit mode

No he was asking 3 questions (3 question marks). And on the first he said: "If I want to capture 3kb upstream of the TSS, I was planning on adding 3000 to txStart, but is only for genes on the + strand, correct?"

This is wrong.

For a + gene adding to the TSS goes into the gene, not upstream of the gene. This is only correct for - genes.

ADD REPLY
2
Entering edit mode
11.8 years ago
Ido Tamir 5.2k

Not completely if I followed your reasoning.

for upstream region only:
dir    column4               column5 
+      txstart - 3000        txstart  
-      txend                 txend + 3000  

for gene incl upstream region:
dir    column4               column5 
+      txstart - 3000        txend  
-      tstart                txend + 3000

by "adding 3000 to txStart, but is only for genes on the + strand" you get 3kb into the gene.

You have to subtract 3000 to get upstream of the transcription site for + genes.

Can one do tables with this editor? html?

ADD COMMENT
2
Entering edit mode
11.8 years ago
brentp 24k

UCSC always stores the data such that start <= end. So, yes, the actual transcription start is

  • txStart if strand == "+"

and

  • txEnd if strand == '-'.
ADD COMMENT
2
Entering edit mode
8.4 years ago
Louis ▴ 160

A solution in R, to answer the original question :

library('GenomicRanges')
library('TxDb.Hsapiens.UCSC.hg19.knownGene')

txdb_hg19 <- TxDb.Hsapiens.UCSC.hg19.knownGene
genes <- as.data.frame(transcripts(txdb_hg19))
colnames(genes)[1:3] <- c("chrom","txStart","txEnd")
tss <- genes$txStart
n.str <- genes$strand == '-'
tss[n.str] <- genes$txEnd[n.str]
tss.us <- tss - 3000
tss.us[n.str] <- tss[n.str] + 3000
tes <- genes$txEnd
tes[n.str] <- genes$txStart[n.str]
genes$txStart <- tss
genes$txUs <- tss.us
genes$txEnd <- tes

# For querying against other genomic ranges make a GRanges object for the
# upstream regions (requires upstream and TSS coordinates in ascending order)

asc.coords <- transform(genes[c("txStart","txUs")],
            min = ifelse(n.str, txStart, txUs),
            max = ifelse(n.str, txUs, txStart))
min.coords <- asc.coords$min
max.coords <- asc.coords$max

tss.us.regions <- GRanges(seqnames = Rle(genes$chrom),
                 strand = Rle(genes$strand),
                 ranges = IRanges(min.coords, max.coords),
                 TxID = genes$tx_id) # [example metadata column]
ADD COMMENT

Login before adding your answer.

Traffic: 3408 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6