Question

retrieving refseq/ucsc transcription start site with 1kb up- and downstream flank?

1

Entering edit mode

8.5 years ago

taniasultana2004 ▴ 10

Hi!

I need to plot some genomic locations in context of their distances from transcription start site, CpG islands and Dnase hypersensitive sites.

Question1) Can anyone please tell me how I can get ucsc/refseq transcription start site with 1kb up- and downstream flank from ucsc table browser? In the output, I see the following options, I am not clear which options to choose to get what I want. If I just retrieve base pairs upstream to genes, I will miss the alternate TSS.

Whole Gene
Upstream by bases
Exons plus bases at each end
Introns plus bases at each end
5' UTR Exons
Coding Exons
3' UTR Exons     
Downstream by bases

Question 2) How can I get lists of CpG islands and DNase hypersensitive sites of human genome?

Thanks a lot for your time and help!

ucsc transcription-start-site • 3.8k views

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.5 years ago by taniasultana2004 ▴ 10

2

Entering edit mode

8.5 years ago

Alex Reynolds 35k

You could get GRCh38 RefSeq entries from NCBI, convert them to BED via BEDOPS gff2bed, and filter them for genes via awk:

$ wget -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz \
    | gunzip -c - \
    | gff2bed - \
    | awk '$8=="gene"' - \
    > ref_GRCh38_top_level.genes.bed

The entries are stranded - the strand information is in the sixth column of the BED file - and so you could take the TSS from the start position of the stranded gene record with awk:

$ awk '{ \
     if ($6 == "+") { \
        print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10; \
     } \
     else { \
        print $1"\t"($3-1)"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10; \
     } \
  }' ref_GRCh38_top_level.genes.bed > ref_GRCh38_top_level.genes.TSS.bed

Once you have the TSSs, you can pad them and get flanked output with BEDOPS bedops --range:

$ bedops --range 1000 --everything ref_GRCh38_top_level.genes.TSS.bed > ref_GRCh38_top_level.genes.TSS.1kb_flank.bed

ADD COMMENT • link updated 20 months ago by Ram 43k • written 8.5 years ago by Alex Reynolds 35k

Ram · Accepted Answer · 2015-11-04

For DNaseI hypersensitive sites in hg19, there are tracks for various cell types on UCSC's site, with the prefix wgEncodeAwgDnaseUw*:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database

For example, for the SkMC cell line:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz

The description of columns in this file is available here:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.sql

From this description, you could build a BED file via the following call:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz \
    | gunzip -c - \
    | cut -f2- - \
    > wgEncodeAwgDnaseUwSkmcUniPk.bed

Once you have a BED file for your DNaseI dataset of interest, you can do BEDOPS set operations with this and other BED files that share the same chromosome naming scheme and genome build.