retrieving refseq/ucsc transcription start site with 1kb up- and downstream flank?
2
1
Entering edit mode
8.5 years ago

Hi!

I need to plot some genomic locations in context of their distances from transcription start site, CpG islands and Dnase hypersensitive sites.

Question1) Can anyone please tell me how I can get ucsc/refseq transcription start site with 1kb up- and downstream flank from ucsc table browser? In the output, I see the following options, I am not clear which options to choose to get what I want. If I just retrieve base pairs upstream to genes, I will miss the alternate TSS.

Whole Gene
Upstream by bases
Exons plus bases at each end
Introns plus bases at each end
5' UTR Exons
Coding Exons
3' UTR Exons     
Downstream by bases

Question 2) How can I get lists of CpG islands and DNase hypersensitive sites of human genome?

Thanks a lot for your time and help!

ucsc transcription-start-site • 3.8k views
ADD COMMENT
3
Entering edit mode
8.5 years ago

For DNaseI hypersensitive sites in hg19, there are tracks for various cell types on UCSC's site, with the prefix wgEncodeAwgDnaseUw*:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database

For example, for the SkMC cell line:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz

The description of columns in this file is available here:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.sql

From this description, you could build a BED file via the following call:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz \
    | gunzip -c - \
    | cut -f2- - \
    > wgEncodeAwgDnaseUwSkmcUniPk.bed

Once you have a BED file for your DNaseI dataset of interest, you can do BEDOPS set operations with this and other BED files that share the same chromosome naming scheme and genome build.

ADD COMMENT
0
Entering edit mode

Dear Alex Reynolds, thanks a lot for your help!

ADD REPLY
2
Entering edit mode
8.5 years ago

You could get GRCh38 RefSeq entries from NCBI, convert them to BED via BEDOPS gff2bed, and filter them for genes via awk:

$ wget -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz \
    | gunzip -c - \
    | gff2bed - \
    | awk '$8=="gene"' - \
    > ref_GRCh38_top_level.genes.bed

The entries are stranded - the strand information is in the sixth column of the BED file - and so you could take the TSS from the start position of the stranded gene record with awk:

$ awk '{ \
     if ($6 == "+") { \
        print $1"\t"$2"\t"($2+1)"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10; \
     } \
     else { \
        print $1"\t"($3-1)"\t"$3"\t"$4"\t"$5"\t"$6"\t"$7"\t"$8"\t"$9"\t"$10; \
     } \
  }' ref_GRCh38_top_level.genes.bed > ref_GRCh38_top_level.genes.TSS.bed

Once you have the TSSs, you can pad them and get flanked output with BEDOPS bedops --range:

$ bedops --range 1000 --everything ref_GRCh38_top_level.genes.TSS.bed > ref_GRCh38_top_level.genes.TSS.1kb_flank.bed
ADD COMMENT

Login before adding your answer.

Traffic: 3222 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6