Question: Retrieving RefSeq/UCSC transcription start sites with 1 kb upstream and downstream flanks?
taniasultana2004 wrote (4.1 years ago, France):

Hi! 

I need to plot some genomic locations in the context of their distances from transcription start sites (TSSs), CpG islands and DNase hypersensitive sites.

Question 1) Can anyone please tell me how to get UCSC/RefSeq transcription start sites with 1 kb upstream and downstream flanks from the UCSC Table Browser? In the output, I see the following options, and it is not clear to me which ones to choose to get what I want. If I just retrieve the bases upstream of genes, I will miss the alternate TSSs.

"

 Whole Gene  
 Upstream by  bases
 Exons plus  bases at each end
 Introns plus  bases at each end
 5' UTR Exons  
 Coding Exons  
 3' UTR Exons  
 Downstream by  bases

"

Question 2) How can I get lists of CpG islands and DNase hypersensitive sites for the human genome?

Thanks a lot for your time and help! 

Alex Reynolds wrote (4.1 years ago, Seattle, WA USA):

For DNaseI hypersensitive sites in hg19, UCSC's download site has tables for various cell types, with names that start with the prefix "wgEncodeAwgDnaseUw":

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database

For example, for the SkMC cell line:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz

The description of columns in this file is available here:

http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.sql

From this description, the first column (bin) is an indexing field rather than part of the interval, so you could build a BED file by dropping it via the following call:

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/wgEncodeAwgDnaseUwSkmcUniPk.txt.gz \
    | gunzip -c - \
    | cut -f2- - \
    > wgEncodeAwgDnaseUwSkmcUniPk.bed

Once you have a BED file for your DNaseI dataset of interest, you can do BEDOPS set operations with this and other BED files that share the same chromosome naming scheme and genome build.
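
For Question 2, the same column-dropping idea may carry over to CpG islands. The sketch below is only an illustration: the helper name ucsc_table_to_bed3 is made up, the toy row is invented, and the file name cpgIslandExt.txt.gz (with a leading bin column) is an assumption that should be checked against the matching .sql description before use.

```shell
# Hypothetical helper: skip the leading "bin" column of a UCSC table dump
# and keep only chrom, start and end (a three-column BED).
ucsc_table_to_bed3() {
    cut -f2-4
}

# Made-up row laid out like a table dump with a leading bin column
toy_bed=$(printf '585\tchr1\t28735\t29810\tCpG: 116\t1075\n' | ucsc_table_to_bed3)
printf '%s\n' "$toy_bed"

# Against the real download (assumed file name, not run here):
# wget -qO- http://hgdownload.cse.ucsc.edu/goldenpath/hg19/database/cpgIslandExt.txt.gz \
#     | gunzip -c - | ucsc_table_to_bed3 > cpgIslandExt.bed
```

Keeping only three columns avoids carrying a name field that contains spaces; widen the cut range if you need the extra fields.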

Reply from taniasultana2004 (4.1 years ago):

Dear Alex Reynolds, thanks a lot for your help!

Alex Reynolds wrote (4.1 years ago, Seattle, WA USA):

You could get GRCh38 RefSeq entries from NCBI, convert them to BED via BEDOPS gff2bed, and filter them for genes via awk:

$ wget -qO- ftp://ftp.ncbi.nlm.nih.gov/genomes/H_sapiens/ARCHIVE/ANNOTATION_RELEASE.106/GFF/ref_GRCh38_top_level.gff3.gz \
    | gunzip -c - \
    | gff2bed - \
    | awk '$8=="gene"' - \
    > ref_GRCh38_top_level.genes.bed

The entries are stranded - the strand information is in the sixth column of the BED file - and so you could take the TSS from the start position of the stranded gene record with awk:

$ awk 'BEGIN { FS = OFS = "\t" } {
     # split on tabs only, so attribute text containing spaces stays in one column
     if ($6 == "+") { $3 = $2 + 1 }   # plus strand: TSS is the first base of the gene
     else           { $2 = $3 - 1 }   # otherwise: TSS is the last base of the gene
     print
  }' ref_GRCh38_top_level.genes.bed > ref_GRCh38_top_level.genes.TSS.bed
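
To sanity-check the strand logic before running it on the full annotation, it can help to try it on a couple of toy records. This is a minimal sketch: the helper name tss_from_genes and the two gene records are made up, and only a subset of columns is printed to keep the output short.

```shell
# Hypothetical helper: reduce stranded gene records (gff2bed column order)
# to 1-bp TSS records, printing chrom, start, end, id and strand.
tss_from_genes() {
    awk 'BEGIN { FS = OFS = "\t" } {
        if ($6 == "+") { $3 = $2 + 1 }   # plus strand: TSS is the first base
        else           { $2 = $3 - 1 }   # any other strand value: TSS is the last base
        print $1, $2, $3, $4, $6
    }'
}

# Two made-up records (coordinates and names are illustrative only)
toy_tss=$(
    {
        printf 'chr1\t1000\t5000\tgeneA\t.\t+\ttoy\tgene\t.\tID=geneA\n'
        printf 'chr1\t7000\t9000\tgeneB\t.\t-\ttoy\tgene\t.\tID=geneB\n'
    } | tss_from_genes
)
printf '%s\n' "$toy_tss"
```

The plus-strand gene should come out as 1000-1001 and the minus-strand gene as 8999-9000.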

Once you have the TSSs, you can pad them and get flanked output with BEDOPS bedops --range:

$ bedops --range 1000 --everything ref_GRCh38_top_level.genes.TSS.bed > ref_GRCh38_top_level.genes.TSS.1kb_flank.bed
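
If BEDOPS is not at hand, the padding step can be approximated with awk. This is a rough stand-in rather than a replacement for bedops: the helper name pad_bed and the toy intervals are made up, clamping the start at 0 is a choice made for this sketch, and nothing here checks the padded end against chromosome lengths.

```shell
# Hypothetical helper: pad each interval by N bases on both sides,
# clamping the start coordinate at 0. Overlapping windows are not merged.
pad_bed() {
    awk -v pad="$1" 'BEGIN { FS = OFS = "\t" } {
        $2 = ($2 > pad) ? $2 - pad : 0
        $3 = $3 + pad
        print
    }'
}

# Two made-up 1-bp TSS records; the second sits within 1 kb of the chromosome start
toy_flank=$(printf 'chr1\t8999\t9000\tgeneB\nchr1\t200\t201\tgeneC\n' | pad_bed 1000)
printf '%s\n' "$toy_flank"
```

The first record becomes 7999-10000 and the second is clamped to 0-1201.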