Question

TSS file for D. melanogaster

0

9.8 years ago

catherine ▴ 250

I have ChIP-seq data, and I want to exclude the regions near TSSs. Can anyone tell me how to get a TSS file? I went to UCSC but didn't find one.

Thanks a lot in advance for any advice.

tss drosophila • 4.4k views

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by catherine ▴ 250

1

7.8 years ago

spencernystrom ▴ 10

For anyone who might still find this: the proposed solutions that use an SQL query at UCSC will not give you an accurate number of TSSs. UCSC's annotated TSS data has only about 6100 TSSs, which is far fewer than the number of known TSSs. I haven't found a more complete solution, but I'll update when I do.

ADD COMMENT • link 7.8 years ago by spencernystrom ▴ 10

0

9.8 years ago

Alex Reynolds 35k

You can run a MySQL query against the UCSC Genome Browser database to output a sorted six-column BED file containing unique RefSeq records:

$ mysql -h genome-mysql.cse.ucsc.edu -u genome -D dm3 -N -A -e 'select chrom, txStart, txEnd, name2, score, strand from refGene' \
    | sort-bed - \
    | awk '!elements[$0]++' - \
    > refseq_tss.bed

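Note that each record from this query spans the whole transcript (txStart to txEnd), so reduce it to its TSS first: txStart for + strand records and txEnd for - strand records. Once you have both the RefSeq TSSs and your ChIP-seq data in sorted BED format, you can use bedops --range --not-element-of on these two datasets to exclude any ChIP-seq peaks that fall in a window around each TSS.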

See the following docs for more information on these and other bedops operations. Also, the table schema for Drosophila RefSeq is available here, so you can see where those field names come from and what they map to.
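As a rough sketch of the TSS step, the function below turns BED6 transcript intervals into 1-bp, strand-aware TSS intervals. The function name `bed6_to_tss`, the file names, and the 1000-bp window are all placeholders of mine, not anything from the query above, and the input is assumed to be tab-delimited BED6 (0-based, half-open).

```shell
# Hypothetical helper: reduce BED6 transcript intervals to 1-bp TSS intervals.
# On the + strand the TSS is the first base of the interval (start);
# on the - strand it is the last base (end - 1 in 0-based, half-open BED).
bed6_to_tss() {
    awk 'BEGIN { FS = OFS = "\t" }
         $6 == "+" { print $1, $2, $2 + 1, $4, $5, $6 }
         $6 == "-" { print $1, $3 - 1, $3, $4, $5, $6 }' "$@"
}

# Example use (file names and the 1000-bp window are placeholders):
#   bed6_to_tss refseq_tss.bed | sort-bed - | uniq > tss_1bp.bed
#   bedops --range 1000 --not-element-of 1 peaks.bed tss_1bp.bed > peaks_distal.bed
```

As far as I understand the bedops documentation, with --not-element-of the first (reference) file is left unpadded and --range pads only the second file, so the command in the comment should keep peaks that do not overlap any TSS window by at least 1 bp. Records with a strand other than + or - are dropped by the helper.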

ADD COMMENT • link updated 2.4 years ago by Ram 43k • written 9.8 years ago by Alex Reynolds 35k

0

Thank you and I got it!

ADD REPLY • link 9.8 years ago by catherine ▴ 250

Ram · Accepted Answer · 2014-07-11

Within UCSC you can get the data you want.

First make sure you are currently viewing the right genome assembly, e.g. dm3.

Select 'Tools' (along the top of the screen) > 'Table Browser' to access the tables of data used by UCSC.

Choose: 'group' = 'Genes and Gene Predictions', 'track' (depending on your preference) = 'RefSeq Genes' or 'FlyBase Genes'.

If you select 'output format' = 'BED', then when you press 'get output' you will be given the option 'Create one BED record per' > 'Upstream by N bases'.

The resulting output file (to screen if you did not give a file name in the previous screen) will contain the coordinates of the promoter region for your analysis. Bear in mind that the coordinates are for transcripts (i.e. more than one transcript per gene).
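Since the output has one record per transcript, transcripts that share a start site will give identical promoter intervals. A minimal sketch for collapsing those, assuming tab-delimited BED6 output and using a made-up function name `dedupe_intervals`, keeps the first record seen for each chrom/start/end/strand combination:

```shell
# Hypothetical helper: keep only the first record for each unique
# (chrom, start, end, strand) combination in a tab-delimited BED6 file.
dedupe_intervals() {
    awk 'BEGIN { FS = "\t" } !seen[$1, $2, $3, $6]++' "$@"
}

# Example use (file names are placeholders):
#   dedupe_intervals upstream.bed > upstream_unique.bed
```

This only removes exact coordinate duplicates; alternative transcripts with different start sites still produce separate intervals, which may be what you want when masking regions near every TSS.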

Hope this helps.