Table browser +/- 2Kb of TSS export
3
1
Entering edit mode
5.2 years ago
rbronste ▴ 400

Wondering about easiest way to get a BED file of regions +/- 2kb around TSSs through table browser?

UCSC table broswer • 4.2k views
5
Entering edit mode
5.2 years ago

If you want to do this on the command line, you can do the following to get hg38 refSeq annotations (adjust for your genome of interest):

$wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gunzip -c - | awk 'BEGIN{ OFS="\t" }{ print$3, $5,$6, $13,$10, $4 }' - | sort-bed - > refGene.bed  Then split them by strand and pad around the stranded-start position of the annotation: $ awk '($6 == "+") { print$0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($2 > 2000){ print$1, ($2 - 2000), ($2 + 2000), $4,$5, $6 }' > refGene.tss.for.padded.bed$ awk '($6 == "-") { print$0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($3 > 2000){ print$1, ($3 - 2000), ($3 + 2000), $4,$5, $6 }' > refGene.tss.rev.padded.bed$ bedops --everything refGene.tss.for.padded.bed refGene.tss.rev.padded.bed > refGene.tss.padded.bed


Generally, that could give problems if a padded TSS ends up going outside chromosomal bounds. You could filter them out with a mix of Kent and BEDOPS tools:

$fetchChromSizes hg38 | awk '{ print$1"\t0\t"$2; }' | sort-bed - > hg38.bounds.bed$ bedops --element-of 100% refGene.tss.padded.bed hg38.bounds.bed > refGene.tss.padded.filtered.bed


For 2k of padding, this is probably not a major issue. I don't believe one would generally find many genes (and their TSSs) within 2k of the tips of chromosomes. Nonetheless, this would deal with this scenario in a general way.

0
Entering edit mode

Very helpful thanks! I am trying to modify this with a subset of TSSs from genes I have highly expressed in an RNAseq experiment, what would be the most straightforward way to do that?

0
Entering edit mode

Once you make the file refGene.tss.padded.bed, you could filter this with bedops --element-of 1. Say you have a sort-bed-sorted set of TSSs from your RNAseq experiment, which is called rnaseq.tss.bed:

$sort-bed rnaseq.tss.unsorted.bed > rnaseq.tss.bed$ bedops --element-of 1 refGene.tss.padded.bed rnaseq.tss.bed > filtered.refGene.tss.padded.bed


The file filtered.refGene.tss.padded.bed will contain a subset of TSSs from refGene.tss.padded.bed that overlap the set of RNAseq-derived TSSs.

0
Entering edit mode

Thanks again! One final question, in this particular instance if I want to use the refGene.tss.padded.bed file as a drop out, for instance take a set of intervals and drop out refGene.tss.padded.bed intervals what would be the most straightforward way to do that? Just want a final bed file without those intersections of "promoters".

1
Entering edit mode

The option --not-element-of or -n may be what you're after. It works like the inverse of --element-of. That is, given:

$bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed  The file answer.bed will contain elements from refGene.tss.padded.bed that do not overlap intervals.bed by one or more bases. This is the least stringent overlap criterion. If you want to filter out only those elements that overlap completely, use --not-element-of 100% or -n 100% instead of -n 1. This case is the most stringent overlap criterion — all bases of an element of refGene.tss.padded.bed must overlap an element in intervals.bed to be filtered out. ADD REPLY 0 Entering edit mode Its strange but when I do this backwards and forwards with my refGene.tss.padded.bed and target "intervals.bed" I get that answer.bed is a full overlap, for instance if my intervals.bed has 1000 intervals and refGene.tss.padded.bed has 39702 intervals and I do the following: bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed wc -l answer.bed 39702 answer.bed  Tried it with other data I have and the same happens. I should mention that I know some of those 1000 intervals fall within +/- 2kb of TSS. Am I doing something incorrectly? Thanks! ADD REPLY 0 Entering edit mode I would suggest checking that both inputs are sorted per sort-bed. I would also do a similar bedops -e 1 and wc -l test. You should get the inverse result, ie if there are 1000 elements and you expect that 800 of them do not overlap another set, then the remaining 200 should, if that makes sense. ADD REPLY 0 Entering edit mode If I have a list of common gene names such as the following: Kcnip1 Nat9 Zfp523 Akap1 Kif19a St18 Mettl1 Uba7 Pde3b Gstm7 H13 Slc20a1 Lrig2 Elovl2 Zfp185 Trmt112 Egfr Klhdc1 Cand2 Macrod1 Kcns1 Slc35a5 Anp32b  Can the approach be modified to only get the TSS information for those particular genes? I really like the way bedops does it here but couldn't figure this out. Thank you! ADD REPLY 1 Entering edit mode Can you tell me what assembly you're working with? I'm guessing mouse, but I don't know which one. Let me know and I'll paste in some scripts that might help you. ADD REPLY 1 Entering edit mode Here's one way you might get mouse gene TSSs for mm10 or GRCm38: $ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.annotation.gff3.gz \
| gunzip --stdout - \
| awk '$3 == "gene"' - \ | convert2bed -i gff - \ | awk -vwindow=1 -vOFS="\t" '($6=="+"){ print $1, ($2 - window), $2,$4, ".", $6,$7, $8,$9, $10 }($6=="-"){ print $1,$3, ($3 + window),$4, ".", $6,$7, $8,$9, $10 }' \ > gencode.vM18.tss.bed  Then to filter them: $ grep -w -F -i -f genes.txt gencode.vM18.tss.bed > gencode.vM18.tss.filtered.bed

1
Entering edit mode

this should also work for gencode gtf files as well

0
Entering edit mode

Yes I am using mm10, this is very helpful thank you!

0
Entering edit mode

One final question on this, I am trying to see if these particular peaks are within a certain distance of the TSSs this code allows me to isolate, lets say 1MB on either side but if I wanted to to adjust in the downstream direction also for the length of the gene is there a way that closest-feature can accomplish this?

1
Entering edit mode

I'm sorry, but I don't understand your question. Can you describe a little more what operation you're trying to do, or what the adjustment is?

0
Entering edit mode

Basically Id like to take this set of filtered TSSs and use closest-features to filter a peak list for those peaks that fall within lets say 1 mb of this TSS list, however adjusting the downstream measure by the length of the gene harboring the TSS. Does that sort of make sense? Basically I guess its a modification of the --dist flag in closest-features.

1
Entering edit mode

Removing --closest and using the --dist operator will report distances for the nearest elements upstream and downstream of the padded TSS.

You could pipe the result of that to awk or a Python script, using split() or similar function to chop up each result line, and use the distance measure with the size of the gene to apply your logic, to decide what to do with that result.

The output of --dist uses the the | delimiter, useful for downstream parsing. You can use the --delim <delimiter> operation to change this delimiter.

Note that adding --no-overlaps can help to avoid reporting an overlapping element as a "nearest neighbor". You might want to be careful of those guys.

There are a couple files to play with at the bottom of the documentation page, which can help with a bit more of a concrete demo: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/closest-features.html

0
Entering edit mode

Is there a way to set the --delim to establish the distance as a separate column?

1
Entering edit mode

You could use \t (tab) to make distance its own column in a tab-delimited file.

0
Entering edit mode

Yes I actually tried this (--delim \t) but comes up looking like the following:

0.000758255664811437t-42960


So its taking the last column in my peak file (FDR) and adding on the distance. Kind of odd.

1
Entering edit mode

Try adding ticks: --delim '\t'.

0
Entering edit mode

The gencode retrieval you suggest here is very helpful, I inputted a .txt file of 380 common gene names and received back a gencode.vM18.tss.filtered.bed that had 359 IDs. Is there a way to easily identify which were not converted and maybe a way to add those as well? Or do you do this manually? I assume its a lot of riken type things but unsure. Thanks again!

0
Entering edit mode

You could extract the list of gene names that were matched:

$awk -F";" '{sub(/gene_name=/,""); print$4;}' gencode.vM18.tss.filtered.bed > matches.txt


Then grep -v to find genes from genes.txt that do not have matches in matches.txt:

$grep -v -w -F -i -f matches.txt genes.txt > not-found.txt  Then maybe explore not-found.txt to see what tweaks are needed, if any can be done, to relax your search parameters. ADD REPLY 1 Entering edit mode 5.2 years ago genecats.ucsc ▴ 580 The general idea is to choose BED as the output format, and from the "get output" page add some padding bases to the "upstream" and "downstream" options. Unfortunately you can't add both upstream and downstream bases at the same time, so you'll have to merge the files when you're done. The txStart field of the gene track you're interested in will be the TSS. Alternatively, you could choose the selected fields from primary and related tables option, and the select the chrom, txStart, and txEnd fields, and use awk or something similar to add/subtract 2kb to each position, but this gets somewhat tricky for negative strand genes, so the above option is recommended instead. Please be aware that UCSC has a publicly archived mailing list that you can search to see if your question has been previously answered. Here is an example of a similar question from our archive: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS/genome/567C63DhAo0/9p5vA5PktAYJ

The general link to the mailing list is here, where you can search for and ask UCSC Genome Browser specific questions: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome

Thanks,

ChrisL from the UCSC Genome Browser

0
Entering edit mode

Actually the awk script in the thread below does this really nicely with a simple BED file of all refSeq genes.

A: How to get promoter coordinates of hg19 from UCSC genome browser ?

0
Entering edit mode
5.2 years ago
mmmmcandrew ▴ 150

In addition to the answers above, another way to do this is with bedtools slop. You'll need the bed file containing the TSS from UCSC table browser, as well as a genome file. This will add the requested amount of base pairs to each end, accounting for chromosome sizes, etc. using the genome file.