Question: Table browser +/- 2Kb of TSS export
0
gravatar for rbronste
15 months ago by
rbronste170
rbronste170 wrote:

Wondering about easiest way to get a BED file of regions +/- 2kb around TSSs through table browser?

ucsc table broswer • 743 views
ADD COMMENTlink modified 15 months ago by mmmmcandrew60 • written 15 months ago by rbronste170
3
gravatar for Alex Reynolds
15 months ago by
Alex Reynolds25k
Seattle, WA USA
Alex Reynolds25k wrote:

If you want to do this on the command line, you can do the following to get hg38 refSeq annotations (adjust for your genome of interest):

$ wget -qO- http://hgdownload.cse.ucsc.edu/goldenPath/hg38/database/refGene.txt.gz | gunzip -c - | awk 'BEGIN{ OFS="\t" }{ print $3, $5, $6, $13, $10, $4  }' - | sort-bed - > refGene.bed

Then split them by strand and pad around the stranded-start position of the annotation:

$ awk '($6 == "+") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($2 > 2000){ print $1, ($2 - 2000), ($2 + 2000), $4, $5, $6  }' > refGene.tss.for.padded.bed
$ awk '($6 == "-") { print $0 }' refGene.bed | awk 'BEGIN{ OFS="\t" }($3 > 2000){ print $1, ($3 - 2000), ($3 + 2000), $4, $5, $6  }' > refGene.tss.rev.padded.bed
$ bedops --everything refGene.tss.for.padded.bed refGene.tss.rev.padded.bed > refGene.tss.padded.bed

Generally, that could give problems if a padded TSS ends up going outside chromosomal bounds. You could filter them out with a mix of Kent and BEDOPS tools:

$ fetchChromSizes hg38 | awk '{ print $1"\t0\t"$2; }' | sort-bed - > hg38.bounds.bed
$ bedops --element-of 100% refGene.tss.padded.bed hg38.bounds.bed > refGene.tss.padded.filtered.bed

For 2k of padding, this is probably not a major issue. I don't believe one would generally find many genes (and their TSSs) within 2k of the tips of chromosomes. Nonetheless, this would deal with this scenario in a general way.

ADD COMMENTlink modified 15 months ago • written 15 months ago by Alex Reynolds25k

Very helpful thanks! I am trying to modify this with a subset of TSSs from genes I have highly expressed in an RNAseq experiment, what would be the most straightforward way to do that?

ADD REPLYlink written 14 months ago by rbronste170

Once you make the file refGene.tss.padded.bed, you could filter this with bedops --element-of 1. Say you have a sort-bed-sorted set of TSSs from your RNAseq experiment, which is called rnaseq.tss.bed:

$ sort-bed rnaseq.tss.unsorted.bed > rnaseq.tss.bed
$ bedops --element-of 1 refGene.tss.padded.bed rnaseq.tss.bed > filtered.refGene.tss.padded.bed

The file filtered.refGene.tss.padded.bed will contain a subset of TSSs from refGene.tss.padded.bed that overlap the set of RNAseq-derived TSSs.

ADD REPLYlink modified 14 months ago • written 14 months ago by Alex Reynolds25k

Thanks again! One final question, in this particular instance if I want to use the refGene.tss.padded.bed file as a drop out, for instance take a set of intervals and drop out refGene.tss.padded.bed intervals what would be the most straightforward way to do that? Just want a final bed file without those intersections of "promoters".

ADD REPLYlink written 7 months ago by rbronste170
1

The option --not-element-of or -n may be what you're after. It works like the inverse of --element-of. That is, given:

$ bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed

The file answer.bed will contain elements from refGene.tss.padded.bed that do not overlap intervals.bed by one or more bases. This is the least stringent overlap criterion.

If you want to filter out only those elements that overlap completely, use --not-element-of 100% or -n 100% instead of -n 1. This case is the most stringent overlap criterion — all bases of an element of refGene.tss.padded.bed must overlap an element in intervals.bed to be filtered out.

ADD REPLYlink written 7 months ago by Alex Reynolds25k

Its strange but when I do this backwards and forwards with my refGene.tss.padded.bed and target "intervals.bed" I get that answer.bed is a full overlap, for instance if my intervals.bed has 1000 intervals and refGene.tss.padded.bed has 39702 intervals and I do the following:

bedops -n 1 refGene.tss.padded.bed intervals.bed > answer.bed
wc -l answer.bed
39702 answer.bed

Tried it with other data I have and the same happens. I should mention that I know some of those 1000 intervals fall within +/- 2kb of TSS. Am I doing something incorrectly? Thanks!

ADD REPLYlink modified 7 months ago • written 7 months ago by rbronste170

I would suggest checking that both inputs are sorted per sort-bed. I would also do a similar bedops -e 1 and wc -l test. You should get the inverse result, ie if there are 1000 elements and you expect that 800 of them do not overlap another set, then the remaining 200 should, if that makes sense.

ADD REPLYlink written 7 months ago by Alex Reynolds25k

If I have a list of common gene names such as the following:

Kcnip1
Nat9
Zfp523
Akap1
Kif19a
St18
Mettl1
Uba7
Pde3b
Gstm7
H13
Slc20a1
Lrig2
Elovl2
Zfp185
Trmt112
Egfr
Klhdc1
Cand2
Macrod1
Kcns1
Slc35a5
Anp32b

Can the approach be modified to only get the TSS information for those particular genes? I really like the way bedops does it here but couldn't figure this out. Thank you!

ADD REPLYlink written 8 days ago by rbronste170
1

Can you tell me what assembly you're working with? I'm guessing mouse, but I don't know which one. Let me know and I'll paste in some scripts that might help you.

ADD REPLYlink written 8 days ago by Alex Reynolds25k
1

Here's one way you might get mouse gene TSSs for mm10 or GRCm38:

$ wget -qO- ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M18/gencode.vM18.annotation.gff3.gz \
    | gunzip --stdout - \
    | awk '$3 == "gene"' - \
    | convert2bed -i gff - \
    | awk -vwindow=1 -vOFS="\t" '($6=="+"){ print $1, ($2 - window), $2, $4, ".", $6, $7, $8, $9, $10 }($6=="-"){ print $1, $3, ($3 + window), $4, ".", $6, $7, $8, $9, $10 }' \
    > gencode.vM18.tss.bed

Then to filter them:

$ grep -w -F -i -f genes.txt gencode.vM18.tss.bed > gencode.vM18.tss.filtered.bed
ADD REPLYlink modified 8 days ago • written 8 days ago by Alex Reynolds25k

Yes I am using mm10, this is very helpful thank you!

ADD REPLYlink written 7 days ago by rbronste170

One final question on this, I am trying to see if these particular peaks are within a certain distance of the TSSs this code allows me to isolate, lets say 1MB on either side but if I wanted to to adjust in the downstream direction also for the length of the gene is there a way that closest-feature can accomplish this?

ADD REPLYlink written 7 days ago by rbronste170
1

I'm sorry, but I don't understand your question. Can you describe a little more what operation you're trying to do, or what the adjustment is?

ADD REPLYlink written 7 days ago by Alex Reynolds25k

Basically Id like to take this set of filtered TSSs and use closest-features to filter a peak list for those peaks that fall within lets say 1 mb of this TSS list, however adjusting the downstream measure by the length of the gene harboring the TSS. Does that sort of make sense? Basically I guess its a modification of the --dist flag in closest-features.

ADD REPLYlink written 7 days ago by rbronste170
1

Removing --closest and using the --dist operator will report distances for the nearest elements upstream and downstream of the padded TSS.

You could pipe the result of that to awk or a Python script, using split() or similar function to chop up each result line, and use the distance measure with the size of the gene to apply your logic, to decide what to do with that result.

The output of --dist uses the the | delimiter, useful for downstream parsing. You can use the --delim <delimiter> operation to change this delimiter.

Note that adding --no-overlaps can help to avoid reporting an overlapping element as a "nearest neighbor". You might want to be careful of those guys.

There are a couple files to play with at the bottom of the documentation page, which can help with a bit more of a concrete demo: http://bedops.readthedocs.io/en/latest/content/reference/set-operations/closest-features.html

ADD REPLYlink modified 7 days ago • written 7 days ago by Alex Reynolds25k

Very helpful thank you!

Is there a way to set the --delim to establish the distance as a separate column?

ADD REPLYlink written 4 days ago by rbronste170
1

You could use \t (tab) to make distance its own column in a tab-delimited file.

ADD REPLYlink written 4 days ago by Alex Reynolds25k

Yes I actually tried this (--delim \t) but comes up looking like the following:

0.000758255664811437t-42960

So its taking the last column in my peak file (FDR) and adding on the distance. Kind of odd.

ADD REPLYlink written 4 days ago by rbronste170
1

Try adding ticks: --delim '\t'.

ADD REPLYlink written 4 days ago by Alex Reynolds25k

The gencode retrieval you suggest here is very helpful, I inputted a .txt file of 380 common gene names and received back a gencode.vM18.tss.filtered.bed that had 359 IDs. Is there a way to easily identify which were not converted and maybe a way to add those as well? Or do you do this manually? I assume its a lot of riken type things but unsure. Thanks again!

ADD REPLYlink written 3 days ago by rbronste170

You could extract the list of gene names that were matched:

$ awk -F";" '{sub(/gene_name=/,""); print $4;}' gencode.vM18.tss.filtered.bed > matches.txt

Then grep -v to find genes from genes.txt that do not have matches in matches.txt:

$ grep -v -w -F -i -f matches.txt genes.txt > not-found.txt

Then maybe explore not-found.txt to see what tweaks are needed, if any can be done, to relax your search parameters.

ADD REPLYlink modified 3 days ago • written 3 days ago by Alex Reynolds25k
1
gravatar for genecats.ucsc
15 months ago by
genecats.ucsc510
genecats.ucsc510 wrote:

The general idea is to choose BED as the output format, and from the "get output" page add some padding bases to the "upstream" and "downstream" options. Unfortunately you can't add both upstream and downstream bases at the same time, so you'll have to merge the files when you're done. The txStart field of the gene track you're interested in will be the TSS.

Alternatively, you could choose the selected fields from primary and related tables option, and the select the chrom, txStart, and txEnd fields, and use awk or something similar to add/subtract 2kb to each position, but this gets somewhat tricky for negative strand genes, so the above option is recommended instead.

Please be aware that UCSC has a publicly archived mailing list that you can search to see if your question has been previously answered. Here is an example of a similar question from our archive: https://groups.google.com/a/soe.ucsc.edu/forum/#!searchin/genome/find$20TSS/genome/567C63DhAo0/9p5vA5PktAYJ

The general link to the mailing list is here, where you can search for and ask UCSC Genome Browser specific questions: https://groups.google.com/a/soe.ucsc.edu/forum/#!forum/genome

Thanks,

ChrisL from the UCSC Genome Browser

ADD COMMENTlink written 15 months ago by genecats.ucsc510

Actually the awk script in the thread below does this really nicely with a simple BED file of all refSeq genes.

A: How to get promoter coordinates of hg19 from UCSC genome browser ?

ADD REPLYlink written 15 months ago by rbronste170
0
gravatar for mmmmcandrew
15 months ago by
mmmmcandrew60
mmmmcandrew60 wrote:

In addition to the answers above, another way to do this is with bedtools slop. You'll need the bed file containing the TSS from UCSC table browser, as well as a genome file. This will add the requested amount of base pairs to each end, accounting for chromosome sizes, etc. using the genome file.

ADD COMMENTlink written 15 months ago by mmmmcandrew60
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 657 users visited in the last hour