Question: Extract coordinates of upstream region up to closest coding region in R
0
gravatar for eggrandio
8 months ago by
eggrandio30
eggrandio30 wrote:

Hi,

I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:

library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)

However, some regions overlap with coding regions upstream (other genes).

I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).

I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.

Thanks in advance!

sequence R genome • 328 views
ADD COMMENTlink modified 8 months ago by Alex Reynolds29k • written 8 months ago by eggrandio30
2
gravatar for Alex Reynolds
8 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

Perhaps the following script (with modifications specific to the genes you want to use) will help.

This bash script combines BEDOPS bedmap, bedops, and gtf2bed with awk processing steps to conditionally process promoter regions, where there are overlaps with gene annotations.

#!/bin/bash                                                                                                                                                                                                                 

GENES=Arabidopsis_thaliana.TAIR10.37.genes.bed

wget -qO- ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.37.gtf.gz | gunzip -c | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' | gtf2bed - | awk '($8=="gene")' > ${GENES}                                                                                                                                                                              

awk -vFS="\t" -vOFS="\t" '($6=="+")' ${GENES} > ${GENES}.for
awk -vFS="\t" -vOFS="\t" '($6=="-")' ${GENES} > ${GENES}.rev

TSS=Arabidopsis_thaliana.TAIR10.37.tss.bed

awk -vFS="\t" -vOFS="\t" '{ print $1, ($2-1), $2, $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.for > ${TSS}.for
awk -vFS="\t" -vOFS="\t" '{ print $1, $3, ($3+1), $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.rev > ${TSS}.rev

WINDOW=2000    
PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.bed

bedops --range -${WINDOW}:0 --everything ${TSS}.for > ${PROMOTERS}.for
bedops --range 0:${WINDOW} --everything ${TSS}.rev > ${PROMOTERS}.rev

UPSTREAM_FILTERED_PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.upstreamFiltered.bed

bedmap --count --echo --echo-map-range ${PROMOTERS}.for ${GENES}.for | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneEnd=b[3]; if ((geneEnd < tssEnd) && (geneEnd >= tssStart)) { print a[1], geneEnd, tssEnd, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.for
bedmap --count --echo --echo-map-range ${PROMOTERS}.rev ${GENES}.rev | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneStart=b[2]; if ((geneStart >= tssStart) && (geneStart < tssEnd)) { print a[1], tssStart, geneStart, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.rev

#                                                                                                                                                                                                                           
# cleanup                                                                                                                                                                                                                   
#                                                                                                                                                                                                                           

bedops --everything ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev > ${UPSTREAM_FILTERED_PROMOTERS}
rm ${GENES}.for ${GENES}.rev ${TSS}.for ${TSS}.rev ${PROMOTERS}.for ${PROMOTERS}.rev ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev

The input to this pipeline is a BED file called ${GENES}. You could replace the Ensembl A. thaliana annotations with any BED file of genes.

Note: This input BED file comes from a GTF file and so has ten columns. If this input BED file has fewer than ten columns, then the awk statements will need adjustment to report fewer columns.

The output is a set of promoter regions in ${UPSTREAM_FILTERED_PROMOTERS}, where the upstream region of the 2kb promoter is clipped by any overlapping genes, and the gene is extended to the upstream-most edge of the promoter.

The definition of upstream here is strand-specific. So forward- and reverse-stranded annotations will have respective directionality of upstream.

This script breaks everything down by strand and by annotation category. You could comment out the cleanup section if you want to investigate the contents of each step, to verify that this works for your problem.

ADD COMMENTlink modified 8 months ago • written 8 months ago by Alex Reynolds29k
0
gravatar for Alex Reynolds
8 months ago by
Alex Reynolds29k
Seattle, WA USA
Alex Reynolds29k wrote:

You could use BEDOPS closest-features to find nearest upstream features and test if they are within range with bedmap --range, filtering them or scanning further upstream, depending on what you want to do next.

ADD COMMENTlink written 8 months ago by Alex Reynolds29k

Thanks for your reply. I know how to remove any coding region from the upstream regions, but the problem is that I only want the region from the transcription start site up to the closest upstream gene, and nothing else upstream. I would need some way to "anchor" the feature to the TSS but I do not know how to do that.

example:

upstream region:

========================(TSS)gene

overlapping gene:

     ======
========================(TSS)gene

If I remove it, I am left with:

=====      =============(TSS)gene

And I want to retrieve only:

           =============
ADD REPLYlink modified 8 months ago • written 8 months ago by eggrandio30

Do you have a situation where a 2kb window upstream of a gene can overlap another 2kb window from another gene?

ADD REPLYlink written 8 months ago by Alex Reynolds29k

Yes, that is my problem. Maybe it's not so common in human genome, but I can find examples very easily in Arabidopsis.

ADD REPLYlink written 8 months ago by eggrandio30

Okay, I provided a solution that should work reasonably generically with any set of annotations, including yours. Feel free to give it a shot if you like.

ADD REPLYlink written 8 months ago by Alex Reynolds29k
Please log in to add an answer.

Help
Access

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.
Powered by Biostar version 2.3.0
Traffic: 1882 users visited in the last hour