Question

Extract coordinates of upstream region up to closest coding region in R

0

Entering edit mode

5.2 years ago

eggrandio ▴ 40

Hi,

I am doing an analysis of transcription factor binding and I need to retrieve the promoter region of all the genes in the genome. I can obtain 2kb upstream easily from a TxDB object by:

library(GenomicRanges)
library(TxDb.Athaliana.BioMart.plantsmart28)
gnflanks = flank(genes(TxDb.Athaliana.BioMart.plantsmart28), width=2000)

However, some regions overlap with coding regions upstream (other genes).

I would like to obtain 2kb of the upstream sequence up to the nearest coding region (2kb if no overlaping gene).

I know how to subtract the coding regions from the promoters, but if there is still some "promoter" region upstream of the upstream gene, it will be retained.

Thanks in advance!

R sequence genome • 2.0k views

ADD COMMENT • link updated 5.1 years ago by Alex Reynolds 35k • written 5.2 years ago by eggrandio ▴ 40

score 2 · Answer 1 · 2019-03-04

Perhaps the following script (with modifications specific to the genes you want to use) will help.

This bash script combines BEDOPS bedmap, bedops, and gtf2bed with awk processing steps to conditionally process promoter regions, where there are overlaps with gene annotations.

#!/bin/bash                                                                                                                                                                                                                 

GENES=Arabidopsis_thaliana.TAIR10.37.genes.bed

wget -qO- ftp://ftp.ensemblgenomes.org/pub/release-37/plants/gtf/arabidopsis_thaliana/Arabidopsis_thaliana.TAIR10.37.gtf.gz | gunzip -c | awk '{ if ($0 ~ "transcript_id") print $0; else print $0" transcript_id \"\";"; }' | gtf2bed - | awk '($8=="gene")' > ${GENES}                                                                                                                                                                              

awk -vFS="\t" -vOFS="\t" '($6=="+")' ${GENES} > ${GENES}.for
awk -vFS="\t" -vOFS="\t" '($6=="-")' ${GENES} > ${GENES}.rev

TSS=Arabidopsis_thaliana.TAIR10.37.tss.bed

awk -vFS="\t" -vOFS="\t" '{ print $1, ($2-1), $2, $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.for > ${TSS}.for
awk -vFS="\t" -vOFS="\t" '{ print $1, $3, ($3+1), $4, $5, $6, $7, $8, $9, $10 }' ${GENES}.rev > ${TSS}.rev

WINDOW=2000    
PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.bed

bedops --range -${WINDOW}:0 --everything ${TSS}.for > ${PROMOTERS}.for
bedops --range 0:${WINDOW} --everything ${TSS}.rev > ${PROMOTERS}.rev

UPSTREAM_FILTERED_PROMOTERS=Arabidopsis_thaliana.TAIR10.37.promoters.upstreamFiltered.bed

bedmap --count --echo --echo-map-range ${PROMOTERS}.for ${GENES}.for | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneEnd=b[3]; if ((geneEnd < tssEnd) && (geneEnd >= tssStart)) { print a[1], geneEnd, tssEnd, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.for
bedmap --count --echo --echo-map-range ${PROMOTERS}.rev ${GENES}.rev | awk -vFS="|" -vOFS="\t" '{ if ($1==0) { print $2; } else { m=split($2,a,"\t"); tssStart=a[2]; tssEnd=a[3]; n=split($3,b,"\t"); geneStart=b[2]; if ((geneStart >= tssStart) && (geneStart < tssEnd)) { print a[1], tssStart, geneStart, a[4], a[5], a[6], a[7], a[8], a[9], a[10]; } } }' > ${UPSTREAM_FILTERED_PROMOTERS}.rev

#                                                                                                                                                                                                                           
# cleanup                                                                                                                                                                                                                   
#                                                                                                                                                                                                                           

bedops --everything ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev > ${UPSTREAM_FILTERED_PROMOTERS}
rm ${GENES}.for ${GENES}.rev ${TSS}.for ${TSS}.rev ${PROMOTERS}.for ${PROMOTERS}.rev ${UPSTREAM_FILTERED_PROMOTERS}.for ${UPSTREAM_FILTERED_PROMOTERS}.rev

The input to this pipeline is a BED file called ${GENES}. You could replace the Ensembl A. thaliana annotations with any BED file of genes.

Note: This input BED file comes from a GTF file and so has ten columns. If this input BED file has fewer than ten columns, then the awk statements will need adjustment to report fewer columns.

The output is a set of promoter regions in ${UPSTREAM_FILTERED_PROMOTERS}, where the upstream region of the 2kb promoter is clipped by any overlapping genes, and the gene is extended to the upstream-most edge of the promoter.

The definition of upstream here is strand-specific. So forward- and reverse-stranded annotations will have respective directionality of upstream.

This script breaks everything down by strand and by annotation category. You could comment out the cleanup section if you want to investigate the contents of each step, to verify that this works for your problem.

score 0 · Answer 2 · 2019-02-28

0

Entering edit mode

5.2 years ago

Alex Reynolds 35k

You could use BEDOPS closest-features to find nearest upstream features and test if they are within range with bedmap --range, filtering them or scanning further upstream, depending on what you want to do next.

ADD COMMENT • link 5.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Thanks for your reply. I know how to remove any coding region from the upstream regions, but the problem is that I only want the region from the transcription start site up to the closest upstream gene, and nothing else upstream. I would need some way to "anchor" the feature to the TSS but I do not know how to do that.

example:

upstream region:

========================(TSS)gene

overlapping gene:

     ======
========================(TSS)gene

If I remove it, I am left with:

=====      =============(TSS)gene

And I want to retrieve only:

           =============

ADD REPLY • link 5.2 years ago by eggrandio ▴ 40

0

Entering edit mode

Do you have a situation where a 2kb window upstream of a gene can overlap another 2kb window from another gene?

ADD REPLY • link 5.2 years ago by Alex Reynolds 35k

0

Entering edit mode

Yes, that is my problem. Maybe it's not so common in human genome, but I can find examples very easily in Arabidopsis.

ADD REPLY • link 5.1 years ago by eggrandio ▴ 40

0

Entering edit mode

Okay, I provided a solution that should work reasonably generically with any set of annotations, including yours. Feel free to give it a shot if you like.

ADD REPLY • link 5.1 years ago by Alex Reynolds 35k