I have the peak file of Active Enhancer mark (H3K27ac). I have decided the 100kb region up and down stream of TSS to link genes to the Enhancer. My peak file is as below:
I want to get a list of genes which are within 100kb region of these peak coordinates. Can anybody guide me how I can do that. I was looking into ChIPseeker package (annotatPeak function) but I was not getting a clue how to do it!
For example if I say first peak the "Enhancer_1" then I want to link all the TSS that lies with in 100 kb region domain with this enhancer, like a column containing all target genes as: Enhancer_1: Gene1, Gene2, Gene3 ... Gene20
> library(Homo.sapiens)# from bioconductor> library(dplyr)# optional, but for the %>% thing# The import function from GRanges would be the way to import BED files,# but in this case doesn't work correctly because there are extra columns># import("example.bed", "BED") > tf = read.delim("example.bed", header=F, stringsAsFactors=F) %>%
dplyr::rename(chr=V1, start=V2, end=V3) %>%
makeGRangesFromDataFrame(., keep.extra.columns=T)# Get the coordinates of all TSS in human genome. The TxDB object is part of Homo.sapiens.# To get only one TSS per gene, use the genes() function instead of transcripts()> human.tss = resize(transcripts(TxDb.Hsapiens.UCSC.hg19.knownGene, columns=c('gene_id', 'tx_id')), width=1, fix='start')# Get TSS within 100000 bp of your tf list:> mergeByOverlaps(human.tss, tf + 100000)#see also subsetByOverlaps
DataFrame with 54 rows and 9 columns
human.tss gene_id tx_id
<GRanges><CharacterList><integer>
1 chr8:145617095-145617095:+ 203054 32948
2 chr8:145660551-145660551:+ 32949
3 chr8:145691738-145691738:+ 90990 32950
4 chr8:145697960-145697960:+ 32951
5 chr8:145715417-145715417:+ 84988 32952
... ... ... ...
50 chr20:33872619-33872619:- 3692 71966
51 chr20:33880225-33880225:- 128876 71967
52 chr20:33880225-33880225:- 128876 71968
53 chr20:33894862-33894862:- 55245 71969
54 chr20:33893057-33893057:- 71979
tf...1e.05 V4 V5 V6 V7
<GRanges><character><integer><character><character>
1 chr8:145601383-145802686 Rank_1 88 . 5.97704 12.09649
2 chr8:145601383-145802686 Rank_1 88 . 5.97704 12.09649
3 chr8:145601383-145802686 Rank_1 88 . 5.97704 12.09649
4 chr8:145601383-145802686 Rank_1 88 . 5.97704 12.09649
5 chr8:145601383-145802686 Rank_1 88 . 5.97704 12.09649
... ... ... ... ... ...
50 chr20:33750820-33952880 Rank_2 78 . 4.68799 10.87448
51 chr20:33750820-33952880 Rank_2 78 . 4.68799 10.87448
52 chr20:33750820-33952880 Rank_2 78 . 4.68799 10.87448
53 chr20:33750820-33952880 Rank_2 78 . 4.68799 10.87448
54 chr20:33750820-33952880 Rank_2 78 . 4.68799 10.87448
EDIT: for some reasons I misread your question at first, so I wrote how to get the closest genes instead. I'll leave this year for legacy.
# Get the closest genes> neighbor.genes = genes(TxDb.Hsapiens.UCSC.hg19.knownGene)[nearest(tf, genes(TxDb.Hsapiens.UCSC.hg19.knownGene))]# Get the identifier of the original TF, so you can know which TF is closest to each gene:> mcols(neighbor.genes)$tf= tf$V4> neighbor.genes
GRanges object with 2 ranges and 2 metadata columns:
seqnames ranges strand | gene_id tf
<Rle><IRanges><Rle>|<character><character>
8928 chr8 [145699115, 145701718] - | 8928 Rank_1
55741 chr20 [ 33703160, 33865960] - | 55741 Rank_2
-------
GRanges object with 2 ranges and 1 metadata column:
seqnames ranges strand | gene_id
<Rle><IRanges><Rle>|<character>
8928 chr8 [145699115, 145701718] - | 8928
55741 chr20 [ 33703160, 33865960] - | 55741
-------
seqinfo: 93 sequences (1 circular) from hg19 genome