Find Transcription factor binding site for a gene list
1
0
Entering edit mode
4.0 years ago

Hi everyone,

I have a list of genes that are different expressed between tumor and normal tissue. I have two drugs that block two different transcriptional factor (let's say TF1 and TF2). What I want to do is to find which of these genes have in their promoter a TF binding site for TF1, for TF2 and for both.

For a similar task with a singular gene I did this:

library("TFBSTools")
library("JASPAR2018")
library("Biostrings")
#to retrieve the matrix for a specific tf in thre jaspar database
opts <- list()
opts[["species"]] <- 9606
opts[["name"]] <- "MYC"
PFMatrixList <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(PFMatrixList, pseudocounts=0.8) #generation of PWM matrix for MYC
subject <- DNAString(seq1) #making as DNA string
#finding the site in the sequence
siteset <- searchSeq(pwm, subject, seqname="seq1", min.score="60%", strand="*")


I have around 100 genes I think that this approach is impraticable.

Salvo

R TFBS promoter • 3.3k views
1
Entering edit mode

First of all, what you aim to find is a transcription factor motif. If a motif is also a binding site needs to be determined by experiment, because not every motif automatically means that a TF will bind there. For example, a motif in heterochromatin (and there a thousands of "unused" motifs for every factor across the genome) is unlikely to ba bound by a TF. You should have a look at FIMO from the MEME suite. It takes as input a fasta file with sequences, e.g. your promoter sequences, and a motif position frequency matrix, e.g. from JASPAR, and then scans the sequences for motif occurrence, outputting a GFF file.

0
Entering edit mode

thank you, I will give a look to FIMO. I am not sure it is what I need. If I understand well I have to provide the FASTA for each gene I want to scan, right? Meaning that I have to manually download 100 genes and their promoters in the FASTA format. With FIMO can I automize this process?

1
Entering edit mode

You can pass a multifasta to FIMO, in the form:

>chr1:1-100
ATAGCTACG(...)
>chr2:1:1000
ATGGACTA(...)


Export the list of genomic coordinates to disk, and then use bedtools getfasta to make the multifasta. This can be fed into FIMO, which is fully automated from this point on.

0
Entering edit mode

If I understood well, I get the genomic coordinates for my list of genes and with bedtools getfasta I obtain the fasta, I take all the fasta in one multifasta that I feed to FIMO, right?

1
Entering edit mode

If you feed a BED file with all the coordinates you have (your promoter regions), then bedtools will produce a multifasta file, each line representing one line of coordinates:

$cat test.bed chr1 100000 100010 chr2 200000 200020$ bedtools getfasta -fi hg38_noALT_withDecoy.fa -bed test.bed
>chr1:100000-100010
ACTAAGCACA
>chr2:200000-200020
GTCTTAATATATACATAGGT


This multifa you feed into FIMO.

0
Entering edit mode

how I can generate a BED file just for the promoter regions of a list of genes?

1
Entering edit mode

By querying a database of promoter regions if available, and if not available, creating one.

1
Entering edit mode

For simplicity, you could take a window upstream of your genes, say 250bp.

0
Entering edit mode

I was thinking something similar if I am not finding a good database

0
Entering edit mode

I think you do not need a database. A promoter is always upstream of the first exon. When you do ATAC-seq (scan for open chromatin), a typical peak from a nucleosome-free region is hardly larger than 200bp. For motif scanning, in order to avoid matches by change, you anyway should limit the size of the regions you scan. I would go for e.g. 50bp downstream and 200bp upstream of the first exon, resulting in the 250bp, and then run the analysis. See if it works out, which I think it should.

1
Entering edit mode
3.9 years ago

There are some databases as well:

1. TF2DNA [Allows batch search]
2. CISBP [requires sequences of promoters]

Cheers !!