Entering edit mode
7.2 years ago
salvatore.raieli2
▴
90
Hi everyone,
I have a list of genes that are different expressed between tumor and normal tissue. I have two drugs that block two different transcriptional factor (let's say TF1 and TF2). What I want to do is to find which of these genes have in their promoter a TF binding site for TF1, for TF2 and for both.
For a similar task with a singular gene I did this:
library("TFBSTools")
library("JASPAR2018")
library("Biostrings")
#to retrieve the matrix for a specific tf in thre jaspar database
opts <- list()
opts[["species"]] <- 9606
opts[["name"]] <- "MYC"
PFMatrixList <- getMatrixSet(JASPAR2018, opts)
pwm <- toPWM(PFMatrixList, pseudocounts=0.8) #generation of PWM matrix for MYC
seq1 <- read.delim(...) #the sequence of a gene with the promoter I downloaded from NCBI
subject <- DNAString(seq1) #making as DNA string
#finding the site in the sequence
siteset <- searchSeq(pwm, subject, seqname="seq1", min.score="60%", strand="*")
I have around 100 genes I think that this approach is impraticable.
Thank you in advance
Salvo
First of all, what you aim to find is a transcription factor motif. If a motif is also a binding site needs to be determined by experiment, because not every motif automatically means that a TF will bind there. For example, a motif in heterochromatin (and there a thousands of "unused" motifs for every factor across the genome) is unlikely to ba bound by a TF. You should have a look at FIMO from the
MEME suite. It takes as input a fasta file with sequences, e.g. your promoter sequences, and a motif position frequency matrix, e.g. fromJASPAR, and then scans the sequences for motif occurrence, outputting a GFF file.thank you, I will give a look to FIMO. I am not sure it is what I need. If I understand well I have to provide the FASTA for each gene I want to scan, right? Meaning that I have to manually download 100 genes and their promoters in the FASTA format. With FIMO can I automize this process?
You can pass a multifasta to FIMO, in the form:
Export the list of genomic coordinates to disk, and then use
bedtools getfastato make the multifasta. This can be fed into FIMO, which is fully automated from this point on.If I understood well, I get the genomic coordinates for my list of genes and with bedtools getfasta I obtain the fasta, I take all the fasta in one multifasta that I feed to FIMO, right?
If you feed a BED file with all the coordinates you have (your promoter regions), then bedtools will produce a multifasta file, each line representing one line of coordinates:
This multifa you feed into FIMO.
how I can generate a BED file just for the promoter regions of a list of genes?
By querying a database of promoter regions if available, and if not available, creating one.
For simplicity, you could take a window upstream of your genes, say 250bp.
I was thinking something similar if I am not finding a good database
I think you do not need a database. A promoter is always upstream of the first exon. When you do ATAC-seq (scan for open chromatin), a typical peak from a nucleosome-free region is hardly larger than 200bp. For motif scanning, in order to avoid matches by change, you anyway should limit the size of the regions you scan. I would go for e.g. 50bp downstream and 200bp upstream of the first exon, resulting in the 250bp, and then run the analysis. See if it works out, which I think it should.