Question

How to query for a specific motif upstream of each gene in a list?

9.9 years ago

datanerd ▴ 520

Hi all,

I have a set of genes (around 200) and I would like to know whether, and how many of them, contain a specific binding motif upstream of the gene. What tool or database would be best to use? If anyone knows of any relevant papers, please let me know.

Thanks in advance!

Mamta

transcription factor binding motif • 7.0k views

updated 9.9 years ago by Jason ▴ 920 • written 9.9 years ago by datanerd ▴ 520

Ram · Answer 1 · 2014-05-08

9.9 years ago

Alex Reynolds 35k

You could do a FIMO scan with (for example) the JASPAR or UniPROBE databases, to give you "hits" or positions for motifs in those databases. (You could also do a FIMO scan against your own database of motifs of interest, assuming you bring your own position frequency matrix file.)

Once you have those positions, you can take a BED file of gene annotations and use the BEDOPS tools bedops and bedmap to look for any hits that map to a region upstream of the gene.
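
As a rough sketch of the step between FIMO and BEDOPS: FIMO reports 1-based, inclusive positions, while BED is 0-based and half-open, so the hits need a small conversion before they can serve as motif_hits.bed. The snippet below assumes the tab-separated column order written by recent FIMO releases (motif_id, motif_alt_id, sequence_name, start, stop, strand, score, p-value, q-value, matched_sequence) and assumes the scan was run on whole chromosomes, so that sequence_name is a chromosome name. Older releases lack the motif_alt_id column, so check the header line of your own output first. The function name fimo_to_bed and the demo record are made up for illustration.

```shell
# Convert FIMO TSV (1-based, inclusive) to 6-column BED (0-based, half-open).
# Assumed columns: motif_id, motif_alt_id, sequence_name, start, stop,
# strand, score, p-value, q-value, matched_sequence.
fimo_to_bed() {   # usage: fimo_to_bed FIMO_TSV
    awk 'BEGIN { FS = OFS = "\t" }
         /^#/ || NF < 9 || $1 == "motif_id" { next }   # skip comments, blanks, header
         { print $3, $4 - 1, $5, $1, $7, $6 }' "$1" \
        | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Tiny made-up input to show the conversion.
demo=$(mktemp)
printf 'motif_id\tmotif_alt_id\tsequence_name\tstart\tstop\tstrand\tscore\tp-value\tq-value\tmatched_sequence\n' > "$demo"
printf 'motifA\taltA\tchr1\t101\t110\t+\t12.5\t1e-5\t0.01\tCCATATATGG\n' >> "$demo"
fimo_to_bed "$demo"
```

BEDOPS expects its inputs in sort-bed order, so running the result through sort-bed before bedmap is the safer choice if you are unsure about the sort above.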

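For example, here's one way to look 1000 bases upstream of stranded genes for any overlapping motif hits, returning the gene and a list of unique motif names associated with it. Note that each gene is padded by 1000 bases on its upstream side before mapping, so hits inside the gene body are reported along with hits in the upstream window: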

$ awk '$6=="+"' genes.bed > genes.for.bed
$ awk '$6=="-"' genes.bed > genes.rev.bed
$ bedops --range -1000:0 --everything genes.for.bed \
    | bedmap --echo --echo-map-id-uniq - motif_hits.bed \
    | bedops --range 1000:0 --everything - \
    > answer.for.bed
$ bedops --range 0:1000 --everything genes.rev.bed \
    | bedmap --echo --echo-map-id-uniq - motif_hits.bed \
    | bedops --range 0:-1000 --everything - \
    > answer.rev.bed
$ bedops --everything answer.*.bed > answer.bed

If you don't know what motifs you're expecting to find, you can extract genomic regions upstream of your stranded genes, get the FASTA sequences with bed2fasta and run that through MEME:

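$ awk -v window=1000 '
    BEGIN { OFS = "\t" }
    {
        if ($6 == "+") {
            # forward strand: window ends one base into the gene start
            lo = $2 - window; hi = $2 + 1
        } else {
            # reverse strand: window begins at the last base of the gene
            lo = $3 - 1; hi = $3 + window
        }
        if (lo < 0) lo = 0   # do not run off the start of the chromosome
        print $1, lo, hi, $4, $5, $6
    }
' genes.bed > upstream_reg.bed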
$ bed2fasta.pl upstream_reg.bed /path/to/fasta/seqs > upstream_reg.fa
$ meme upstream_reg.fa <meme_search_options...>

The MEME output can then be used with FIMO as described above to retrieve a more detailed annotation of hits in upstream regions.

updated 4.3 years ago by Ram 43k • written 9.9 years ago by Alex Reynolds 35k

Thanks a lot, this is very informative :)

I am mainly interested in just one motif: I want to look it up in the gene set and see whether it is shared. If I am not wrong, the approach above would be great for finding all the known motifs in the gene set, wouldn't it?

updated 4.3 years ago by Ram 43k • written 9.9 years ago by datanerd ▴ 520

If you have data for your own custom motif, you could make a MEME-formatted file from that data and use FIMO to scan over the upstream regions to look for that motif.
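
If all you have is a consensus string, one possible way to get a minimal MEME-format file is sketched below: each IUPAC ambiguity code is spread evenly over the bases it stands for. The function name consensus_to_meme, the motif name my_motif, the uniform background and the nominal nsites value are all placeholders, and the demo consensus AAAYYGG is made up (it collapses sites such as AAATTGG and AAACCGG). The MEME Suite also ships an iupac2meme utility for this job, which is probably the better choice if it is installed.

```shell
# Write a minimal MEME-format motif from an IUPAC consensus string.
consensus_to_meme() {   # usage: consensus_to_meme MOTIF_NAME IUPAC_CONSENSUS
    awk -v name="$1" -v cons="$2" 'BEGIN {
        # IUPAC code -> the bases it stands for
        set["A"] = "A";   set["C"] = "C";   set["G"] = "G";   set["T"] = "T"
        set["R"] = "AG";  set["Y"] = "CT";  set["S"] = "CG";  set["W"] = "AT"
        set["K"] = "GT";  set["M"] = "AC";  set["B"] = "CGT"; set["D"] = "AGT"
        set["H"] = "ACT"; set["V"] = "ACG"; set["N"] = "ACGT"
        cons = toupper(cons); w = length(cons)
        # validate before printing anything
        for (i = 1; i <= w; i++)
            if (!(substr(cons, i, 1) in set)) {
                print "unsupported symbol at position " i > "/dev/stderr"
                exit 1
            }
        print "MEME version 4\n"
        print "ALPHABET= ACGT\n"
        print "strands: + -\n"
        print "Background letter frequencies"
        print "A 0.25 C 0.25 G 0.25 T 0.25\n"      # uniform background (assumption)
        print "MOTIF " name
        # nsites is a nominal placeholder, not a real site count
        print "letter-probability matrix: alength= 4 w= " w " nsites= 20 E= 0"
        for (i = 1; i <= w; i++) {
            s = set[substr(cons, i, 1)]; n = length(s)
            pa = (index(s, "A") > 0) / n; pc = (index(s, "C") > 0) / n
            pg = (index(s, "G") > 0) / n; pt = (index(s, "T") > 0) / n
            printf "%.6f %.6f %.6f %.6f\n", pa, pc, pg, pt   # column order A C G T
        }
    }'
}

consensus_to_meme my_motif AAAYYGG
```

Redirecting the output to a file gives something FIMO should be able to read as its motif input. A matrix built from real aligned sites will usually be more informative than one derived from a consensus.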

9.9 years ago by Alex Reynolds 35k

Hi,

So, I am trying to use a FIMO scan, and I know the consensus sequences for the motif. Do you know how I can get a MEME-formatted input motif file?

Does the PWM come from the sequences I have, or is there one already that I can use?
Which supported database would you consider appropriate?
For the FASTA sequence file, is it OK to upload just the upstream regions of the genes?

Thanks in advance!

updated 4.5 years ago by Ram 43k • written 9.9 years ago by datanerd ▴ 520

Hello Alex, could you please guide me with this question: Prediction of TF binding sites at genome wide scale

7.7 years ago by Bioinformatist Newbie ▴ 270

Ram · Answer 2 · 2014-05-08

9.9 years ago

HG ★ 1.2k

Have a look: http://www.bioconductor.org/help/workflows/gene-regulation-tfbs/

9.9 years ago by HG ★ 1.2k

Thanks for this :) It looks interesting.

Here they have worked with Saccharomyces cerevisiae. Would it work well enough for human data? Is there any paper citing it?

updated 4.3 years ago by Ram 43k • written 9.9 years ago by datanerd ▴ 520

I don't have much experience with human data, but we tried it earlier on Enterobacteriaceae.

updated 4.3 years ago by Ram 43k • written 9.9 years ago by HG ★ 1.2k

Ram · Answer 3 · 2014-05-08

9.9 years ago

Charles Warden 8.2k

If you are asking for a motif shared among your ~200 genes, SCOPE might be useful:

http://genie.dartmouth.edu/scope/

If you already know your motif (and, more specifically, you know that motif is a transcription factor binding site) and you want to check for the presence of that motif, then I think there should probably be some sort of useful tool in TRED:

http://rulai.cshl.edu/cgi-bin/TRED/tred.cgi?process=leftNavBar&menu_clicked=as_menu&as_menu_open=&ul_menu_open=

I also have a few more general links related to transcription factor analysis in this blog post:

http://cdwscience.blogspot.com/2013/03/bioinformatics-101-gene-expression.html

9.9 years ago by Charles Warden 8.2k

Thanks :) TRED seems like something that I would want to try.

updated 4.3 years ago by Ram 43k • written 9.9 years ago by datanerd ▴ 520

Ram · Answer 4 · 2014-05-08

9.9 years ago

mikhail.shugay 3.5k

If you have a position weight matrix for your motif, you can extract sequences for, e.g., 1000 bases upstream of your genes using the UCSC Genome Browser Table Browser and then profile them with UGENE (http://ugene.unipro.ru/documentation/manual/command_line/pwm_search.html). Of course, especially in TFBS searching, it is better to use several tools and compare the results. It is also quite important to know the background frequency of your motif, so that you can say whether the motif's presence is non-random. Finally, it is important to compare the motif positions you find with open chromatin for your cell line or tissue; see, e.g., the DNase sensitivity tracks in the UCSC Genome Browser.
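
To make the background-frequency point concrete, one common option (an assumption here, not something prescribed in this thread) is a one-sided hypergeometric test: count how many background genes have at least one motif hit upstream, count how many genes in your list do, and ask how surprising the second number is given the first. The sketch below does this in plain awk. The function name motif_enrichment_p and the numbers in the demo call are made up, and the choice of background (for instance, all annotated genes) is up to you.

```shell
# Upper-tail hypergeometric P-value for motif presence in a gene list.
#   N = genes in the background      K = background genes with the motif
#   n = genes in your list           k = genes in your list with the motif
motif_enrichment_p() {   # usage: motif_enrichment_p N K n k
    awk -v N="$1" -v K="$2" -v n="$3" -v k="$4" '
        function lchoose(a, b,    i, s) {   # natural log of "a choose b"
            s = 0
            for (i = 1; i <= b; i++) s += log(a - b + i) - log(i)
            return s
        }
        BEGIN {
            lo = n - (N - K); if (lo < k) lo = k   # smallest feasible count >= k
            hi = (n < K) ? n : K                   # largest feasible count
            p = 0
            for (x = lo; x <= hi; x++)
                p += exp(lchoose(K, x) + lchoose(N - K, n - x) - lchoose(N, n))
            printf "%.3g\n", p
        }'
}

# Made-up numbers: 20000 background genes, 2000 of them with the motif,
# and 40 of the 200 genes of interest with the motif.
motif_enrichment_p 20000 2000 200 40
```

A small P-value only says the motif is over-represented relative to the chosen background; it does not replace the open-chromatin check mentioned above.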

9.9 years ago by mikhail.shugay 3.5k

Thanks for the additional information to take into account :) This is helpful.

However, I am not familiar with position weight matrices (PWMs). Is there a link where I can read about them and their relevance to motif finding?

updated 4.3 years ago by Ram 43k • written 9.9 years ago by datanerd ▴ 520

A PWM is just a representation of the frequency of each letter (A, T, G and C) at each position of the motif. For example, if a binding assay has shown that a transcription factor binds to the sequences

AAATTGG
AAATTGG
AAACCGG
AAACCGG

Then the PWM (shown here as raw counts per position) would be

A 4 4 4 0 0 0 0
T 0 0 0 2 2 0 0
G 0 0 0 0 0 4 4
C 0 0 0 2 2 0 0

So it tells you about the consensus of the motif binding site and the information content of each position in the consensus. See http://en.wikipedia.org/wiki/Position_weight_matrix. JASPAR, as already pointed out in an answer above, has a good collection of PWMs for various transcription factor motifs.
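
The count table above can be reproduced from the four example sites with a few lines of awk. This is only a sketch: it assumes one aligned site per line, all of the same length, and the function name site_counts is made up. Rows are printed in the A, T, G, C order used in this reply; MEME-format files expect A, C, G, T instead.

```shell
# Count each base at each position across aligned binding sites.
site_counts() {   # reads one site per line on stdin
    awk '{
        seq = toupper($1)
        if (length(seq) > w) w = length(seq)          # track motif width
        for (i = 1; i <= length(seq); i++)
            count[substr(seq, i, 1), i]++
    }
    END {
        n = split("A T G C", base, " ")
        for (b = 1; b <= n; b++) {
            row = base[b]
            for (i = 1; i <= w; i++)
                row = row " " (count[base[b], i] + 0)   # + 0 turns "never seen" into 0
            print row
        }
    }'
}

printf 'AAATTGG\nAAATTGG\nAAACCGG\nAAACCGG\n' | site_counts
```

Dividing each column by the number of sites turns the counts into the per-position frequencies that tools such as FIMO work from.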

updated 4.3 years ago by Ram 43k • written 9.9 years ago by mikhail.shugay 3.5k

Ram · Answer 5 · 2014-05-08

FIRE is an excellent program for identifying motifs (paper is here). It's very simple to use: just save a text file of your genes with a group number and it will do the rest, even naming the motif if it is already known. It also runs several statistical analyses to measure the robustness of the motif and its conservation in similar species. FIRE can identify motifs among a set of clusters or in continuous data. In your case, since you have just 200 genes, you will also need to include at least 800 more genes, either others you are interested in or all of the background genes, because FIRE will not accept files with fewer than 1000 identifiers. So you can list the genes you are interested in and, in the adjacent column (its title could be "group"), set the group to 0 (e.g. GENE_A and GENE_B in the example below), while the background would be 800+ more genes with the group set to 1 (e.g. GENE_C through GENE_H). FIRE will also tell you whether the motif is upstream of the gene, downstream of the gene, or in the UTR.

Also, just a heads up: you will need to create a login for FIRE, but it's free. If you wish to run it locally, I think that is available too, but the website is really easy to use. There are also programs on the same website to identify GO terms, which may be helpful for downstream analysis (see iPAGE).

FIRE_input_Example.txt

gene_name   group
GENE_A      0
GENE_B      0
GENE_C      1
GENE_D      1
GENE_E      1
GENE_F      1
GENE_G      1
GENE_H      1
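
A file like the one above could be assembled from two plain gene lists, for example as sketched here. The function name make_group_table and the demo gene names are made up, and tab-separated output is an assumption, so check what the FIRE website actually expects before uploading. The warning threshold of 1000 is simply the minimum mentioned in this answer.

```shell
# Build a two-column gene/group table: group 0 = genes of interest,
# group 1 = every other gene in the background list.
make_group_table() {   # usage: make_group_table TARGET_LIST BACKGROUND_LIST
    awk 'BEGIN { OFS = "\t"; print "gene_name", "group" }
         NF == 0 { next }                                        # skip blank lines
         FNR == NR { target[$1] = 1; print $1, 0; rows++; next } # first file: targets
         !($1 in target) { print $1, 1; rows++ }                 # second file: the rest
         END {
             if (rows < 1000)
                 print "warning: only " (rows + 0) " identifiers" > "/dev/stderr"
         }' "$1" "$2"
}

# Made-up gene names in the style of the example file.
targets=$(mktemp); background=$(mktemp)
printf 'GENE_A\nGENE_B\n' > "$targets"
printf 'GENE_A\nGENE_C\nGENE_D\n' > "$background"
make_group_table "$targets" "$background" 2>/dev/null
```

Genes that appear in both lists keep group 0, so the background list can simply be every gene in the annotation. The sketch assumes the target list is not empty.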