Enrichment analysis: Which tool I should trust?
3
1
Entering edit mode
7.2 years ago
agrandiadil ▴ 10

Apparently as everybody here, I am a beginner in bioinformatics but I have been entitled with the task of perform an Enrichment analysis. But I am facing troubles trying to obtain the same data among the different databases.

The main thing: We have many gene lists, and we want to know which ones are regulated by the transcription factor MEF2A or MEF2C. Also, we would like to know the enrichment percentage compared to the genome (Ex: If MEF2 regulates the 25% of the entire genome, and in our list it regulates the 50% of the genes, the enrichment is 100%).

So my approach has been: Find which genes in our list are regulated by MEF2A and MEF2C and find which genes are regulated by them in the entire genome.

I have been doing a bit of testing and searching databases and software, and my problem comes when the data among databases are considerably different. In the UCSC browser I can see that LDLR actually has MEF2 binding sites (but I cannot do the manual query for every gene in the list using the USCS browser) but is too hard to find in the other databases that LDLR has the TFBS.

List of different databases for prediction of TFBS with a little explanation of each database: https://abc.med.cornell.edu/education/introtobio/t-promoter.html

Done a search at researchgate obtaining similar results: https://www.researchgate.net/post/How_to_find_the_binding_sites_of_transcription_factors10

Checked this one: http://www.sabiosciences.com/chipqpcrsearch.php?gene=&species_id=0&factor=MEF-2A&ninfo=n&ngene=y&nfactor=n

Or Biostars: Determine Transcription Factors For Genes

And many others sources(enrichr, Harmonizome...), and software as iRegulon (in Cytoscape).

What can I do?

enrichment tf annotation • 4.3k views
ADD COMMENT
0
Entering edit mode

ENCODE resource seems a good starting point (http://hgdownload.cse.ucsc.edu/goldenpath/hg19/encodeDCC/wgEncodeRegTfbsClustered/; try wgEncodeRegTfbsClusteredV3.bed.gz ). This file contains TFBS in BED format. You can grep MEF2 sites and get nearest genes with an short script. Several MEF2 sites next to LDLR TSS are indicated in this file.

ADD REPLY
2
Entering edit mode
7.2 years ago

The only way you can get an idea of which genes are (directly/indirectly) regulated by your genes is by doing a knock-down/out experiment and compare to wildtype (using fx RNA-seq data).

There is no way of bioinformatically predict this (although we can guess - see below) because:

  • Most transcription factors does not bind to a (variant of a) consensus binding site even when the chromatin is open - they require co-factors
  • The regulation of genes are never straight forward - a factor binding close to one gene might actually regulate another gene.

That said we can extract a gene set which is of particular interest.

  • You can get an interaction network around your target genes via STRING which probably is the best database for protein-protein interactions.
  • You can identify a gene set by look at co-expression analysis of lage datasets such as the GTEx data.

I would suggest you use either of these approaches to identify a gene set and do the enrichment analysis on that instead of the entire genome.

Hope this helps

ADD COMMENT
2
Entering edit mode
7.2 years ago
EagleEye 7.6k

I assume you are looking for a Enrichment tools which uses transcription factor database and predict whether genes in your list are enriched for particular factor. If it is so, I would recommend using Enrichr which also performs enrichment analysis using TF databases.

Edited: If you are asking trustworthy tool for gene enrichment analysis (for Biological process and pathways), I will suggest you to select tools having current/updated database. In that case I recommend GeneSCF.

ADD COMMENT
0
Entering edit mode

Thank you Santhilal! I already done the Enrichr test but I repeated it in order to show you where I am finding troubles:

Among databases they give me different results: http://amp.pharm.mssm.edu/Enrichr/enrich?dataset=2mivb

I do understand that there are ChIP, motif and consensus (and others) but They differ a lot between them! There is no database that shows me a common TF, so that makes me wonder if the data obtained is really useful or not.

[In this case, I used one of the databases that I already know that some genes bind to MEF2 like LDLR which doesn't appear there]

ADD REPLY
1
Entering edit mode

Did you explore ChIPBase ? You might find it useful for your purpose.

Good luck.

ADD REPLY
1
Entering edit mode
7.2 years ago
ivivek_ngs ★ 5.2k

If you want to understand if your gene lists are under the regulation of any TF or not then you can try to run motif enrichment analysis to find which TFs will bind those genes with significance. Now once you extract the list of those upstream TFs that significantly bind your gene promoter sequences then try to see the genes of your interest in that list. However, you cannot be sure unless you will be able to perform a knockdown experiment or a TF ChiP for that specific TF on your datasets. Enrichr is a good start. Encode data gives you a flavor. Try tools like HOMER, Pscan or iregulon. They work based on validated TFs. However, these are in-silico predictions. You have to also be specific of what kind of data generated these gene lists. Anyways with motif enrichment, you can find TFBS and targets of upstream TF on your gene lists. If for your data you find hits for a significant gene of interest you looking for just download the list it's binding with. Different databases and tools have different versions or underlying TF models or let's say validated TFs that are used and even a different statistical framework. So the results will never be all the same. Good luck!

ADD COMMENT
0
Entering edit mode

I will try to go step by step. Starting with Pscan:

In Pscan I choose -900 +50 Jaspar 2016 (because I suppose that is the most accurate). And I obtain the results:

-The table I obtain at the right shows us each TF with a p-value. I assume that that p-value is "how significant the enrichment is?".

-The borderline to determine if it's really enriched or not is the Bonferroni p-value, is that correct?

-DAVID Gene Accession Conversion Tool gives me more than one RefSeq par Official gene symbol. I suppose that those are different splicings, paralogs and orthologs; I only take one RefSeq without any criteria, is that correct?

ADD REPLY
0
Entering edit mode

I would reply in pieces to your queries:

In Pscan I choose -900 +50 Jaspar 2016 (because I suppose that is the most accurate). And I obtain the results:

I usually use the default one that is provided by pscan settings, you can use your custom and the default and then come with a stats that give the best possible output and also pretty in line with the biological question. The idea is promoter sequences so the default is the most stringent but you can always relax if you deem it fit but if something is concerning your species or specific doubts about region specificity please ask the group that is maintaining the tool, they should be responsive enough.

-The table I obtain at the right shows us each TF with a p-value. I assume that that p-value is "how significant the enrichment is?".

yes the table on the right gives p-values of enrichment.

-The borderline to determine if it's really enriched or not is the Bonferroni p-value, is that correct?

Definitely, if Bonferroni p-value is significant then your association and significance are far more correct however not always it works so you can always take the top 10 from the pvalue list and work with them if Bonferroni p-value is not significant for the TF list that you get post pscan motif enrichment.

-DAVID Gene Accession Conversion Tool gives me more than one RefSeq par Official gene symbol. I suppose that those are different splicings, paralogs and orthologs; I only take one RefSeq without any criteria, is that correct?

Yes, this is something I would underline and say that when you use refseq you have to keep in mind that for each gene there will be multiple transcripts based on splicings, however, the promoter boundaries should not be that different if you look in UCSC browser. The best way to work with Pscan would be having a differential list of transcripts rather than genes to really capture which transcripts are actually associated as only one or few of the transcripts from the same gene might be impacted and not all.

Below information should be only considered if your gene lists are coming from RNA-Seq and differential expression analysis gene lists from RNASeq data.

When you are doing DE testing a gene level the summarization happens for the transcripts if you perform DE and output list is HGNC gene symbols. So I would in any case advice to discuss this piece of information in your lab whether they prefer actual transcripts that are impacted or gene to the transcript ID converted via DAVID. One workaround I would suggest is HOMER tool which works with gene list for MEA finding upstream TFs on your HGNC gene lists. If I correctly recall the assumption is the promoter boundaries are conserved across the different transcripts for the same gene, it is just that transcriptional readout or a hack based on DE might not be always same throughout all the transcripts for a particular gene especially when you might have differential splicing events. If your data and biological model do not address such events direct conversion should still be fine. However, a discussion with your group might be good in case there are events if differential spliced transcripts are mostly associated with your study then you might rather obtain the list of genes as refseq transcript ids and not HGNC gene ID.

Hope the above piece of information is suitable for you.

ADD REPLY

Login before adding your answer.

Traffic: 1285 users visited in the last hour
Help About
FAQ
Access RSS
API
Stats

Use of this site constitutes acceptance of our User Agreement and Privacy Policy.

Powered by the version 2.3.6