Question About Transfac Database Data Quality
12.8 years ago
Zhe Liu ▴ 60

Hi guys,

I have some questions about the accession numbers and data quality in the TRANSFAC database. I have read the following FAQ from the website, but I still have trouble understanding it.

Thus, V$OCT1_02 indicates the second matrix for the vertebrate Oct-1 factor. For matrices generated from TRANSFAC® SITE entries connected to a certain transcription factor, the ID ends not with a consecutive number but with an abbreviation of the lowest quality of the sites used to construct the matrix. E.g., V$CREB_Q2 is a matrix constructed from CREB binding sites of quality 2 or better. Finally, a matrix with an ID like V$AP1_C has been derived from a "consensus description" constructed with the aid of ConsIndex (Frech et al., Nucleic Acids Res. 21:1655-1664, 1993).
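To make the naming scheme in the FAQ excerpt concrete, here is a minimal Python sketch that splits an ID into its parts and labels the suffix type. It assumes IDs are written with an underscore before the suffix (as in V$AP1_C); the regular expressions and the function name `parse_matrix_id` are my own illustration inferred from the examples in this thread, not an official TRANSFAC grammar. Suffixes the excerpt does not explain (such as `B`) are simply reported as "other".

```python
import re

# Assumed layout: <taxon letter>$<factor>_<suffix>, e.g. V$AP1_C.
# Inferred from the examples in this thread, not from an official specification.
ID_RE = re.compile(
    r"^(?P<taxon>[A-Z])\$(?P<factor>[A-Z0-9]+)_(?P<suffix>[A-Z0-9]+(?:_\d+)?)$"
)
QUALITY_RE = re.compile(r"^Q(?P<quality>\d+)(?:_(?P<serial>\d+))?$")


def parse_matrix_id(matrix_id):
    """Split a TRANSFAC-style matrix ID and classify its suffix."""
    m = ID_RE.match(matrix_id)
    if m is None:
        raise ValueError(f"unrecognised matrix ID: {matrix_id!r}")
    suffix = m.group("suffix")
    info = {
        "taxon": m.group("taxon"),
        "factor": m.group("factor"),
        "kind": "other",      # suffix not covered by the FAQ excerpt
        "quality": None,
        "serial": None,
    }
    q = QUALITY_RE.match(suffix)
    if suffix.isdigit():
        # Consecutive number, e.g. _01 or _02
        info["kind"] = "serial"
        info["serial"] = int(suffix)
    elif q:
        # Lowest site quality used to build the matrix, e.g. _Q2
        info["kind"] = "quality"
        info["quality"] = int(q.group("quality"))
        if q.group("serial"):
            info["serial"] = int(q.group("serial"))
    elif suffix == "C":
        # Derived from a consensus description
        info["kind"] = "consensus"
    return info


for name in ["V$FREAC7_01", "V$NFAT_Q6", "V$GATA_C", "V$MYCMAX_B"]:
    print(name, parse_matrix_id(name))
```

Note that the sketch only labels the suffix; it does not rank matrices against each other.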

  1. Which indicates higher quality: an accession number ending in a larger number or a smaller one?
  2. If one ID ends with "C" and another ends with "6", how do I decide which one is of higher quality?
  3. Two IDs both carry a quality suffix, but one has an additional number, e.g. V$E2F1_Q3 and V$E2F1_Q3_01. Which one is more reliable?

Alternatively, could you point me to relevant information on these issues? Thanks a lot!

12.7 years ago
Zhe Liu ▴ 60

Dear all,

I asked the GSEA staff and received the email below, which might give you some hints. My original questions are interleaved with their answers.

Hi,

Thank you for your interest in MSigDB.

I wonder if I could get the transcription factor names for the gene sets. For example, there are several gene sets named

V$FREAC7_01 V$HFH8_01 V$NFAT_Q6 V$MYCMAX_B V$GATA_C TGANTCA_V$AP1_C

I am confused about how to convert them into gene symbols or any other ID.

In general, there is no one to one correspondence between a transcription factor binding sequence and transcription factor gene symbol. Consequently, you might only be able to change that in a very small number of gene sets, and even that might not be easy to do.

This is because:

a) a transcription factor may consist of several subunits and thus correspond to more than one gene symbol (e.g., AP1 is a heterodimer of proteins belonging to the c-Fos, c-Jun, ATF and JDP families);
b) a single transcription factor can bind to more than one sequence motif;
c) different transcription factors can bind to the same sequence motif.

You should keep in mind that these gene sets are built around DNA sequence motifs, so that each set consists of genes whose promoters contain sequences matching a particular motif. Thus, the TGANTCA_V$AP1_C gene set is made of genes whose promoters contain sequences matching the TGANTCA motif; this TGANTCA motif is similar to a transcription factor binding site documented in the TRANSFAC database v.7.4 as V$AP1_C.
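As an illustration of how such a set name breaks down, the following Python sketch separates the leading sequence motif from the TRANSFAC matrix ID. The function name `split_set_name` is hypothetical, and the rule (everything from the first `V$` onward is the TRANSFAC ID) is an assumption based only on the names quoted in this email.

```python
def split_set_name(set_name):
    """Return (motif, transfac_id) for a motif gene set name.

    Assumes the TRANSFAC ID, if present, starts at the first "V$".
    """
    idx = set_name.find("V$")
    if idx == -1:
        # No TRANSFAC ID present; return the name unchanged as the motif.
        return set_name, None
    # Text before "V$" is the sequence motif (may be empty).
    motif = set_name[:idx].rstrip("_") or None
    return motif, set_name[idx:]


print(split_set_name("TGANTCA_V$AP1_C"))
print(split_set_name("V$GATA_C"))
```

This only recovers the motif and the matrix ID; as explained above, it does not yield a gene symbol for the transcription factor.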

In addition, what's the meaning of V, Q, B, C in the gene set names?

These are TRANSFAC motif naming conventions described here:

http://www.gene-regulation.com/pub/databases/transfac/doc/matrix1SM.html

I am also wondering if I could download datasets that use gene symbols or Entrez Gene IDs.

You can download these from here:

GSEA home web site > Downloads

Gene set files using gene symbols contain "symbols" in their names. Gene set files using human Entrez Gene IDs contain "entrez" in their names.
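Once a file is downloaded, the gene sets can be loaded with a few lines of Python. This is a minimal sketch that assumes the file uses the tab-delimited GMT layout (set name, description, then one gene per column); the function name `read_gmt` and the example line with placeholder genes are made up for illustration.

```python
def read_gmt(lines):
    """Parse GMT-formatted lines into {set_name: [gene, ...]}.

    Assumed layout per line: name <TAB> description <TAB> gene1 <TAB> gene2 ...
    """
    gene_sets = {}
    for line in lines:
        fields = line.rstrip("\r\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        name, _description, *genes = fields
        gene_sets[name] = [g for g in genes if g]
    return gene_sets


# Placeholder line for illustration only; not real MSigDB content.
example = ["V$GATA_C\tplaceholder description\tGENE_A\tGENE_B\n"]
print(read_gmt(example))
```

For a real download, pass an open file handle instead of the list, for example `with open(path) as fh: sets = read_gmt(fh)`, where `path` points to a file whose name contains "symbols" or "entrez" as described above.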

Thanks all.

Zhe Liu
