Question: identifying which SNPs sit in TFBS (Yeast)
Asked 3.9 years ago by grins38 (United Kingdom):

I have a set of ~11k SNPs for Saccharomyces cerevisiae (baker's yeast), and I would like to identify which of them sit in transcription factor binding sites (TFBS) and, for those that do, get information on the relevant TFBS.

I've scoured this site and the internet and couldn't find a downloadable database giving the locations of all known/verified TFBS for yeast. I've studied the YEASTRACT website thoroughly but didn't find a way to download TFBS information for all the ORFs/genes in one go.

Moreover, when I tried getting the information manually through the YEASTRACT online search, I found the results confusing. For example, searching for TFs for ORF "YOL166W-A" returns a list of 4 TFs. Clicking on one of them, say Sok2p, takes you to another page which says, amongst other things, that the corresponding TFBS is "acMTGCAKg". What does that mean? (I know the ACTG alphabet, but what are the 'a', 'c', 'K' and 'g' symbols?) And what does this tell me about the actual location of the binding site? Do I have to BLAST the whole genome to identify it? (Shouldn't there be a database with this info for yeast already?) If so, how do I BLAST for symbols like 'K' and 'g'?

I have a background in stats but currently work on genetics applications, so extracting the relevant bioinformatics data is sometimes very confusing for me. Any help will be appreciated.

Answer by Alex Reynolds (Seattle, WA USA), 3.9 years ago:

I haven't worked with yeast before, but if you can get MEME-formatted position weight matrices (PWMs) for your transcription factors of interest (from YEASTRACT, for instance), you should be able to use them with a site prediction tool like FIMO. FIMO is part of the MEME Suite and calls putative binding sites across a FASTA-formatted genome, reporting matches whose p-value is at or below a threshold you specify.
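
On the consensus string from the question: something like acMTGCAKg is a compact summary of a motif written in IUPAC nucleotide codes, where M means A or C and K means G or T. Lowercase letters often mark less strongly conserved positions, though I am not certain that is YEASTRACT's convention. As a rough sketch (not a replacement for PWM scanning with FIMO), you could expand such a consensus into a regular expression and search a sequence with standard tools. The function names below are made up for illustration, and case is simply discarded:

```shell
# Hypothetical helper: expand IUPAC ambiguity codes into regex character classes.
# Case is discarded, which is an assumption about what lowercase means here.
iupac_to_regex() {
  printf '%s\n' "$1" | tr 'a-z' 'A-Z' | sed \
    -e 's/R/[AG]/g' -e 's/Y/[CT]/g' -e 's/S/[CG]/g' -e 's/W/[AT]/g' \
    -e 's/K/[GT]/g' -e 's/M/[AC]/g' -e 's/B/[CGT]/g' -e 's/D/[AGT]/g' \
    -e 's/H/[ACT]/g' -e 's/V/[ACG]/g' -e 's/N/[ACGT]/g'
}

# Hypothetical helper: reverse-complement an IUPAC string, for scanning the other strand.
revcomp() {
  printf '%s\n' "$1" | tr 'a-z' 'A-Z' | rev | tr 'ACGTRYKMBVDH' 'TGCAYRMKVBHD'
}

iupac_to_regex acMTGCAKg          # prints AC[AC]TGCA[GT]G
revcomp acMTGCAKg                 # prints CMTGCAKGT

# Search a toy sequence on the forward strand; prints ACATGCAGG
printf 'TTACATGCAGGTT\n' | grep -oE "$(iupac_to_regex acMTGCAKg)"
```

Note that a FASTA file wraps each sequence across many lines, so a real scan along these lines would first need every chromosome on a single line. FIMO takes care of that, and of scoring partial matches, for you.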

SNPs are often stored in a format called VCF, and FIMO can write its results in GFF format. In those cases, you can use vcf2bed and gff2bed from the BEDOPS toolkit to write the SNPs and the FIMO results to sorted BED files. For example:

$ vcf2bed < SNPs.vcf > SNPs.bed
$ gff2bed < TFBSs.gff > TFBSs.bed
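
The main thing these conversions take care of is the coordinate system: VCF positions are 1-based, while BED intervals are 0-based and half-open, and bedmap expects sorted input. To illustrate only that idea, here is a simplified stand-in written with awk and sort. The function name is hypothetical, it keeps just four columns, it treats every record as a single-base SNP, and it does not reproduce vcf2bed's actual output:

```shell
# Hypothetical helper: VCF on stdin -> minimal sorted BED (chrom, start, end, ID) on stdout
vcf_snps_to_bed() {
  awk 'BEGIN { FS = OFS = "\t" }
       /^#/ { next }                  # skip VCF header lines
       { print $1, $2 - 1, $2, $3 }   # 1-based POS becomes 0-based half-open [POS-1, POS)
      ' | LC_ALL=C sort -k1,1 -k2,2n -k3,3n
}

# Toy input with two SNPs given out of order
printf '#CHROM\tPOS\tID\tREF\tALT\nchrII\t500\tsnp2\tA\tG\nchrI\t1000\tsnp1\tC\tT\n' | vcf_snps_to_bed
```

Whichever route you take, the chromosome names in the two BED files have to match exactly (for example chrI in both), or nothing will overlap.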

Once you have your TFBSs and your SNPs in sorted BED files (however you produce them, whether via BEDOPS tools or anything else), you can use set operation tools like BEDOPS bedmap with the --echo and --echo-map-* operators to print SNPs that overlap ("map to") TFBSs, e.g.:

$ bedmap --echo --echo-map --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_overlapping_SNPs.bed

Each line of output is a TF binding site, followed by any SNPs that overlap that binding site by one or more bases.
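
To make "overlap by one or more bases" concrete: with half-open BED intervals, a SNP [s, e) overlaps a binding site [S, E) on the same chromosome when s < E and e > S. Below is a naive awk sketch of that test, useful only for sanity-checking a handful of records by hand. The function name is hypothetical, it compares every site against every SNP, and its ';'-joined output only approximates what bedmap prints:

```shell
# Hypothetical helper. Usage: naive_overlap TFBSs.bed SNPs.bed
# Prints each TFBS line, a tab, then the overlapping SNP lines joined by ';'.
naive_overlap() {
  awk 'BEGIN { FS = "\t" }
       # First file read (the SNPs): remember every record
       NR == FNR { chr[NR] = $1; s[NR] = $2 + 0; e[NR] = $3 + 0; rec[NR] = $0; n = NR; next }
       # Second file read (the TFBSs): test each stored SNP against this site
       {
         hits = ""
         for (i = 1; i <= n; i++)
           if (chr[i] == $1 && s[i] < $3 + 0 && e[i] > $2 + 0)
             hits = hits (hits == "" ? "" : ";") rec[i]
         print $0 "\t" hits
       }' "$2" "$1"
}
```

For example, a site at chrI 100-110 would pick up a SNP at chrI 105-106 but not one at chrI 110-111, because the site's end coordinate is exclusive.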

You can change the operators to get different information. If you just want a list of unique SNP IDs, for example:

$ bedmap --echo --echo-map-id-uniq --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_overlapping_SNP_IDs.bed

If you want to customize the overlap threshold between a TF binding site and a SNP, you can add overlap parameters. For instance, to require that a SNP fall entirely within a TFBS, you can add --fraction-map 1:

$ bedmap --echo --echo-map --fraction-map 1 --delim '\t' TFBSs.bed SNPs.bed > TFBSs_with_entirely_contained_SNPs.bed

Some binding sites may not overlap any SNPs, and you may not be interested in those. You could add --skip-unmapped to print only binding sites with SNP overlaps:

$ bedmap --echo --echo-map-id-uniq --delim '\t' --skip-unmapped TFBSs.bed SNPs.bed > Only_TFBSs_with_overlapping_SNP_IDs.bed
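
If what you ultimately want is a flat, de-duplicated list of the SNPs that fall in any binding site, the last tab-separated column of that output can be split apart. A small sketch, assuming the SNP IDs come last on each line and are joined by ';' (which I believe is bedmap's default for multiple mapped elements; the function name is hypothetical):

```shell
# Hypothetical helper: bedmap-style lines on stdin -> one unique SNP ID per line on stdout
unique_snp_ids() {
  awk -F'\t' '{ print $NF }' | tr ';' '\n' | LC_ALL=C sort -u
}

# Toy input: two binding sites sharing snp2; prints snp1 then snp2
printf 'chrI\t100\t110\ttfA\tsnp1;snp2\nchrI\t200\t210\ttfB\tsnp2\n' | unique_snp_ids
```

On the real output this would be run as unique_snp_ids < Only_TFBSs_with_overlapping_SNP_IDs.bed. Alternatively, passing SNPs.bed as the first (reference) file and TFBSs.bed as the second (map) file to bedmap gives one output line per SNP instead of one per binding site.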



Reply by grins38, 3.9 years ago:

Thank you for this! I shall try it in the coming week.