
7.1 years ago

bionovice

Hi guys,

I know this question has been asked several times but I have not managed to find anything that has worked for me.

I would appreciate any help in this regard!

Essentially, I need to check for over-representation of transcription factor binding sites (TFBS) in a set of sequences. I have each TF name and its frequency, and I was wondering how I could go about doing this in SPSS.

I have tried the chi-square test and Fisher's exact test, but the results don't seem to make much sense to me. I have over 150 TFs and their frequencies.

Cheers!

Did you also count the frequencies after scrambling each of the sequences in your data set?

What doesn't make sense with the results? What exactly are you comparing the frequencies against and how did you derive those values?

I derived the frequencies using Perl and EMBOSS routines.

I have three datasets: one control and two exposure datasets, with TF frequencies for all three.

I am new to SPSS and I don't really understand how to go about getting a result from it.

If you're not already familiar with SPSS, then don't bother using it. The standard tool within bioinformatics is R, which is free, and you can often find people here to help you with it. Having said that, you can likely get some help with either of these on Cross Validated (the statistics Stack Exchange site).

Hi Devon,

Thanks for your help.

Unfortunately it seems to me like my dataset is not perfect.

Apart from the frequency (hits per transcription factor motif), what other data would I require to check for over-representation?

You need a good estimate of the expected number of hits, i.e. what you should see if there is no over-representation. How best to get this will depend on how the regions you're using were derived to begin with. For example, if this is ChIP-seq data, then you can't just count the number of occurrences genome-wide, since your ability to map reads (and, therefore, call peaks) isn't uniform across the genome. This sort of thing ends up being somewhat non-trivial. Also, as seidel pointed out, checking a scrambled motif might be useful. Then you're at least comparing against the same base composition, which is a bit simpler.
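To make the role of the expected count concrete, here is a minimal sketch in Python (standing in for R purely for illustration) of a one-sided Fisher's exact test for a single TF. It assumes you can count, for each motif, how many sequences contain at least one hit in your test set and in a background set (control or scrambled). The function name, its arguments and the example numbers are all made up for illustration; they are not from the thread.

```python
from math import comb

def fisher_enrichment_p(hits_fg, n_fg, hits_bg, n_bg):
    """One-sided Fisher's exact test for enrichment of one motif.

    hits_fg / n_fg: sequences with >= 1 hit / total sequences, test set
    hits_bg / n_bg: the same two counts for the background set
    Returns P(X >= hits_fg) under the hypergeometric null.
    """
    total = n_fg + n_bg               # all sequences pooled
    total_hits = hits_fg + hits_bg    # all sequences carrying the motif
    upper = min(n_fg, total_hits)     # largest possible foreground count
    tail = sum(
        comb(total_hits, k) * comb(total - total_hits, n_fg - k)
        for k in range(hits_fg, upper + 1)
    )
    return tail / comb(total, n_fg)

# Hypothetical counts for one TF: 40 of 200 exposure sequences have a hit,
# versus 20 of 200 background sequences.
p_value = fisher_enrichment_p(40, 200, 20, 200)
print(f"one-sided p = {p_value:.4g}")
```

With over 150 TFs you would run this once per TF and then correct the resulting p-values for multiple testing. Note that the test is only as good as the background: if the background regions differ from the test regions in length or base composition, the p-value will reflect that difference too.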

I think the scrambling is a good idea, thanks guys!

How exactly would I go about this? Does it just involve randomising the bases in my sequences, running tfscan on them, counting the number of hits, and then using that as the expected value in my Fisher's test?

I'm sorry, I'm new to the field and have been thrown in at the deep end, hence the confusion.
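One way the scrambling procedure described above could be sketched in Python is shown below. This is an illustration under stated assumptions, not a definitive recipe: `count_hits` is a hypothetical stand-in that counts regular-expression matches (in practice you would replace it with a call to your real scanner, such as tfscan), and the shuffle preserves only single-base composition, not dinucleotide content. Rather than feeding one scrambled count into Fisher's test, the sketch repeats the shuffle many times and reports an empirical p-value.

```python
import random
import re

def count_hits(seq, motif_regex):
    """Stand-in for a real motif scanner: count overlapping regex matches."""
    return len(re.findall(f"(?=({motif_regex}))", seq))

def shuffled(seq, rng):
    """Return seq with its bases in random order (same base composition)."""
    bases = list(seq)
    rng.shuffle(bases)
    return "".join(bases)

def empirical_p(seqs, motif_regex, n_shuffles=1000, seed=0):
    """Compare observed hits with hits in repeatedly shuffled sequences.

    Returns (observed, mean shuffled count, empirical one-sided p-value).
    """
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    observed = sum(count_hits(s, motif_regex) for s in seqs)
    null_counts = [
        sum(count_hits(shuffled(s, rng), motif_regex) for s in seqs)
        for _ in range(n_shuffles)
    ]
    as_extreme = sum(c >= observed for c in null_counts)
    # The +1 terms keep the estimate from ever being exactly zero.
    p = (as_extreme + 1) / (n_shuffles + 1)
    expected = sum(null_counts) / n_shuffles
    return observed, expected, p

# Toy sequences and a toy motif, made up for illustration only.
toy_seqs = ["GCGTATAAAGGCTATAAACC", "TTGACGTCATATAAAGCGCG"]
obs, exp, p = empirical_p(toy_seqs, "TATAAA", n_shuffles=200, seed=1)
print(f"observed={obs}, expected under shuffling={exp:.2f}, p={p:.3f}")
```

The mean shuffled count plays the role of the expected value asked about above. A single shuffle gives a noisy estimate, which is why the sketch averages over many.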