I'm an undergraduate who's just getting introduced to bioinformatics so please bear with me.
I'm trying to find out whether the number and length of invariant regions (IRs) in Anopheles gambiae genes correlate with gene function, in the hope of finding some interesting links. I have a list of all the genes in An. gambiae with all their GO terms. I also have a list of the genes that contain IRs (far fewer genes than the complete genome), with information on IR length, sequence etc., and their GO terms as well. I want to cluster the genes in both lists based on their GO terms and find which "clusters" have longer IRs or a greater number of them, while making sure this is not just because a functional group is more represented within the genome (hence the need to look at the whole genome as well). Basically I'm just trying to make the stats sound.
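To make concrete what I mean by "not just because a functional group is more represented within the genome", here is a minimal sketch of the kind of test I think I need: a one-sided hypergeometric (over-representation) test for a single GO term. I may well be wrong that this is the right test. The function name and all the numbers are made up for illustration and are not my real data.

```python
from math import comb

def enrichment_p_value(genome_total, genome_with_term, ir_total, ir_with_term):
    """Probability of seeing at least `ir_with_term` genes carrying a GO term
    among `ir_total` IR genes, if the IR genes were drawn at random from a
    genome of `genome_total` genes, `genome_with_term` of which carry the term.
    """
    # The count cannot exceed either the term size or the IR list size.
    max_k = min(genome_with_term, ir_total)
    # Number of ways to draw the IR gene list from the whole genome.
    denominator = comb(genome_total, ir_total)
    # Sum the upper tail: k genes with the term, the rest without it.
    numerator = sum(
        comb(genome_with_term, k) * comb(genome_total - genome_with_term, ir_total - k)
        for k in range(ir_with_term, max_k + 1)
    )
    return numerator / denominator

# Made-up counts, purely for illustration (not real An. gambiae numbers).
p = enrichment_p_value(genome_total=1000, genome_with_term=50,
                       ir_total=100, ir_with_term=12)
print(f"p-value for this hypothetical term: {p:.4g}")
```

As I understand it, running this once per GO term would mean testing many terms at the same time, so the p-values would presumably need a multiple-testing correction before I could trust them.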
I have never used R, although I have downloaded it. I have tried DAVID, but I don't really understand the output or how to use it in my analysis. I've tried to use GoSlim to get general GO terms per gene, but I can't get it to work because my files are too big. I've got GiTools running a hierarchical analysis, but it's taking a long time and I'm not sure I set it up correctly either.
Is there any other way to get a basic, broad description of a gene's function without using GO terms? The main problem I'm having is that every gene has multiple terms, so I can't put each gene into a single category for analysis. That is why I turned to clustering, and why I'm getting very confused and frustrated!
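In case it helps explain the multiple-terms problem: one idea, which I'm not sure is valid, would be to stop forcing each gene into a single category and instead flip the table around, so that each GO term gets its own set of genes and a gene is simply counted under every term it has. A minimal sketch, with hypothetical gene and term names rather than my real data:

```python
from collections import defaultdict

def genes_per_term(gene_to_terms):
    """Turn a gene -> list of GO terms mapping into term -> set of genes."""
    term_to_genes = defaultdict(set)
    for gene, terms in gene_to_terms.items():
        for term in terms:
            # A gene with several terms is added under each of them.
            term_to_genes[term].add(gene)
    return dict(term_to_genes)

# Hypothetical example: gene_1 has two terms, gene_2 and gene_3 have one each.
example = {
    "gene_1": ["term_A", "term_B"],
    "gene_2": ["term_A"],
    "gene_3": ["term_C"],
}

for term, genes in sorted(genes_per_term(example).items()):
    print(term, len(genes), sorted(genes))
```

Doing this for both the whole-genome list and the IR list would give per-term gene counts that can be compared, although the terms would overlap and so would not be independent of each other.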
Any help would be greatly appreciated. I'm doing an internship this summer and, as you may be able to tell, I am out of my depth! An explanation put as simply as possible would be ideal. Thank you very much.