I am a data-mining newbie and need some help with a high-dimensional data set (a subset is shown below). It has many dimensions and several thousand rows. For my project on online exams, I need to find the best way to structure this data so that I can cluster similar class-rooms. I would also like to know whether any similarity metrics can be calculated from this data, and whether the data needs normalization, since the values in each row are on a different scale from the others.
In the following matrix, M, the ROWS represent the class-rooms, numbered from Class-Room 1 to Class-Room N (N may be 100, 1000, or more).
The COLUMNS are the attributes I have to test (roughly 70 to 100 country names, based on the nationalities of the online students registered on our system).
Each cell holds the number of students in that class-room who have that nationality, out of the total number of students in the class-room.
The following is part of my data-set matrix:
Class-Room number     USA   UK   Germany   ...etc   Australia   Total Students
Class-Room 1          5     10   0         ..       16          50
Class-Room 2          3     0    13        ..       0           60
Class-Room 3          0     24   2         ..       14          78
.............etc      ..    ..   ..        ..       ..          33
Class-Room 100,000    18    12   11        ..       0           68
- Class-Room 1 has 5 in the USA cell and 10 in the UK cell, etc., which means there are 5 students with USA nationality and 10 with UK nationality out of the total number of students in that class-room (50).
- Class-Room 2 has 3 in the USA cell and 13 in the Germany cell, etc., out of a total of 60 students in Class-Room 2, and so on for the remaining fields.
- A value of 0 means there are no students from that country.
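
To make the layout concrete, here is a minimal sketch of how the visible part of the matrix could be held in plain Python. It uses only the three complete sample rows and the four country columns shown above; the columns hidden behind "...etc" are left out, so the visible counts do not add up to the totals. The names countries, counts, totals, count_for and hidden_students are my own, chosen only for illustration.

```python
# Sample rows copied from the table above. Only the country columns that are
# visible in the excerpt are included; the "...etc" columns are omitted.
countries = ["USA", "UK", "Germany", "Australia"]

counts = {
    "Class-Room 1": [5, 10, 0, 16],
    "Class-Room 2": [3, 0, 13, 0],
    "Class-Room 3": [0, 24, 2, 14],
}

# Total number of students per class-room (last column of the table).
totals = {"Class-Room 1": 50, "Class-Room 2": 60, "Class-Room 3": 78}


def count_for(room, country):
    """Return how many students in `room` have the given nationality."""
    return counts[room][countries.index(country)]


def hidden_students(room):
    """Return how many students fall in the columns not shown in the excerpt."""
    return totals[room] - sum(counts[room])


if __name__ == "__main__":
    for room in counts:
        print(room, counts[room], "total:", totals[room],
              "not shown:", hidden_students(room))
```

In the real data each row would have 70 to 100 entries, most of them probably 0.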
My goal is to cluster similar class-rooms based on their attribute values, so that the result looks something like this.
For example:
Cluster 1 = Class-Room 1, Class-Room 11, etc. (this indicates Cluster 1 is North American countries)
Cluster 2 = Class-Room 2, Class-Room 6, etc. (this indicates Cluster 2 is European countries)
Cluster N = etc. (Middle Eastern countries)
My questions are:
Q1: How do I select a proper clustering algorithm for my data, and is my data suitable for clustering at all?
Q2: Is it necessary to normalize (scale) the values to a particular range, given that each class-room has a different number of students from the others? If yes, why?
Q3: What is the best clustering algorithm for grouping similar class-rooms with this type of data matrix? Any suggestions are welcome.
Q4: Is there any other representation of the data that would make it easier to cluster similar class-rooms?
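
To make Q2 and the similarity question more concrete, here is a minimal sketch of the kind of normalization I have in mind: dividing each count by the total number of students in that class-room, then comparing two rows with cosine similarity. Both choices are assumptions on my part, not something I know to be correct for this data, and the function names are only illustrative. The sample values are the visible columns of the first three rows of my table.

```python
import math

# Visible columns only (USA, UK, Germany, Australia); "...etc" columns omitted.
counts = {
    "Class-Room 1": [5, 10, 0, 16],
    "Class-Room 2": [3, 0, 13, 0],
    "Class-Room 3": [0, 24, 2, 14],
}
totals = {"Class-Room 1": 50, "Class-Room 2": 60, "Class-Room 3": 78}


def proportions(room):
    """Scale a row so each value is the share of the class-room's students."""
    total = totals[room]
    return [c / total for c in counts[room]]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either one is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


if __name__ == "__main__":
    p1 = proportions("Class-Room 1")
    p2 = proportions("Class-Room 2")
    p3 = proportions("Class-Room 3")
    print("1 vs 2:", round(cosine_similarity(p1, p2), 3))
    print("1 vs 3:", round(cosine_similarity(p1, p3), 3))
```

On these sample rows, Class-Room 1 comes out closer to Class-Room 3 than to Class-Room 2, which matches what I would expect by eye. One thing I noticed is that cosine similarity does not depend on the length of the vectors, so it gives the same value on raw counts and on proportions; a distance such as Euclidean would change with the scaling, which is part of why I am asking Q2.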