Off topic: Appropriate multi-dimensional data clustering
8.3 years ago

I am a data-mining newbie and need some help with a high-dimensional data set (a subset is shown below). It has many dimensions and several thousand rows. For my project work on online exams, I need to find the best way to structure this data set so that similar class-rooms can be clustered. I would also like to know whether any similarity metrics can be calculated from this data, and whether the data needs any normalization, since the values differ from one class-room to another.

In my case, in the following matrix M, the ROWS represent the class-rooms, numbered Class-Room 1 ... Class-Room N (100, 1000, ...).

The COLUMNS are the attributes I have to test (perhaps 70 to 100 country names, based on the nationalities of the online students registered on our system).

Each cell holds the number of students of that nationality in that class-room, out of the class-room's total number of students.

The following is a part of my data-set matrix:

Class-Room number     USA     UK     Germany   ...etc         Australia    Total students

Class-Room 1          5       10     0         ..             16           50
Class-Room 2          3       0      13        ..             0            60
Class-Room 3          0       24     2         ..             14           78
.............etc      ..      ..     ..        ..             ..           33
Class-Room 100,000    18      12     11        ..             0            68
  • Class-Room 1 has 5 in the USA cell and 10 in the UK cell, etc., which means there are 5 students from the USA and 10 students from the UK (UK nationality) out of the total number of students in that class-room (50).
  • Class-Room 2 has 3 in the USA cell and 0 in the UK cell, etc., out of a total of 60 students in Class-Room 2, and so on for the remaining fields.
  • A value of 0 means there are no students from that country.
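A minimal sketch of how the rows described above could be held in code and turned into per-class-room fractions, so that class-rooms of different sizes become comparable. Only the columns visible in the table are used (the ones hidden behind "...etc" are left out, so the shown counts do not add up to the total), and the names `COUNTRIES`, `ROOMS`, and `to_fractions` are hypothetical, chosen only for illustration.

```python
# Visible columns only; the columns elided as "...etc" are not reconstructed.
COUNTRIES = ["USA", "UK", "Germany", "Australia"]

# Keys are class-room numbers; values copied from the visible table rows.
ROOMS = {
    1: {"counts": [5, 10, 0, 16], "total": 50},
    2: {"counts": [3, 0, 13, 0], "total": 60},
    3: {"counts": [0, 24, 2, 14], "total": 78},
}


def to_fractions(counts, total):
    """Divide each nationality count by the class-room's total student count."""
    if total <= 0:
        raise ValueError("total number of students must be positive")
    return [count / total for count in counts]


# Fraction of each class-room's students per visible country.
FRACTIONS = {
    room: to_fractions(row["counts"], row["total"]) for room, row in ROOMS.items()
}

for room, fractions in FRACTIONS.items():
    labelled = dict(zip(COUNTRIES, (round(f, 3) for f in fractions)))
    print(f"Class-room {room}: {labelled}")
```

Dividing by the class-room total is only one possible scaling; whether it is the right one depends on whether class size itself should influence the clustering.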

My questions are about how to cluster similar class-rooms based on their attribute values. I hope the result would look like this.

For example:

Cluster1 = Class-Room 1, Class-Room 11, etc. This indicates that Cluster1 is North American countries.

Cluster2 = Class-Room 2, Class-Room 6, etc. This indicates that Cluster2 is European countries.

ClusterN = etc. ... Middle Eastern countries.
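As a rough illustration of what "similar" could mean for the grouping sketched above, here is one candidate similarity metric, cosine similarity, computed between rows of the matrix. This is an assumption for illustration, not a recommendation of the best metric; the function name `cosine_similarity` and the variable names are hypothetical, and only the four visible columns (USA, UK, Germany, Australia) are used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length count vectors.

    Returns 0.0 if either vector is all zeros (the angle is undefined there).
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Visible counts (USA, UK, Germany, Australia) for the first three class-rooms.
room_1 = [5, 10, 0, 16]
room_2 = [3, 0, 13, 0]
room_3 = [0, 24, 2, 14]

SIM_1_2 = cosine_similarity(room_1, room_2)
SIM_1_3 = cosine_similarity(room_1, room_3)

print("room 1 vs room 2:", round(SIM_1_2, 3))
print("room 1 vs room 3:", round(SIM_1_3, 3))
```

On these visible columns, class-room 1 comes out closer to class-room 3 than to class-room 2, since both have many UK and Australian students. Cosine similarity is unchanged when a row is multiplied by a positive constant, so it does not depend on how many students a class-room has.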

Q1: How do I select a proper clustering algorithm for my data (and can my data be clustered at all)?

Q2: Is it necessary to normalize (scale) the values to a particular range, given that each class-room has a different number of students from the others? If yes, why?

Q3: What is the best clustering algorithm for grouping similar class-rooms in this type of data matrix? Any suggestions, please?

Q4: Is there any other representation that would make it easier to cluster similar class-rooms?

cluster-analysis data-analysis clustering • 1.3k views