Number crunching through clustering

Scatter

Separating your data into buckets is useful in many problems, especially fraud detection. How do you mathematically 'cluster' your data? One statistical method is k-means clustering.

Without delving too far into the statistics, here is a spreadsheet you can use to do this with your own data.
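For readers who would rather see the idea in code, here is a minimal sketch of k-means in plain Python. It is not the spreadsheet's own logic, only the textbook procedure: pick starting centres, assign each point to its nearest centre, move each centre to the average of its points, and repeat until nothing moves. The function name `kmeans`, its parameters, and the sample pairs (age, sick leaves in a year) are all made up for illustration.

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Group 2-D points into k buckets; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Assumption: start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = [0] * len(points)

    def nearest(point):
        # Index of the centroid with the smallest squared distance.
        return min(
            range(k),
            key=lambda j: (point[0] - centroids[j][0]) ** 2
            + (point[1] - centroids[j][1]) ** 2,
        )

    for _ in range(iterations):
        # Assignment step: every point joins its nearest centroid.
        labels = [nearest(p) for p in points]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append((
                    sum(p[0] for p in members) / len(members),
                    sum(p[1] for p in members) / len(members),
                ))
            else:
                # An empty bucket keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # nothing moved, so the buckets are stable
        centroids = new_centroids
    return labels, centroids

# Illustrative data only: (age, sick leaves applied for in a year).
sample = [(25, 2), (27, 3), (30, 1), (52, 10), (55, 12), (58, 11)]
labels, centroids = kmeans(sample, k=2)
```

The bucket numbers themselves are arbitrary; what matters is which points share a bucket. In practice the two variables often sit on very different scales, so rescaling them before clustering is usually worth considering.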

This sheet accepts pairs of two variables, for example each person's
age versus the number of sick leaves they applied for, across a group
in one year. It then sorts this data into buckets; you specify the
number of buckets. Once the sheet has assigned each point to a bucket, you can analyse the result for problems. The following cases deserve further attention:

  1. too many data points falling in a single group
  2. only one or two data points in a single group
  3. any point that does not belong in the group it is in (this is only
    possible if the data has a subjective background)
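The first two checks can be automated once every point has a bucket number. The sketch below is one possible way to do it; the function name `flag_buckets` and the thresholds (`max_share`, `min_count`) are assumptions chosen for illustration, so tune them to your own data. The third check needs human judgement and is not covered.

```python
from collections import Counter

def flag_buckets(labels, k, max_share=0.5, min_count=2):
    """Flag buckets that look crowded or nearly empty.

    labels    -- bucket number (0..k-1) for each data point
    max_share -- assumed cut-off: a bucket holding more than this
                 fraction of all points is called "crowded"
    min_count -- assumed cut-off: a bucket with this many points or
                 fewer (but at least one) is called "sparse"
    """
    counts = Counter(labels)
    total = len(labels)
    flags = {}
    for bucket in range(k):
        n = counts.get(bucket, 0)
        if 0 < n <= min_count:
            flags[bucket] = "sparse"
        elif total and n / total > max_share:
            flags[bucket] = "crowded"
    return flags

# Illustrative labels: bucket 0 has 7 points, bucket 1 has 3, bucket 2 has 1.
example_labels = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 2]
flags = flag_buckets(example_labels, k=3)
# flags == {0: "crowded", 2: "sparse"}
```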

This method can be used to sample data for further analysis wherever
there is simply too much data to analyse in full. In our example, it
can be used to isolate people who may be feigning sickness to take
leave. In a test of people with different levels of capability, it can
be used to grade scores, and so on. It may also help with the needle-in-a-haystack problem.


Licensing and information about the blog available here.