## Number crunching through clustering

Separating your data into buckets is useful in many problems, especially fraud detection. How do you mathematically ‘cluster’ your data? One statistical way is **K-means clustering**.

Without delving into too much statistics, here is a spreadsheet you can use to do this for your own data.

This sheet accepts pairs of two variables, for example age versus the number of sick leaves applied for by the members of a group in a year. It then sorts this data into buckets, with the number of buckets specified by you. Once the sheet gives you the bucket classification, you can analyse it for problems. The following cases are worth further attention:

- too many datapoints falling in a single group
- only one or two datapoints in a single group
- any point that does not belong to the group it's in (this is only possible if the data has a subjective background)
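
For readers who prefer code to a spreadsheet, here is a minimal sketch of the same idea in Python. It is not the spreadsheet's own logic: it assumes the standard form of K-means (Lloyd's algorithm) on two-variable pairs, and the names `kmeans` and `flag_clusters`, the thresholds `crowded_share` and `sparse_count`, and the sample numbers are all invented for illustration.

```python
import random
from collections import Counter


def squared_distance(p, q):
    """Squared straight-line distance between two (x, y) pairs."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm for 2-D points; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Start from k data points picked at random.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each centroid to the mean of its points.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # an empty bucket keeps its old centroid
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels, centroids


def flag_clusters(labels, k, crowded_share=0.6, sparse_count=2):
    """Flag buckets holding too many points, or only one or two (assumed thresholds)."""
    sizes = Counter(labels)
    flags = {}
    for c in range(k):
        size = sizes.get(c, 0)
        if size > crowded_share * len(labels):
            flags[c] = "crowded"
        elif size <= sparse_count:
            flags[c] = "sparse"
    return flags


# Invented (age, sick leaves in a year) pairs, purely for illustration.
staff = [(25, 2), (27, 3), (26, 2), (24, 1), (45, 5), (47, 6), (46, 4), (44, 5), (30, 20)]
labels, centroids = kmeans(staff, k=3)
print(labels, flag_clusters(labels, k=3))
```

The third warning sign, a point that does not belong to the group it is in, still needs a human look at the flagged buckets; the code only counts bucket sizes.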

This method can be used to sample data for further analysis wherever there is simply too much data to analyse. In our example it can be used to isolate people who may be feigning sickness to take leave. In a test taken by people with different levels of capability, it can be used to grade the scores. It may also be used to solve the needle-in-a-haystack problem.
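
One way to read "sampling" here is to pull a few rows from every bucket, so that each kind of behaviour gets reviewed. The sketch below assumes you already have one bucket label per row; the function name `sample_per_cluster` and its parameters are hypothetical.

```python
import random
from collections import defaultdict


def sample_per_cluster(labels, per_cluster=2, seed=0):
    """Pick up to `per_cluster` row indices from each bucket for manual review."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for index, label in enumerate(labels):
        by_cluster[label].append(index)
    picked = {}
    for label, indices in by_cluster.items():
        # Small buckets are taken whole; larger ones are sampled at random.
        count = min(per_cluster, len(indices))
        picked[label] = sorted(rng.sample(indices, count))
    return picked


# Bucket labels for six rows (invented for illustration).
review = sample_per_cluster([0, 0, 0, 1, 2, 2], per_cluster=2)
print(review)
```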

―――――――――――X――――――――――
