Number crunching through clustering


Separating your data into buckets is useful in many problems, especially fraud detection. How do you mathematically ‘cluster’ your data? One statistical way is k-means clustering.

Without delving into too much statistics, here is a spreadsheet you can use to do this for your own data.

This sheet accepts pairs of two variables, for example age versus the number of sick leaves applied for in a year by each person in a group. It then sorts this data into buckets, the number of buckets being specified by you. Once the sheet gives you the bucket classification, you can analyse it for problems. The following cases are worth further attention:

  1. too many datapoints falling in a single group
  2. only one or two datapoints in a single group
  3. any point that does not belong to the group it's in (this is only
    possible if the data has a subjective background)
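
If you would rather script this than use a spreadsheet, the bucketing and the first two checks above can be sketched in a few lines of Python. This is only a rough sketch of plain k-means, not a description of what the sheet does internally: the sample data is made up, and the function names (seed_centres, kmeans, flag_buckets), the farthest-point seeding and the thresholds crowded_share and sparse_count are all assumptions made for illustration.

```python
from collections import Counter
from math import dist


def seed_centres(points, k):
    """Pick k starting centres: the first point, then repeatedly the point
    farthest from the centres chosen so far (an assumed, repeatable seeding)."""
    centres = [points[0]]
    while len(centres) < k:
        centres.append(max(points, key=lambda p: min(dist(p, c) for c in centres)))
    return centres


def kmeans(points, k, iterations=100):
    """Plain k-means on (x, y) pairs; returns (bucket label per point, centres)."""
    centres = seed_centres(points, k)
    labels = []
    for _ in range(iterations):
        # Assign every point to its nearest centre.
        labels = [min(range(k), key=lambda c: dist(p, centres[c])) for p in points]
        # Move each centre to the mean of the points assigned to it.
        new_centres = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                new_centres.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centres.append(centres[c])  # leave an empty bucket's centre alone
        if new_centres == centres:  # nothing moved, so we are done
            break
        centres = new_centres
    return labels, centres


def flag_buckets(labels, k, crowded_share=0.6, sparse_count=2):
    """Flag buckets holding too large a share of the data, or only one or two points."""
    sizes = Counter(labels)
    flags = {}
    for bucket in range(k):
        size = sizes.get(bucket, 0)
        if size > crowded_share * len(labels):
            flags[bucket] = "too many datapoints"
        elif 0 < size <= sparse_count:
            flags[bucket] = "only one or two datapoints"
    return flags


# Made-up (age, sick leaves in a year) pairs, purely for illustration.
data = [(25, 3), (27, 4), (26, 2), (28, 3),
        (45, 6), (47, 7), (46, 5), (44, 6),
        (30, 25)]

labels, centres = kmeans(data, 3)
flags = flag_buckets(labels, 3)
print(labels)
print(flags)
```

On this made-up data the lone (30, 25) point ends up in a bucket of its own, which is the kind of case worth a second look. The third check still needs a human eye. If your two variables are on very different scales, you would probably want to rescale them first, since k-means works on plain distances.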

This method can be used to sample data for further analysis wherever there is simply too much data to analyse. In our example it can be used to isolate people who may be feigning sickness to take leave. In a test taken by people with different levels of capability, it can be used to grade scores. It may also be used to solve the needle in haystack problem.


Needle in Haystack Part II – the solution


Okay, so we have faced the problem. What do we do?

It may be prudent to obtain buy-in from the person who will be looking for the needle in the haystack. Especially if this is a boring task, you will need to explain to the person why it is necessary to find the needle and how it fits in with the vision/mission.

If this is a repetitive scenario, you may want to track how many times each resource has failed to find the needle and penalize the one at the top of the list. If this specific scenario is not repetitive, you may want to group it with similar scenarios so that the group as a whole recurs often enough to track. This can also be subjected to statistical analysis (more on the statistical analysis in another post here).
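
The tally itself is simple to keep. Here is a minimal sketch in Python, assuming each search is logged as a (person, found) pair; the names and the log are invented for illustration.

```python
from collections import Counter

# Hypothetical log: one (person, needle_found) entry per completed search.
search_log = [
    ("Asha", False), ("Ben", True), ("Asha", False),
    ("Chen", False), ("Ben", True), ("Asha", True),
]

# Count only the searches that came back empty-handed.
misses = Counter(person for person, found in search_log if not found)

# The person at the top of the list is the one with the most misses.
top = misses.most_common(1)[0]
print(top)
```

A count like this says nothing about whether a needle was actually there each time, so it is only a starting point for a conversation.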

One way, as I suggested earlier, is to create a verification procedure that is shorter than looking for the needle in the haystack itself. Ask a peer to carry out the verification – call him the “supervisor” (he will be better motivated this way).

Another way to handle this problem is through the usual reward/punishment strategy. Either offer a reward for finding it, or impose a punishment for not finding it. The reward does not have to be big, and neither does the punishment. Token punishments are enough. For example, a task required everyone to fill in a template, and one person to consolidate the templates each Friday. People needed reminders, so we decided that whoever defaulted the highest number of times would get the task of consolidating for the next few weeks. Then it would move on to the next defaulter.


Needle in Haystack Part I – the problem


One of the problems I have faced repeatedly, not just in project management but in a wide variety of scenarios (even at home), is what I call the ‘Needle in Haystack’ problem. Say I am the manager of a project where the task is to find a needle in a haystack. I assign the task to Mr. X. He comes back after three days and tells me there is no needle in there. I am confronted with three possibilities:

  1. He actually spent three days actively looking for the needle and was unable to find it. There is indeed no needle in there.
  2. He actually spent three days actively looking for the needle and was unable to find it. There is a needle there, only he was unable to find it (his methods may not be very efficient, for example).
  3. Mr. X went on a movie-watching spree 🙂

The real problem here, however, is that in order to find out which of the scenarios is true, I will have to look for the needle myself, and maybe spend three days doing it. There is no benefit obtained through delegation.

One course of action, obviously, is to seek evidence from Mr. X: a bus ticket from the office to the haystack, for example. Another is to find a way to verify whether there is a needle or not in, say, 3 hours. In most cases one of these solutions is possible, but it is hard to find. In the next few postings on this topic I will share case studies to give you some ideas. There is no magic wand, though; everyone needs to invest thought.

All postings in this series will be accessible through this link:


Licensing and information about the blog available here.