Finding the “Needle in the Haystack” is an often used analogy in discussions regarding Big Data and analytics. The problem is that having more data (A bigger Haystack) does not always mean you will find more needles. In fact it may make it harder to find those data nuggets you are searching for. I came across some great discussions/posts in the comments section of an article on “The Register” the other day. The story is “GCHQ mass spying will ‘cost lives in Britain,’ warns ex-NSA tech chief”
Some of the analogies and references used in the comments included
- Setting fire to the haystack and using a magnet to pick up the needles
- One problem was that the heat may damage the needle (corrupt your data)
- The terrorists could change to Carbon Fibre Needles so your magnet will not find them!
- A far more distasteful analogy was that of a septic tank where you are looking for a chunk of chocolate but to find it you have to sample everything. Yuck!
In summary I guess the idea is to add as much “relevant” data as you can into your Data Lake but try not to pollute it with pointless additional data that makes it harder for you to find what you search for. The challenge is then, what is relevant data!