A data lake is a place where we can store all of an organisation's data without the restrictions imposed by more traditional data warehouse architectures. Instead of being loaded into a highly structured database, the data is stored in the lake in its original format. This removes the expense of the upfront transformations associated with the warehouse approach, and the data is immediately available for everyone in the organisation to use (assuming they have access, obviously). But the big danger is that the data lake becomes a data swamp!
It’s a kind of Magic
Some data lake advocates make it sound like the proverbial silver bullet for the decades-old problems of delivering analytics. They believe there is no need to define a data structure upfront in the lake. Instead, they take the view that technology allows this responsibility to be pushed to the extraction stage (i.e. the business). This means that:
- The data can be made available by IT at a dramatically lower cost.
- The challenge of data silos is solved, as ALL and ANY data, however disparate, can be stored in the lake without the need to understand what it is or how it is used.
- The big data challenge of analysing vast volumes of data to gain new insights becomes possible.
Is this really a silver bullet?
Not really, and the reasons should be familiar to traditional data warehouse practitioners.
- IT can deliver the data cheaply because it no longer spends time understanding the data and how it is used; instead, it simply dumps it into the data lake.
- The consequence is that the responsibility for sorting through the raw data to create meaningful insight is left to the business user. A degree of technical skill is required to extract the data, and this has led to the rise of business users (data scientists) who are highly skilled at data manipulation.
- Co-locating different data silos in one humongous data store doesn’t actually get around the problem. All you have achieved is putting the data in the same physical store. It’s still a nightmare to extract joined-up data across these multiple data sets, because the data hasn’t been structured in a way that enables it to work together.
Before long the data lake will become a data swamp. It will fill up with dirty data: poor quality, unknown provenance, overlapping copies and a lack of consistency. In effect, the data lake will become harder and harder to use.
Avoiding the Data Swamp
Avoiding a data swamp requires a curated repository with governance, consistency and access controls, just as you would expect in a traditional data warehouse. So some of the techniques to adopt are:
- Metadata tagging of all data – without descriptive metadata and a mechanism to maintain it, users will not know where the data came from (lineage) or what quality it is. Each piece of analysis would then require the user to start from scratch.
- Structure the data – one option is to make use of the conformed dimension concept, as this adds both structure and consistency to the data whilst also providing some performance benefits (see my recent post on this subject http://koios-associates.com/dimensions-big-data).
- Data governance – providing a degree of control over what data goes into the lake and how it is added.
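To make the metadata tagging and governance points concrete, here is a minimal sketch of a catalogue entry and an ingestion gate that refuses data lacking the required tags. All of the names here (the `CatalogueEntry` fields, the example datasets) are illustrative assumptions, not part of any particular product.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical catalogue entry: the minimum metadata we might insist on
# before a dataset is allowed into the curated zone of the lake.
@dataclass
class CatalogueEntry:
    dataset: str
    source_system: str           # lineage: where the data came from
    owner: str                   # who is accountable for its quality
    loaded_on: date
    quality_checked: bool = False
    description: str = ""

def ingestion_gate(entry: CatalogueEntry) -> bool:
    """Governance check: reject datasets with missing or unchecked metadata."""
    required = [entry.dataset, entry.source_system, entry.owner, entry.description]
    return all(required) and entry.quality_checked

good = CatalogueEntry("sales_2024", "ERP", "finance-team", date(2024, 1, 31),
                      quality_checked=True, description="Daily sales extracts")
bad = CatalogueEntry("mystery_dump", "", "", date(2024, 2, 1))

print(ingestion_gate(good))  # True  - allowed into the curated zone
print(ingestion_gate(bad))   # False - no lineage, no owner, no quality check
```

In practice this role is played by a catalogue tool rather than hand-rolled code, but the principle is the same: data without lineage, ownership and quality metadata never reaches the curated zone.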
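The conformed dimension idea can also be sketched briefly: two fact datasets from formerly separate silos share the same surrogate key into one customer dimension, so joined-up analysis becomes a simple lookup rather than a cleansing exercise. The keys and names below are invented for illustration.

```python
# A shared (conformed) customer dimension used by two otherwise
# separate fact datasets. Keys and names are illustrative assumptions.
customer_dim = {
    101: {"name": "Acme Ltd", "region": "EMEA"},
    102: {"name": "Globex", "region": "APAC"},
}

sales_facts = [{"customer_key": 101, "amount": 500.0},
               {"customer_key": 102, "amount": 250.0}]
support_facts = [{"customer_key": 101, "tickets": 3}]

def revenue_and_tickets_by_region():
    """Join both fact sets through the conformed dimension's region attribute."""
    out = {}
    for row in sales_facts:
        region = customer_dim[row["customer_key"]]["region"]
        out.setdefault(region, {"amount": 0.0, "tickets": 0})
        out[region]["amount"] += row["amount"]
    for row in support_facts:
        region = customer_dim[row["customer_key"]]["region"]
        out.setdefault(region, {"amount": 0.0, "tickets": 0})
        out[region]["tickets"] += row["tickets"]
    return out

print(revenue_and_tickets_by_region())
# {'EMEA': {'amount': 500.0, 'tickets': 3}, 'APAC': {'amount': 250.0, 'tickets': 0}}
```

Without the agreed dimension, each analyst would first have to reconcile customer identifiers across the silos before any cross-dataset question could be answered.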