Metadata within the Data Swamp
So I thought it was time to share and collaborate a bit more on my latest book. A brief recap: the book is going to be called ‘Draining the Data Swamp’, after input from a number of individuals but specifically Ben Hunnisett. The book is about getting the best out of Big Data whilst leveraging existing, more traditional technologies. It’s not so much about the technology as about the patterns and architecture that will enable success. Some of you will remember that, in feedback on the content, Shenbhaga Pandian Pandi suggested that a chapter on metadata would be valuable.
I’ve had a go at an initial draft and wanted to share some of the ideas and the structure of the chapter. I would be really keen to hear people’s thoughts. I have not included all the text as it is still very rough (and anyway that’s for the full book), but I have shown a few bits and the overall structure.
Introduction
Organisations are struggling to manage their big data estates and are starting to realise that metadata is the key to unlocking the value within big data. IDC refers to the lack of metadata about the data held in our big data lakes as the ‘Big Data gap’. They argue that Big Data is struggling because data is getting lost. The majority of the data ingested into our data lake is raw data files with little or no definition, which means that little is known about it.
The mentality of just dumping data into the data lake has been shown to be non-viable, as it compromises trust in the raw data and the ability to produce analysis from it. Organisations are starting to see that data needs to be monitored, sanitised and structured to some degree, otherwise chaos will ensue: the data swamp. The data needs to have some level of cohesion so that it can be queried and analysed. The challenge with big data is being able to manage it so that it can be found when needed. We need to be able to: identify, capture, curate, store, search, share, analyse.
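To make the capture and search steps a little more concrete, here is a minimal sketch in Python. Everything in it (the `DatasetMetadata` fields, the `MetadataCatalogue` class, the example path) is a hypothetical illustration rather than the framework from the book; it simply assumes that a small descriptive record is written whenever a file lands in the lake.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class DatasetMetadata:
    """Hypothetical minimal record captured when a file lands in the lake."""
    path: str
    source_system: str
    owner: str
    description: str
    tags: set[str] = field(default_factory=set)
    ingested_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))


class MetadataCatalogue:
    """In-memory stand-in for a metadata repository (illustration only)."""

    def __init__(self) -> None:
        self._entries: dict[str, DatasetMetadata] = {}

    def register(self, entry: DatasetMetadata) -> None:
        # Capture: refuse data with no description, so nothing lands undocumented.
        if not entry.description.strip():
            raise ValueError(f"refusing to register {entry.path}: no description")
        self._entries[entry.path] = entry

    def search(self, tag: str) -> list[str]:
        # Search: return the paths of every dataset carrying the tag, sorted.
        return sorted(p for p, e in self._entries.items() if tag in e.tags)


catalogue = MetadataCatalogue()
catalogue.register(DatasetMetadata(
    path="/lake/raw/sales/2016-01.csv",  # made-up example path
    source_system="pos",
    owner="sales-team",
    description="Daily point-of-sale extracts",
    tags={"sales", "raw"},
))
found = catalogue.search("sales")
```

The point of the sketch is only that data which is described at ingest can be found again later; a real repository would also need to cover curation, storage and sharing.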
Subtopics include:
- What is Metadata
- What’s the Value
- Applying Structure to Big Data: introduces the idea that Big Data still requires structure; it just applies it in a different way (schema-on-read rather than schema-on-write)
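As a rough illustration of the schema-on-read idea in the last subtopic, the sketch below keeps the raw data as untyped text and applies a schema only at the moment it is read. The `RAW` sample, the `ORDER_SCHEMA` mapping and the `read_with_schema` helper are all assumptions made up for this example, not something taken from the chapter.

```python
import csv
import io

# Raw data as it might sit in the lake: plain text, no declared types.
RAW = "order_id,amount,country\n1,19.99,GB\n2,5.00,FR\n"

# Hypothetical schema held as metadata, separately from the data itself.
ORDER_SCHEMA = {"order_id": int, "amount": float, "country": str}


def read_with_schema(raw_text, schema):
    """Schema-on-read: apply names and types at query time, not at load time."""
    rows = []
    for record in csv.DictReader(io.StringIO(raw_text)):
        # Cast each raw string to the type the schema declares for that column.
        rows.append({name: cast(record[name]) for name, cast in schema.items()})
    return rows


orders = read_with_schema(RAW, ORDER_SCHEMA)
total = sum(row["amount"] for row in orders)
```

With schema-on-write, the casting would instead happen once, at load time, and rows that did not fit would be rejected before being stored. Here the same raw file could be read later with a different schema, which is exactly why the schema itself becomes metadata worth managing.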
Metadata Framework
There will be a section on a framework which covers what metadata should be captured and how. The diagram below gives a sense of this.
The Metadata Repository
There will be some general text about metadata repositories and approaches, and then the key topics will be:
- Metadata within the traditional relational world
- Metadata in the Big Data world
- HCatalog: using HCatalog to manage metadata
- Other Big Data options
- A metadata repository strategy
So those are my initial thoughts on this subject. What do you think? What’s missing?