There continues to be a lot of excitement about data lakes and the possibilities that they offer, particularly with analytics, data visualizations, AI and machine learning. As such, I’m increasingly being asked whether you really need Data Governance over a data lake. After all, a data lake is a centralized repository that allows you to store all your structured and unstructured data on a scalable basis.
Unlike a data warehouse, in a data lake, you can store your data as-is without having to structure it first. This has resulted in many organizations “dumping” lots of data into data lakes in an uncontrolled and thoughtless manner. The result is what many people are calling “Data Swamps”, which have not provided the amazing insights those organizations hoped for.
So the simple answer to the question is yes: you do need Data Governance over data lakes to prevent them from becoming data swamps that users don’t access because they don’t know what data is there, they can’t find it, or they just don’t trust it. If you have Data Governance in place over your data lake, then you and your users can be confident that it contains clean data that can be found and used appropriately.
But I don’t expect you to just take my word for it; let’s have a look at some of the reasons why you want to implement Data Governance on data being ingested into your data lake:
Data Owners Are Agreed
Data Owners should approve whether the data they own is appropriate to load into the data lake: for example, is it sensitive data, and should it be anonymized before loading?
In addition, users of the data lake need to know who to contact if they have any questions about the data and what it can or can’t be used for.
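To make this concrete, here is a minimal sketch of how an ingestion check along these lines might look. Everything in it is an assumption for illustration: the `IngestionRequest` record, its field names, and the rule that sensitive data must be anonymized before loading are hypothetical, and your own Data Owners may well agree a different policy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class IngestionRequest:
    """Hypothetical record describing a dataset proposed for the data lake."""
    dataset: str
    owner_contact: str        # who users should contact with questions
    owner_approved: bool      # has the Data Owner signed off on the load?
    contains_sensitive: bool  # e.g. personal data
    anonymized: bool          # has the sensitive data been anonymized?


def check_ingestion(request: IngestionRequest) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed load into the data lake."""
    if not request.owner_contact:
        return False, "no Data Owner contact recorded"
    if not request.owner_approved:
        return False, "Data Owner has not approved the load"
    # Assumed policy: sensitive data is only loaded once anonymized.
    if request.contains_sensitive and not request.anonymized:
        return False, "sensitive data must be anonymized before loading"
    return True, "approved"
```

A check like this would sit in front of the load itself, so that nothing lands in the data lake without a named owner and a recorded decision.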
Data Definitions
Whilst data definitions are desirable in all situations, they are even more necessary for data lakes. In the absence of definitions, users of data in more structured databases can use the context of that data to glean some idea of what the data may be. As a data lake is by its nature unstructured, there is no such context.
A lack of data definitions means that users may not be able to find or understand the data, or may use the wrong data for their analysis. A data lake could provide a ready source of data, but a lack of understanding about it means that it cannot be used quickly and easily. This means that opportunities are missed and use of the data lake ends up confined to a small number of expert users.
Data Quality Standards
Data Quality Standards enable you to monitor and report on the quality of the data held in the data lake. While you do not always need perfect data when analyzing high volumes, users do need to be aware of the quality of the data. Without standards (and the ability to monitor against them) it will be impossible for users to know whether the data is good enough for their analysis.
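As a rough illustration of monitoring against a standard, the sketch below measures one simple quality dimension, completeness, and compares it with an agreed minimum. The function names, the dictionary-based records and the thresholds are all assumptions made for the example, not a prescribed approach; real standards would usually cover more dimensions than this.

```python
from typing import Dict, List, Optional

Record = Dict[str, Optional[str]]


def completeness(records: List[Record], column: str) -> float:
    """Fraction of records with a non-empty value in `column`."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(column) not in (None, ""))
    return filled / len(records)


def quality_report(records: List[Record],
                   standards: Dict[str, float]) -> Dict[str, dict]:
    """Compare each column's completeness with its agreed minimum.

    `standards` maps a column name to the minimum acceptable
    completeness (0.0 to 1.0); the values are hypothetical.
    """
    report = {}
    for column, minimum in standards.items():
        score = completeness(records, column)
        report[column] = {
            "score": score,
            "minimum": minimum,
            "meets_standard": score >= minimum,
        }
    return report
```

Publishing a report like this alongside the data lets users judge for themselves whether the data is good enough for their analysis, rather than assuming it is perfect or avoiding it altogether.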
Any data cleansing done in an automated manner inside the data lake needs to be agreed with Data Owners and Data Consumers. This is to ensure that all such actions comply with the definitions and standards and do not make the data unusable for certain analysis purposes. For example, defaulting missing dates of birth to an agreed date could skew an analysis that looks at the ages of customers.
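The date of birth example is easy to demonstrate with a few made-up numbers. In the sketch below, the four customers, the default date of 1 January 1900 and the fixed reference date are all invented for illustration.

```python
from datetime import date
from statistics import mean

AS_OF = date(2024, 1, 1)        # fixed reference date so the result is reproducible
DEFAULT_DOB = date(1900, 1, 1)  # hypothetical "agreed" default for missing values


def age_in_years(dob: date, as_of: date = AS_OF) -> int:
    """Whole years between `dob` and `as_of`."""
    years = as_of.year - dob.year
    # Subtract one if the birthday has not yet occurred in the as_of year.
    if (as_of.month, as_of.day) < (dob.month, dob.day):
        years -= 1
    return years


# Two customers with a known date of birth, two with it missing (made-up data).
dobs = [date(1984, 1, 1), date(1994, 1, 1), None, None]

# Option 1: leave missing values out of the analysis.
ages_known = [age_in_years(d) for d in dobs if d is not None]

# Option 2: cleanse by defaulting missing values to the agreed date.
ages_defaulted = [age_in_years(d if d is not None else DEFAULT_DOB) for d in dobs]

mean_known = mean(ages_known)          # (40 + 30) / 2 = 35
mean_defaulted = mean(ages_defaulted)  # (40 + 30 + 124 + 124) / 4 = 79.5
```

With only two defaulted records, the average customer age more than doubles from 35 to 79.5, which is exactly the kind of side effect that Data Consumers need the chance to object to before the cleansing rule is switched on.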
Data Quality Issue Resolution
While automated data cleansing inside the data lake may be appropriate in some cases, all identified data quality issues in the data lake should be managed through your existing data quality issue resolution process to ensure that the most appropriate solution is agreed by the Data Owner and the Data Consumers.
Data Lineage
Having data flows documented is always valuable, but in order to meet certain regulatory requirements (including EU GDPR), organizations need to prove that they know where data is and how it flows throughout their company.
One of the key data governance deliverables is data lineage diagrams. Critical or sensitive data being ingested into the data lake should be documented on data flow diagrams. This will add to the understanding of the Data Consumers by highlighting the source of that data. Such documentation also helps prevent duplicate data from being loaded into the data lake in the future.
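Lineage is normally captured in diagrams, but the same information can also be held in a simple register. The sketch below is one hypothetical way to do that: the `LineageEntry` and `LineageRegister` names and their fields are my own invention for the example, and it assumes that a source is identified by its system and object name. It shows how recording the source of each load makes it possible to spot a duplicate before it is loaded.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class LineageEntry:
    """Hypothetical record of where a data lake dataset came from."""
    source_system: str
    source_object: str  # e.g. a table or file name in the source system
    lake_path: str      # where the data lands in the data lake
    sensitive: bool     # critical or sensitive data needing documentation


class LineageRegister:
    """In-memory register keyed by (source_system, source_object)."""

    def __init__(self) -> None:
        self._by_source: Dict[Tuple[str, str], LineageEntry] = {}

    def find(self, source_system: str, source_object: str) -> Optional[LineageEntry]:
        """Return the existing entry for this source, if any."""
        return self._by_source.get((source_system, source_object))

    def register(self, entry: LineageEntry) -> None:
        """Record a new load, refusing sources that are already in the lake."""
        key = (entry.source_system, entry.source_object)
        existing = self._by_source.get(key)
        if existing is not None:
            raise ValueError(
                f"{key[0]}.{key[1]} is already loaded at {existing.lake_path}"
            )
        self._by_source[key] = entry
```

Data Consumers can use `find` to see where a dataset originated, and anyone proposing a new load gets an immediate warning if that source is already in the data lake.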
I hope I have convinced you that if you want a data lake to support your business decisions, then Data Governance is absolutely critical. Although it may not need to be as granular as the definitions and documentation that you would put in place for a data warehouse, it is needed to ensure that you create and maintain a data lake and not a data swamp!
Ingesting data into data lakes without first understanding that data is just one of many data governance mistakes that are often made. You can find out the most common mistakes and, more importantly, how to avoid them by downloading my free report here.