How to assess the role of governance in data lakes and warehouses
Data lakes and data warehouses are both used to store data. While they differ by nature and serve organizations differently, data governance is a universal thread that runs through both, and without it they would be rendered useless.
Data lakes are repositories that can hold structured or unstructured data, including traditional transaction-type data, phone logs or any other form of information. They are truly a repository of all types of organizational data.
With data lakes, data can be brought in quickly, without complex provisioning, and no time is spent up front working out how it relates to or should interact with other data sitting in the lake. The data should be kept as close to its raw form as possible so that it can be used in multiple functions and isn’t locked into a particular use. Because all data is available, it enables much deeper forms of analytics.
Data lakes allow more flexibility for what-if analysis and modeling to identify relationships and likely outcomes that may not otherwise be obvious, such as with market-basket analysis. With data scientists able to quickly access more information to identify such obscure relationships, organizations can use that information to better serve customers.
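As a rough illustration of market-basket analysis, the sketch below counts how often pairs of items appear in the same transaction. The transactions and the function name pair_support are invented for this example, and real analyses would typically go further (for instance, computing confidence or lift).

```python
from collections import Counter
from itertools import combinations

# Hypothetical transactions: each set holds the items bought together.
transactions = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"butter", "jam"},
    {"bread", "butter"},
]


def pair_support(baskets):
    """Return the fraction of baskets in which each item pair appears together."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n / len(baskets) for pair, n in counts.items()}


support = pair_support(transactions)
```

Pairs with high support are candidates for the kind of non-obvious relationship described above.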
At the same time, they enable identification of negative indicators, which can help to protect an organization, and identify risks early on so they can be mitigated.
One example comes from a regulatory perspective. Probability of default is a key metric in regulatory reporting: models are built to calculate the probability of default for different classifications of customers, whether based on geographic location, credit limit or other factors.
With a wide range of factors used in the model, data lakes can provide access to more data more quickly, which can improve the accuracy of the models. This enables organizations to better serve their clients and gives them insight into possible risks early on so that they can be mitigated.
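To make the idea of segment-level default probabilities concrete, here is a minimal sketch that computes the observed default rate per customer segment. It is only a stand-in: actual probability-of-default models are statistical models fitted on many factors, and the records, segment labels and function name below are assumptions made up for illustration.

```python
from collections import defaultdict

# Hypothetical loan records: (region, credit_limit_band, defaulted)
records = [
    ("north", "low", True),
    ("north", "low", False),
    ("north", "high", False),
    ("south", "low", True),
    ("south", "low", True),
    ("south", "high", False),
    ("south", "high", True),
]


def default_rate_by_segment(rows):
    """Return the observed share of defaults for each (region, band) segment."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for region, band, defaulted in rows:
        key = (region, band)
        totals[key] += 1
        defaults[key] += int(defaulted)
    return {key: defaults[key] / totals[key] for key in totals}


rates = default_rate_by_segment(records)
```

Adding more factors from the lake would mean extending the segment key or, more realistically, replacing the simple ratio with a fitted model.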
Data warehouses are structured data sets that include both current and historical data. They are structured to meet reporting or analytical requirements. Creating a “single source of truth” for multiple reporting and analytical requirements reduces the risk of inconsistent and inaccurate reporting across the enterprise.
Data warehouses bring data together in a structured way: the data is modeled and set up in physical structures according to a set of requirements, with performance and the capture of consistent data relationships being the key goals.
Data warehouses are used to consolidate data sources, so that everything flows into the same tables via a common set of domains and definitions. There can be one source or 20, but the data will all be presented for use under a set of business-defined and understood domains for organizational purposes.
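The consolidation step can be sketched as a mapping from each source system's field names to one common set of definitions. The source names, field names and the to_common helper below are hypothetical, chosen only to show the shape of the idea.

```python
# Hypothetical mappings from each source system's fields to common domain names.
FIELD_MAPS = {
    "crm": {"cust_id": "customer_id", "full_name": "customer_name"},
    "billing": {"CustomerNumber": "customer_id", "Name": "customer_name"},
}


def to_common(source, row):
    """Rename a source row's mapped fields to the common definitions."""
    mapping = FIELD_MAPS[source]
    # Fields without a mapping are dropped; only agreed domains are loaded.
    out = {mapping[key]: value for key, value in row.items() if key in mapping}
    out["source_system"] = source  # keep track of where the row came from
    return out


crm_rows = [{"cust_id": 1, "full_name": "Ada Lopez", "segment": "gold"}]
billing_rows = [{"CustomerNumber": 2, "Name": "Ben Ito"}]

# Both sources land in one table with identical column names.
warehouse_table = [to_common("crm", r) for r in crm_rows] + [
    to_common("billing", r) for r in billing_rows
]
```

Whether there is one source or 20, each only needs its own entry in the mapping; the table that users query keeps the same business-defined columns.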
Having data well organized and consistently aggregated enables the creation of performance and operational metrics—reporting that drives business and enables leaders to make informed decisions. Inclusion of both historical and current information—organized in a consistent manner within the data warehouse—increases the quality of the viewed data, thus increasing decision-making quality.
A key example of this can be seen in seasonality. Operational metrics pulled from data warehouses can help identify times of the year that see more activity than others.
This historical analysis can guide staffing needs and what information is given to merchants, as well as indicate that customers should know this is a higher-activity time. It can also affect IT decisions: new systems shouldn’t be implemented during times when heavier patient censuses are expected. The metrics identified from data warehouse information can affect decisions across the entire organization.
Although they are different, the key to successful data lakes and data warehouses with useful, quality data is the same: governance. Data governance enables an understanding not only of what is stored where and where it came from, but also of the relative quality of the data, and it allows that quality to be ascertained consistently.
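A seasonality check of this kind can be sketched by averaging activity per calendar month across years and flagging months well above the overall average. The history rows and the 1.25 threshold are assumptions for illustration, not figures from any real warehouse.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (year, month, activity_count) rows pulled from a warehouse.
history = [
    (2022, 1, 100), (2022, 6, 90), (2022, 12, 180),
    (2023, 1, 110), (2023, 6, 95), (2023, 12, 200),
]


def peak_months(rows, threshold=1.25):
    """Return months whose average activity exceeds threshold x the overall average."""
    by_month = defaultdict(list)
    for _year, month, count in rows:
        by_month[month].append(count)
    monthly_avg = {month: mean(counts) for month, counts in by_month.items()}
    overall = mean(list(monthly_avg.values()))
    return sorted(m for m, avg in monthly_avg.items() if avg > threshold * overall)


peaks = peak_months(history)
```

With this sample data only December is flagged, which is the sort of signal that could inform staffing or the timing of a system rollout.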
Aside from clarity and structure, governance also enables control. With such control, the organization knows how the data is being used and whether or not it’s meeting its intended purpose.
Say the data has been manipulated to meet a set of determined requirements. Without data governance, someone else could come along and pull the data, not knowing it had already been altered, resulting in an inaccurate data analysis.
Essentially, governance is the key to maintaining transparency over what data is available, how data is available, what data should be used and who should or should not be using it. It serves as the glue ensuring both data stores are being utilized appropriately.
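One way to picture this is a small catalog that records, for each data set, where it came from, what has been done to it and who may use it. The sketch below is a toy under stated assumptions: the DatasetEntry and Catalog classes, the data set name and the role names are all hypothetical, and real governance tooling covers far more.

```python
from dataclasses import dataclass, field


@dataclass
class DatasetEntry:
    name: str
    source: str
    transformations: list = field(default_factory=list)  # what was done to the data
    allowed_roles: set = field(default_factory=set)      # who may use it


class Catalog:
    """Toy registry answering: what data exists, how it was changed, who may use it."""

    def __init__(self):
        self._entries = {}

    def register(self, entry):
        self._entries[entry.name] = entry

    def request(self, name, role):
        entry = self._entries[name]
        if role not in entry.allowed_roles:
            raise PermissionError(f"{role} may not use {name}")
        return entry


catalog = Catalog()
catalog.register(DatasetEntry(
    name="loans_filtered",
    source="lake/raw/loans",
    transformations=["removed accounts closed before 2020"],
    allowed_roles={"risk_analyst"},
))

# An authorized user sees the recorded transformations before analyzing the data.
entry = catalog.request("loans_filtered", "risk_analyst")
```

Because the transformation history travels with the data set, a second analyst pulling it can see that it was already filtered rather than assuming it is raw.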
Whether an organization employs a data lake, a data warehouse or both, it’s imperative that the data is governed appropriately. While both data stores provide beneficial insights that can help lead an organization, without a data governance framework to control and guide them, the wealth of data they hold may never live up to its transformative potential.