How strong governance differentiates data lakes from swamps

Without precautions, data lakes can become junk drawers where organizations dump data with the intention of putting it in its proper context later.

Aug 07 185 min read

Stijn Christiaens

Chief technology officer, Collibra

A recent Forrester report finds that between 60 percent and 73 percent of all enterprise data goes unused for analytics. This statistic highlights one of the biggest challenges facing data scientists and organizations hoping to gain insight from their data.

As the volume of data increases, tapping its value and generating accurate reports has become a Herculean effort. Considering the many data initiatives healthcare organizations have in place, and the significant investments made, coming up short in data discovery and analytics represents a huge missed opportunity.

Familiar hurdles organizations face when using data for analytics include:

Data that can’t be found.
Data that, once found, makes no sense or isn’t trusted.
Conflicting definitions of data that make finding the “right” data impossible.

For organizations to effectively leverage data to differentiate products and services, improve decision-making and maintain competitive advantage, they need a comprehensive, enterprise-wide data strategy—one that ensures data becomes a valuable asset.

In recent years, data lakes have emerged as a viable way to store massive amounts of data cost-effectively. A data lake is a centralized repository that can store an enormous amount of raw data, enabling different users to analyze it and gain actionable insight. However, despite their promise, many lakes are overflowing, and organizations are struggling to operationalize this data.

Data lakes have massive scale and tremendous flexibility. They accommodate vast amounts of structured and unstructured data. And getting data into a lake is simple.

These very attributes, however, contribute to making it easy to lose track of what’s in the lake. In the rush to aggregate data somewhere, the lakes often serve as data junk drawers—a place where data is dumped for the moment, with the best intention to put it in its proper context later.

This isn’t surprising. In 2014, Gartner warned that data lakes (without the right level of governance) would be nothing more than disconnected data pools. A data lake requires a set of processes and policies around how data is collected, defined and secured. Without this kind of framework, it’s impossible to know what data is in the lake, where it came from, who owns it and its overall value to the organization.

Governance creates transparency across the organization, answering critical questions regarding the data lake, such as:

What’s in the data lake, and what should be in the data lake.
Where the data comes from, and where it’s been.
Who has access to the data.
Who’s using the data and how.

A good data governance framework combined with a data catalog can keep a data lake pristine by cleaning up the disorderly swamp of data. A data catalog offers a single source of intelligence for data experts and other data users who need quick access to their data. Users can tag, document and annotate data sets in the data catalog, continuously enriching the data and increasing the value of existing data assets while also eliminating data silos.

A data catalog enables users to collaborate to understand the data’s meaning and use, to determine which data is fit for what purpose, and which is unusable, incomplete, or irrelevant. It provides a way for every user to find data, understand what it means, and trust that it’s correct.
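As a rough illustration of the idea, a catalog can be thought of as a set of metadata records that capture where each data set came from, who owns it, how it is tagged and who may use it. The sketch below is a minimal, hypothetical model and not any vendor's API; the names CatalogEntry, DataCatalog, claims_2017 and the role labels are all assumptions made up for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CatalogEntry:
    """Hypothetical metadata record for one data set in the lake."""
    name: str
    source: str                      # where the data came from
    owner: str                       # who is accountable for it
    description: str = ""            # what the data means
    tags: set[str] = field(default_factory=set)
    allowed_roles: set[str] = field(default_factory=set)


class DataCatalog:
    """Toy in-memory catalog: register, tag, search and check access."""

    def __init__(self) -> None:
        self._entries: dict[str, CatalogEntry] = {}

    def register(self, entry: CatalogEntry) -> None:
        self._entries[entry.name] = entry

    def tag(self, name: str, *tags: str) -> None:
        # Tags are stored lowercase so searches are case-insensitive.
        self._entries[name].tags.update(t.lower() for t in tags)

    def find_by_tag(self, tag: str) -> list[str]:
        wanted = tag.lower()
        return sorted(n for n, e in self._entries.items() if wanted in e.tags)

    def can_access(self, name: str, role: str) -> bool:
        return role in self._entries[name].allowed_roles


# Example usage with made-up values.
catalog = DataCatalog()
catalog.register(
    CatalogEntry(
        name="claims_2017",
        source="billing_system",
        owner="finance",
        description="Raw claims records, one row per claim.",
        allowed_roles={"analyst"},
    )
)
catalog.tag("claims_2017", "Claims", "finance")

print(catalog.find_by_tag("claims"))
print(catalog.can_access("claims_2017", "analyst"))
```

Even in this toy form, each record answers the governance questions above: what the data set is, where it came from, who owns it and who is allowed to use it. A production catalog would add lineage, quality scores and usage tracking on top of this.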

Organizations today are either building a brand-new data lake or cleaning up an existing one. Whether an organization has inherited a swamp or is just starting out and wants to keep its data lake pristine, establishing a set of policy-driven processes can help it avoid these four common data lake problems.

Data without context. A data catalog helps users understand the data they find by providing information about that data, including its origin, format and use as well as its relationship to other data.

Data that can’t be found. A data catalog organizes and structures data to help people find the information they need to solve problems.

Data that can’t be trusted. A data catalog can help data users find the best data for their purposes, understand the quality of that data, and know whether it’s appropriate to join data from disparate sources.

Data that can’t be shared. A data catalog makes it easier for people to work collaboratively with transparency and trust, enriching data sets and driving value across—and beyond—the enterprise.

Without question, big data is big business. But what matters is not how much data is in the lake; it’s how the organization uses that data.

To realize the potential of data lakes, organizations must take appropriate steps to ensure these lakes don’t turn into swamps. This requires a governance framework that enables organizations to establish control over the data dumped into their lakes.

Governance empowers data users, helping them to find, understand and trust their data to improve decision-making and drive innovation.
