5 mistakes to avoid when implementing data lakes

Many organizations are making debilitating mistakes that will ultimately hinder their ability to build a scalable, elastic, usable platform.

Nov 15, 2017 | 5 min read

Matt Maccaux

Global big data practice lead, Dell EMC Services

Storing big data has always been a challenge for healthcare organizations, but storing it in a way that’s readily accessible and useful has proven even more difficult. Enter “data lakes,” a much-buzzed-about solution for organizations that need a better way to store and work with massive amounts of data and analytics.

Data lakes, and big data technologies like Hadoop, HDFS, Hive and HBase, have quickly grown in popularity because of their ability to host raw data from applications in all forms, often at a smaller cost than enterprise data warehouses. The idea is that organizations can then easily search for the information they need, regardless of source or format, helping them leverage analytics more effectively in their day-to-day business operations.

But data lakes also offer a prime opportunity that too many organizations are missing—the ability to monetize their data. As organizations build out their data lakes without this longer-term goal in mind, they’re making debilitating mistakes that will ultimately hinder their ability to turn their data lake into a scalable, elastic data monetization platform.

There are five common implementation mistakes organizations are making that can affect the long-term business applications for their data lake technology.

Too much Hadoop. When Hadoop distributions or clusters pop up all over enterprises, there is a good chance you’re storing loads of duplicated data. Many enterprises deploy Hadoop little by little, department by department. This creates silos of data, which inhibits big data analytics because employees can’t perform comprehensive analyses using all of the data. This essentially re-creates the data warehouse/mart data proliferation problem data lakes were created to solve—but with more modern technology.

Too much governance. Some organizations take the concept of governance too far by building a data lake with so many restrictions on who can view, access, and work on the data that no one ends up being able to access the lake, rendering the data useless.

Not enough governance. Conversely, some organizations don’t have enough governance over their data lake, meaning they lack the data stewards, tools, and policies needed to manage access to the data. If lakes aren’t well organized and managed, they can quickly accumulate an immense amount of ungoverned, low-quality data. The data can become “polluted” or “tampered with,” and eventually the business stops trusting it, which again renders the entire data lake useless.

Inelastic architecture. The most common mistake organizations make is building their data lakes with inelastic architecture. Because data storage can be costly, organizations often grow their big data environment slowly and organically, one server at a time, starting out with basic servers but eventually adding high-performance servers to keep up with the demands of the business. Over time, the growth of data storage outpaces the growth of computing needs, and maintaining such a large physical environment becomes cumbersome and problematic.

Pet projects. IT teams often champion the implementation of data lakes as “pet projects,” believing that if they build a data lake, it will push the business to use it. IT teams want to build out a data lake and perform analytics on IT data to prove they can perform analytics on the business’ behalf. But IT use cases are notoriously low-value exercises from a business perspective, and do nothing to build credibility with the business stakeholders.
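The inelastic-architecture mistake above can be made concrete with some rough arithmetic. The sketch below assumes a cluster in which every server adds a fixed amount of storage and a fixed number of cores together, so the server count is driven by whichever need is larger. All figures (48 TB and 32 cores per server, and the yearly storage and compute demands) are hypothetical and chosen only to illustrate the pattern, not taken from any real deployment.

```python
import math

# Hypothetical per-server capacity (assumed for illustration only).
TB_PER_SERVER = 48
CORES_PER_SERVER = 32


def servers_needed(storage_tb, compute_cores,
                   tb_per_server=TB_PER_SERVER,
                   cores_per_server=CORES_PER_SERVER):
    """Servers required when storage and compute scale together.

    Each server brings both disks and cores, so the count is set by
    whichever requirement needs more servers.
    """
    by_storage = math.ceil(storage_tb / tb_per_server)
    by_compute = math.ceil(compute_cores / cores_per_server)
    return max(by_storage, by_compute)


def idle_cores(storage_tb, compute_cores,
               tb_per_server=TB_PER_SERVER,
               cores_per_server=CORES_PER_SERVER):
    """Cores purchased but not needed because storage drove the server count."""
    n = servers_needed(storage_tb, compute_cores, tb_per_server, cores_per_server)
    return n * cores_per_server - compute_cores


# Hypothetical demand: storage doubles each year, compute grows slowly.
demand_by_year = [(480, 320), (960, 400), (1920, 500)]

for year, (storage_tb, cores) in enumerate(demand_by_year, start=1):
    n = servers_needed(storage_tb, cores)
    print(f"year {year}: {n} servers, {idle_cores(storage_tb, cores)} idle cores")
```

Under these assumed numbers, the cluster starts out balanced, but by the third year most of the purchased compute sits unused because servers were added only to hold data. An architecture that lets storage and compute grow independently avoids paying for that idle capacity.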

The obstacles to data monetization using data lakes are larger than just implementation challenges.

There haven’t been any best practices or methodologies in place to help organizations define the potential value of their data so they can invest in the storage and analytic technologies they need to achieve this future. Without a sense of the opportunities around the corner, it’s tough for organizations to see the bigger picture and devote adequate resources to their data lakes.

Dell EMC recently performed research with the University of San Francisco to begin establishing a methodology based upon economic concepts (e.g., multiplier effect, scarcity) and data science techniques. One of the goals of our research was to define the role of the data lake, data governance, data quality and other data management disciplines in managing, protecting and enhancing the organization’s data and analytic assets. We also sought to help organizations define the economic value of their data so they could make better decisions as to where to invest their organization’s precious data and analytic resources.

For the organizations that grasp the opportunities and successfully overcome these obstacles, the “Data Lake Future” awaits. This future is reserved for those who fully embrace the unique characteristics of data and analytics and understand the power of digital assets that never deplete and can be reused across an unlimited number of use cases at near-zero marginal cost. They will see the data lake as a “collaborative value creation platform” that will drive not only new levels of efficiency, but also new data monetization opportunities.

As with any emerging technology, it will take time before data lakes, and therefore the organizations that run them, reach their full potential. But those who start the journey now – strategically and with a long-term vision – stand to create an enormous competitive lead that will be difficult to diminish in the years to come.
