Most companies are undergoing a transformation to become "a data company that does [insert their business] better than anyone else." Modern enterprises are not only data-native, digital-native, and cloud-native; they are finding ways to differentiate themselves through their data and monetize it as an additional revenue stream. Furthermore, keeping up with rapid evolution in AI and machine learning (ML) will require strategic investments to stabilize the underlying data infrastructure. But what happens when the immense amount of data stored today is not properly managed?
Imagine trying to find a specific book in a library without knowing its location, its title, or even its author. There is no tool or librarian to ask, so you wander around asking other visitors for help, hoping they will point you in the right direction or simply hand you a book. Unmanaged data ends up the same way: buried in a dark corner of the "library," often no longer resembling the book it once was, its author unknown. This happens through data silos, redundant or duplicate platform services, conflicting data definitions and stores, and more, all of which add unnecessary cost and complexity.
While the ideal scenario is to ensure that all data assets are discoverable from the start, there are ways to untangle the chaos once it occurs. This is something every company struggles with: individual teams often provision their own infrastructure services, and not all data events on those platforms (including sharing, copying, exporting, and enrichment) are monitored at the enterprise level. Consequently, the challenge persists and expands, and the data library continues to grow without consistent governance or control.
The cost of data loss
The consequences of not being able to find data can be profound. It can severely impact an organization's operations and strategic objectives, impair decision-making, compromise operational efficiency, and increase exposure to regulatory non-compliance and data breaches. For decision-making in particular, the insights essential to making informed choices become unreliable or inaccessible.
This lack of visibility and trust delays the identification of trends, the understanding of customer needs, and the response to market changes, ultimately hampering competitiveness and agility over time. When data is scattered across unsupervised silos or duplicated across different cloud services without central oversight, it is like having books stashed in multiple corners of a library with no central catalog.
Furthermore, the inability to locate and protect sensitive data increases the likelihood of unauthorized access or inadvertent exposure, exacerbating the risks of privacy breaches and intellectual property theft. Ask any engineer or analyst and they will tell you how hard it is to manage data once it can be exported to spreadsheets. Before solving the download problem, you need to know what data is on the platform in the first place; at least then you can see that a download occurred and who can help with any subsequent audit.
Correcting course
For organizations that need to correct course, one of the most scalable solutions is "compliance as code." Simply put, this means ensuring that every data-related event, from the provisioning of services to the enrichment of data within them, is recorded, monitored, and traceable. Most importantly, these events must be visible to every stakeholder responsible for protecting or overseeing the data.
By transmitting these events to a common metadata store, for example by publishing to and enriching an enterprise catalog, companies can monitor and audit their data far more effectively. Ideally, any non-compliant resources are immediately remediated or removed, reducing the chance of data loss or undiscoverability. This way, every user who creates an object store, compute service, and so on is recorded for auditing, events are available for lineage and traceability, and, ideally, there is a path to data provenance.
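To make this concrete, here is a minimal, illustrative sketch of what recording such an event might look like. The endpoint, event schema, and function names are assumptions for illustration, not a specific product's API; in practice the event would go to whatever enterprise catalog or event bus your organization operates.

```python
import json
import urllib.request
from datetime import datetime, timezone

# Hypothetical endpoint for the enterprise metadata catalog; in practice this
# would be your own catalog's REST API or an event bus such as a Kafka topic.
CATALOG_URL = "https://metadata-catalog.internal/api/events"

def publish_data_event(actor: str, action: str, resource: str, details: dict) -> None:
    """Record a data-related event (provisioning, sharing, export, enrichment)
    so it is visible to stewards for audit, lineage, and provenance."""
    event = {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "actor": actor,        # who created or touched the resource
        "action": action,      # e.g., "provision", "export", "share"
        "resource": resource,  # e.g., an object store bucket or table URI
        "details": details,    # free-form context for auditors
    }
    request = urllib.request.Request(
        CATALOG_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        response.read()  # urlopen raises on HTTP errors; body is unused here

# Example: log that a user provisioned a new object store bucket.
publish_data_event(
    actor="jane.doe",
    action="provision",
    resource="s3://analytics-raw/customer-events",
    details={"service": "object-store", "team": "analytics"},
)
```

Because every provisioning and data event flows through a call like this, auditors get a single place to ask "who created this, and what has touched it since?"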
When data assets are already lost, tools like BigID act as a sophisticated library catalog, providing a critical bottom-up view of the ecosystem and helping organizations understand what data lives where and which systems use it. Tools that provide governance and compliance capabilities, such as a business glossary and workflow management, combined with the adoption of open patterns like the Apache Iceberg table format, will not only lower switching costs today and tomorrow but also make it easier to integrate the many functional catalogs and platforms across the enterprise. The goal is to create value quickly while simplifying data management for the future.
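As one example of how an open table format eases catalog integration, the sketch below uses the pyiceberg library to enumerate namespaces and tables from a central Iceberg REST catalog. The catalog name and URI are placeholders, and a real deployment would also need authentication properties; this is a sketch of the pattern, not a prescribed setup.

```python
from pyiceberg.catalog import load_catalog

# Connect to a central Iceberg REST catalog; the URI is a placeholder for
# whatever endpoint your enterprise catalog exposes.
catalog = load_catalog(
    "enterprise",
    **{
        "type": "rest",
        "uri": "https://iceberg-catalog.internal",
    },
)

# Enumerate namespaces and tables so a governance process can reconcile
# what actually exists against what the business glossary says should exist.
for namespace in catalog.list_namespaces():
    for table_id in catalog.list_tables(namespace):
        print(table_id)
```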
Companies need to gain insight into their data landscape, identify potential compliance issues, and take corrective action before data becomes unmanageable, let alone before they try to configure a system that scales. This will always be the responsibility of a central team, or, when fully democratized, a responsibility shared with functional leaders. To be clear, not all of these tools are required to get started. Rather, understanding your current state (or starting point) will dictate which use cases to prioritize for modernization. You need to balance quick wins with the large, fundamental shifts that let the transformation move faster in the medium term, maintaining momentum and continually building trust.
An effective parallel strategy is to build microservices or bots that scan, audit, and enforce compliance on an ongoing basis. These microservices can perform a variety of functions, from basic compliance checks to full anomaly detection that compares asset utilization against normal service provisioning, roles, and usage. By continuously monitoring data events and usage patterns, they can detect anomalies and potential compliance violations in real time, allowing for quick corrective action. As mentioned above, all data resources and events should be automatically logged upon provisioning, so the bot can immediately quarantine or delete any resource that is categorized as non-compliant.
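A compliance bot of this kind could be as simple as the following sketch: it sweeps an inventory of resources and flags anything missing an owner, a classification, or a catalog entry. The resource fields and policy checks are illustrative assumptions; real checks would be driven by your own policies and your cloud provider's inventory APIs.

```python
from dataclasses import dataclass

@dataclass
class DataResource:
    uri: str
    owner: str | None
    classification: str | None  # e.g., "public", "internal", "pii"
    registered_in_catalog: bool

def is_compliant(resource: DataResource) -> bool:
    """Basic policy: every resource must have an owner, a classification,
    and a catalog entry before it is allowed to persist."""
    return (
        resource.owner is not None
        and resource.classification is not None
        and resource.registered_in_catalog
    )

def scan(resources: list[DataResource]) -> list[DataResource]:
    """Return the resources that violate policy so they can be quarantined,
    escalated to the owning team, or deprovisioned."""
    return [r for r in resources if not is_compliant(r)]

# Example sweep over a (hypothetical) inventory pulled from cloud APIs.
inventory = [
    DataResource("s3://analytics-raw/customer-events", "jane.doe", "pii", True),
    DataResource("s3://tmp-export-2931", None, None, False),  # orphaned export
]
for violation in scan(inventory):
    print(f"Non-compliant resource found: {violation.uri}")
```

Run on a schedule or triggered by provisioning events, a sweep like this turns the audit log into enforcement rather than just a record.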
The next chapter
Much like a well-organized library where every book is cataloged and easily accessible, a well-managed data environment enables businesses to thrive. Avoiding data chaos requires a proactive and strategic approach to data management that does not create additional friction or processes for users. By implementing compliance as code, leveraging data visibility tools, and building microservices for continuous compliance, businesses can ensure their data assets remain findable, secure, and valuable. With these strategies in place, companies can navigate the complexities of data management and drive sustained growth and innovation.
Finally, it is critical to foster a culture of data governance within the organization. Educating employees on the importance of data governance and establishing clear protocols for handling data can significantly reduce risk for businesses. Regular training sessions and updates on best practices ensure that all team members are aligned with the company’s data governance goals.
This article was produced as part of TechRadarPro's Expert Insights channel, where we showcase the best and brightest minds in the tech industry today. The views expressed here are those of the author and are not necessarily those of TechRadarPro or Future plc. If you're interested in contributing, find out more here.