The Need For Big Data Governance Collibra Mapr
These processes also must connect with the right stakeholders. In many cases, the individuals who understand
the meaning and uses of data are not familiar with the technical aspects of its management. They are people
who work in the business units that use this data to provide value. Capturing their input, and providing them
with useful assistance, requires an application that is specifically tailored to their needs.
Define up front what aspects of data management are critical to your business
Knowing what you need to govern is a critical part of implementing proper data governance. While all information
probably should be subject to some governance, and should be cataloged so it can be found, there is a subset
of crucial information that should be the focus of any data governance effort. These critical data elements and
their antecedents are the basis of how the organization makes decisions, serves customers, and reports to
regulators.
Data governance doesn’t have to be a new and burdensome initiative. In fact, some organizations might put
aside a formal data governance program because of the perception of inhibitors and overhead. But the fact is
you already have processes in place that act as a foundation for a formal data governance program. These might
be labeled as “workflows” or “business rules,” but those are merely different terms for the same set of practices.
Use these processes as a starting point to build out a strategy that helps you to gain more value from big data.
Indeed, most organizations have some recognition of the importance of these activities and may have
systematized them within a particular function or domain. Performed well, these activities need not be a
hindrance to gaining business value. For example, essential to the transition to a data-driven
organization is the understanding that interesting insights come when data is combined in new ways.
The implementation known as a "data lake" necessarily requires processes that let you keep the data you
need in a way that eliminates technical barriers, and gives you new capabilities to process that data. This flexibility
means that the processes for managing and governing the data can be applied seamlessly across
the entire scope of newly accessible data.
While concepts such as high availability and disaster recovery are often not classified as components of a data
governance strategy, these capabilities are critical to any environment where data is a valuable asset. Therefore,
strategies for data governance must inherently include strategies for high availability and disaster recovery. After
all, if a system can't reliably stay running, the data is devalued, along with the associated data governance
strategy.
High availability, or ensuring your system runs continuously at a single data center, can often be a
complicated objective. You should ideally seek a system that is designed to minimize the administrative
overhead of overcoming failures within a cluster. If a hardware component fails, your team should be able to limit
the response to simply replacing the failed component, rather than reconfiguring the software to work around the failure.
Disaster recovery is sometimes overlooked as a critical component of a production environment, mostly because
site-wide disasters are rare. But for any mission-critical environment, having those safeguards in
place is important. Interestingly, more and more organizations seek global deployments that put copies of data
in geographically dispersed regions. In these scenarios, the primary goal is usually to reduce access times by
putting data closer to the users. A side benefit of global deployments is that the replication of data also delivers a
disaster recovery configuration. Should any site suffer an outage, its local users can access remote clusters in
the meantime to continue daily operations. Increasingly, however, we are seeing these types of global
deployments in response to regulatory pressures, such as keeping personally identifiable information in the home
country of the individual. This is a great example of how the governance of the data, and the policies on its
retention and location, intersect with the system configuration and the approach to availability and disaster
recovery.
Organizations should also safeguard against data corruption resulting from application or user error. For
example, snapshot capabilities create point-in-time views of data, providing a recovery option should data
corruption occur. Snapshots are also a great way to track data history and lineage, capturing a read-only view
of data at a specific point in time that can be traced back during forensic analysis.
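The point-in-time behavior described above can be sketched conceptually. This is a minimal illustration of the idea only, not MapR's storage-level implementation; the class and method names are hypothetical, and a real platform creates snapshots without copying the data.

```python
import copy

class VersionedStore:
    """Toy key-value store illustrating point-in-time snapshots.

    Conceptual sketch only: a real data platform implements
    snapshots at the storage layer, not by deep-copying values.
    """

    def __init__(self):
        self._data = {}
        self._snapshots = {}

    def put(self, key, value):
        self._data[key] = value

    def snapshot(self, name):
        # Capture a read-only, point-in-time view of the data.
        self._snapshots[name] = copy.deepcopy(self._data)

    def read_snapshot(self, name):
        # Snapshots never change, so later writes (or corruption)
        # in the live data do not affect them.
        return dict(self._snapshots[name])

store = VersionedStore()
store.put("balance", 100)
store.snapshot("before_batch_load")
store.put("balance", -1)  # simulate corruption from a bad job
print(store.read_snapshot("before_batch_load")["balance"])  # 100
```

Because the snapshot is immutable, it serves both purposes the text describes: a recovery point after corruption, and a stable view of history for forensic analysis.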
Data quality dashboards display the results of data quality scans, and give a perspective on whether the quality
of the data is improving. They should enable the organization to accumulate the values that fail
quality rules, to assist in prioritization and improvement.
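To make this concrete, here is a hedged sketch of a quality scan that accumulates failing values so the most frequent offenders can be prioritized. The rules, field names, and records are hypothetical; real scans would run against the platform's datasets.

```python
from collections import Counter

# Hypothetical quality rules: each maps a field to a predicate
# the value must satisfy.
rules = {
    "age": lambda v: isinstance(v, int) and 0 <= v <= 120,
    "email": lambda v: isinstance(v, str) and "@" in v,
}

def scan(records, rules):
    """Run quality rules over records, counting each failing value
    so the dashboard can rank bad values for cleanup."""
    failures = Counter()
    passed = 0
    for rec in records:
        ok = True
        for field, rule in rules.items():
            value = rec.get(field)
            if not rule(value):
                failures[(field, repr(value))] += 1
                ok = False
        passed += ok
    return passed, failures

records = [
    {"age": 34, "email": "a@example.com"},
    {"age": -5, "email": "b@example.com"},
    {"age": 41, "email": "not-an-email"},
    {"age": -5, "email": "c@example.com"},
]
passed, failures = scan(records, rules)
print(passed)                    # 1 record passed every rule
print(failures.most_common(1))   # the value -5 for "age" tops the list
```

Tracking `passed` over successive scans gives the trend line the dashboard displays; `failures.most_common()` gives the prioritized cleanup list.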
This combination of rapid and responsive policy management combined with ongoing quality improvement
creates the ideal environment for maintaining your data and keeping it of high quality.
There are three governance activities that are critical to ensuring the protection of data in a big data environment.
First, there must be some control on the data as it is brought into the environment. This ingestion control is
important to ensure that the data can be properly identified. Second, there must be a way to assign appropriate
policies, and to develop new policies, for the security and privacy of the data. These policies need to be explicitly
associated with specific sets of data, and need to be visible to everyone who can use that data. These
policies must also be linked to specific enforcement, which is usually done through the third element: the controls, procedures,
and scripts in the data management environment. Each of these elements needs to be integrated, so that when
data is in the big data environment, its protection is assured and unambiguous. This limits risk for the
organization.
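A minimal sketch of how these three activities fit together, under assumed names. The catalog, the policy labels, and the clearance model are all illustrative, not any vendor's API; the point is that ingestion, policy association, and enforcement are integrated rather than separate.

```python
# Hypothetical sketch of the three activities: ingestion control,
# policy association, and enforcement.

catalog = {}   # dataset name -> identifying metadata, recorded at ingest
policies = {}  # dataset name -> set of policy labels

def ingest(name, source, contains_pii=False):
    """Ingestion control: data enters only with identifying metadata."""
    catalog[name] = {"source": source, "contains_pii": contains_pii}
    # Policy association: policies are attached explicitly at ingest
    # time, so they are visible to every user of the dataset.
    policies[name] = {"pii-restricted"} if contains_pii else {"general"}

def can_read(user_clearances, dataset):
    """Enforcement: a control that checks a user's clearances
    against the dataset's associated policies."""
    if dataset not in catalog:
        return False  # unidentified data is never served
    return policies[dataset] <= user_clearances

ingest("web_clicks", source="clickstream", contains_pii=False)
ingest("customers", source="crm", contains_pii=True)

print(can_read({"general"}, "web_clicks"))                   # True
print(can_read({"general"}, "customers"))                    # False
print(can_read({"general", "pii-restricted"}, "customers"))  # True
```

Because enforcement consults the same policy records created at ingest, protection stays unambiguous: there is no dataset in the environment without an identified source and an explicit policy.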
What is different about big data and how does that affect data
governance?
There are several things about big data that change previous understanding of data governance. Each of these
requires a new approach to governing the data assets effectively.
Creating value by combining data that has not been related before.
The sharing of data is often a process that has not been formalized. The goal of the data lake is to create an
environment where all the data can be easily utilized. This means that the different parts of the organization that
own the data must all agree to provide it, and to provision it in a controlled way. In addition, the data can now be
shared with many parts of the organization, often without much effort on their part. This means that data sharing
requirements need to be explicitly negotiated, so that all users of the data understand what they should and
should not do with it. Also, the scope for semantic mismatches increases, as different parts of the
organization will use the same terms with different meanings. The business glossary and data catalog are crucial
for sorting out which data is the meaningful data for a particular purpose.
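The semantic-mismatch problem can be sketched as a glossary lookup scoped by business domain. The domains, terms, and dataset names below are invented for illustration; a real glossary (such as one maintained in a governance tool) holds curated definitions rather than a dictionary.

```python
# Hypothetical business-glossary sketch: the same term means
# different data in different business units, so lookups are
# scoped by domain and resolved to a canonical dataset.
glossary = {
    ("sales", "customer"): "crm.accounts",        # anyone with a signed deal
    ("marketing", "customer"): "web.registered",  # anyone with an account
}

def resolve(domain, term):
    """Return the dataset a term refers to within a domain,
    or None, so ambiguity is surfaced rather than guessed."""
    return glossary.get((domain, term))

print(resolve("sales", "customer"))      # crm.accounts
print(resolve("marketing", "customer"))  # web.registered
print(resolve("finance", "customer"))    # None: needs a glossary entry
```

Returning `None` for an unmapped pair is the design point: an analyst combining data across domains is forced to resolve the term explicitly instead of silently joining two different notions of "customer".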
More varied and flexible processes.
Instead of up-front, ETL-based definitions and policy determination, big data implies a bottom-up "do it as you
need it" approach to governance. This in turn means that the automation system for that governance needs to
be highly flexible and collaborative, as well as having a clear operating model. This operating model, which takes
into account the entire lifecycle of how data is provisioned, used, changed, and retired, as well as its quality and
reliability, needs to be automated to deal with the ever-increasing amount and variety of data.
Maintain availability
When managing big data, you want to maximize uptime while minimizing the effort of ensuring that uptime. Your
underlying big data platform must deliver on these objectives, and the MapR Distribution is designed to do so.
A wide range of important features are required to address your data security requirements. Your data platform
must provide the data-centric controls to ensure a secure environment. With MapR, you get:
• Integration with Kerberos as well as a wide range of user registries via Linux Pluggable Authentication
Modules (PAM)
• Access controls not only at the file level, but at a granular level in MapR-DB, including at the column
family, column, document, and sub-document levels
• Encryption for data in motion, so all network connections between nodes, applications, and services are
protected from eavesdropping
• Comprehensive auditing that logs data accesses as well as administrative and authentication activities
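The column-family-level access controls in the second bullet can be illustrated conceptually. This sketch is in the spirit of those controls but is not MapR-DB's API; the table, column families, and roles are hypothetical.

```python
# Conceptual sketch of granular access control: permissions are
# attached per column family within a table, so sensitive column
# families can be restricted while the rest stay broadly readable.
table_acls = {
    "patients": {
        "demographics": {"analyst", "clinician"},
        "diagnosis": {"clinician"},  # more sensitive column family
    }
}

def authorized(role, table, column_family):
    """Check a role against the ACL of one column family.
    Anything not explicitly granted is denied."""
    allowed = table_acls.get(table, {}).get(column_family, set())
    return role in allowed

print(authorized("analyst", "patients", "demographics"))  # True
print(authorized("analyst", "patients", "diagnosis"))     # False
```

The design choice this illustrates is default-deny at the finest granularity available: a missing ACL entry means no access, rather than inherited access from the table level.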
Data governance ensures that your queries return the right data, so the analytical metrics based on that data can be trusted. Data
scientists, owners, and users can ensure that the correct data values, references, and results are used. Using
unstructured data demands efficient coordination between producers, consumers, and data scientists, to ensure
all parties are aware of changes that might impact results. Since changes to this data happen frequently, and
often continuously as new uses are found for it, this is a critical capability. This communication also
reduces time-consuming error analysis and resolution, partly because there are fewer inexplicable errors in the
analysis, and partly because the process of reporting problems and resolving them is automated. This increases
trust in the analytics, increases their use, and promotes self-service. Collibra Data Governance gives you
complete control of, and visibility into, your data, its policies, and its attributes.
Data governance lets you know what you have, and lets you find that knowledge in many different ways. A big data
environment is not just tables, files, and streams. There are many different types of assets that organizations use
to deliver high performance, predictive analysis, and unique insights. These include analytical models,
map/reduce jobs, queries, visualizations, reports, and any other artifact that uses the data. Each of these, as well as custom
assets, can be easily configured and used in Collibra Data Governance Center. It provides complete visualization
of any type of relationship, including lineage relationships and context. Every one of these capabilities is designed
to make sure that you are using the right data for the right purpose at the right time.
MapR provides the industry's only converged data platform that integrates the power of Hadoop and Spark with
global event streaming, real-time database capabilities, and enterprise storage, enabling customers to harness the
enormous power of their data. Organizations with the most demanding production needs, including sub-second
response for fraud prevention, secure and highly available data-driven insights for better healthcare, petabyte
analysis for threat detection, and integrated operational and analytic processing for improved customer
experiences, run on MapR. The majority of customers achieve payback in fewer than 12 months and realize
greater than 5X ROI. MapR ensures customer success through world-class professional services and with free
on-demand training that 50,000 developers, data analysts and administrators have used to close the big data
skills gap. Amazon, Cisco, Google, HPE, Microsoft, SAP, and Teradata are part of the worldwide MapR partner
ecosystem. Investors include Google Capital, Lightspeed Venture Partners, Mayfield Fund, NEA, Qualcomm
Ventures and Redpoint Ventures. Connect with MapR on LinkedIn, and Twitter.
About Collibra
Collibra Corporation is the industry’s only global data governance provider founded to address data
management from the business stakeholder perspective. Delivered through a cloud-based or on-premises
solution, Collibra is the trusted data authority that provides data stewardship and data governance for the enterprise
business user. Collibra automates data governance and management processes by providing business-focused
applications where collaboration and ease-of-use come first. Collibra’s data governance platform embraces the
new requirements of big data solutions, with automation, machine learning, and the flexibility to govern data
assets from source to visualization. Find out more at http://www.collibra.com/