Tracking Sensitive Data with Dataguise and Cloudera Feb 18, 2017
Previously, we announced that the leaders in the data governance space have joined Cloudera to provide a unified foundation for open metadata and end-to-end visibility for governance. Today, we are happy to host this guest blog from Venkat Subramanian, Chief Technology Officer, and Subra Ramesh, VP of Engineering, both of Dataguise.
People often describe Big Data in terms of four Vs: volume, variety, velocity, and veracity. (A good backgrounder on how we got the first three Vs can be found here: http://tinyurl.com/c4l6rhw) Veracity, the new kid on the block, speaks to the tricky nature of data quality in Hadoop. With statistics such as “bad data can cost businesses up to 12% of their revenue” (source: Experian Data Quality), it is perhaps with good reason that people cite data veracity as a key big data challenge.
Cloudera has developed Cloudera Navigator, which seeks to provide end-to-end data governance for Apache Hadoop-based systems. Cloudera Navigator provides a rich set of features that span four key areas: comprehensive and unified auditing across Hadoop, unified and searchable technical and business metadata, lineage, and lifecycle management. With the inclusion of Navigator in Cloudera’s Accelerator Program, Cloudera is providing an open API framework that ensures that metadata from different repositories and systems can be automatically shared and easily searched, viewed and managed.
That’s why Dataguise is excited to be joining the Cloudera Accelerator Program and taking advantage of the APIs in Cloudera Navigator to fuse intelligent and automated sensitive data discovery directly into Navigator.
Using Dataguise’s sophisticated discovery of sensitive data within HDFS, and during ingest via Flume, FTP, and Sqoop, customers can create an interactive reporting system that gives precise details about the location and type of sensitive data across the entire cluster. Now, in a simple, automated two-step process, all files containing sensitive data of any type (credit cards, social security numbers, bank accounts, addresses, names, blood types, etc.) can be detected, counted, and reported by Dataguise, then uploaded to Cloudera Navigator and tracked there as smart metadata tags.
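To make the detect, count, and tag flow concrete, here is a minimal Python sketch. It is not Dataguise's detection engine, and it does not call the Cloudera Navigator API. The regular expressions, the function names `scan_text` and `build_tags`, and the `sensitive:` tag prefix are all assumptions made for illustration. Production discovery relies on far richer techniques than two patterns and a checksum.

```python
import re
from collections import Counter

# Illustrative patterns only (assumptions, not Dataguise's rules).
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:  # double every second digit, counting from the right
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


def scan_text(text: str) -> Counter:
    """Count candidate sensitive values in one file's text."""
    counts = Counter()
    counts["ssn"] += len(SSN_RE.findall(text))
    for match in CARD_RE.findall(text):
        digits = re.sub(r"\D", "", match)
        if luhn_valid(digits):  # discard digit runs that fail the checksum
            counts["credit_card"] += 1
    return +counts  # unary plus drops entries whose count is zero


def build_tags(counts: Counter) -> list:
    """Turn detected types into hypothetical metadata tag strings."""
    return sorted(f"sensitive:{kind}" for kind in counts)


if __name__ == "__main__":
    sample = (
        "name=Alice ssn=123-45-6789\n"
        "card=4111 1111 1111 1111\n"
        "note=order 1234 5678 9012 3456\n"
    )
    found = scan_text(sample)
    print(dict(found), build_tags(found))
```

In this sketch the third line of the sample looks like a card number but fails the Luhn check, so it is not counted. The resulting tag list is the kind of per-file summary that could then be attached to the file's metadata entry through whatever upload mechanism the governance tool exposes.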