FAQs
Dataguise delivers a unique approach to protecting sensitive data assets in Hadoop. We combine intelligent data detection with automated, data-centric encryption, masking, and audit tools for Flume, Sqoop, FTP, MapReduce, Hive, and Spark.
We help our customers achieve two increasingly critical business goals:
1. Reducing breach risk and data loss through sensitive data protection.
2. Meeting Hadoop compliance, privacy, and regulatory mandates for PII, PCI, PHI, and HIPAA, as well as data privacy and data residency laws.
Detecting, protecting, and auditing sensitive data gives organizations an additional layer of protection beyond what’s possible with existing access control, authentication, authorization, and data-at-rest encryption available in Hadoop. This protective layer hands organizations the unique power to pinpoint, control, and audit sensitive data.
Data-centric security is essential for organizations that need to:
- Share data with “semi-trusted” users either inside the organization or externally.
- Sell data to third-party partners.
- Control and monitor internal access to gain better protection against insider risk.
- Partially reveal data — while preserving data uniqueness for analytics — through intelligent data masking or format-preserving encryption.
Most customers in regulated markets already use Hadoop's existing security capabilities: authentication and access control (Kerberos), file system and network isolation and segmentation (ACLs and network firewalls), file or volume encryption, and activity monitoring (logging, auditing, and data lineage). Dataguise fits into and enriches these existing systems in a simple, non-blocking manner by operating at the sensitive data element level (e.g., locking down individual IDs or names).
Businesses are rapidly adding new data sources to Hadoop analytics, including logging data, clickstream data, customer feedback, and sentiment data. As a result, the new data going into Hadoop is increasingly “gray”: its structure is harder to retain or maintain, it is harder to cleanse, and it is harder to determine where sensitive data resides within it and how much there is. Sensitive data detection helps organizations understand and protect this new, co-mingled, raw, and noisy data inside Hadoop.
The Dataguise detection process begins by defining a security policy: organizations select which sensitive elements they need to detect. The rest of the process is automated. Through agents for data ingest (Flume, Sqoop, FTP) as well as agents for at-rest data (HDFS, Hive, Pig), Dataguise analyzes all data and filters and counts sensitive data elements in TXT, CSV, log, Avro, and SequenceFile formats, as well as common unstructured formats (Word, Excel, PowerPoint, SMS, email).
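As a rough illustration of what such an element-level scan does (the policy, patterns, and class here are invented for the sketch, not Dataguise's actual implementation):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative only: a toy "policy" mapping element names to patterns,
// applied record-by-record the way an ingest or at-rest agent might.
public class ElementScanSketch {
    private static final Map<String, Pattern> POLICY = new LinkedHashMap<>();
    static {
        POLICY.put("SSN", Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b"));
        POLICY.put("CREDIT_CARD", Pattern.compile("\\b(?:\\d[ -]?){15}\\d\\b"));
        POLICY.put("EMAIL", Pattern.compile("\\b[\\w.%+-]+@[\\w.-]+\\.[A-Za-z]{2,}\\b"));
    }

    // Count occurrences of each policy element in one record of text.
    public static Map<String, Integer> scan(String record) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map.Entry<String, Pattern> e : POLICY.entrySet()) {
            Matcher m = e.getValue().matcher(record);
            int n = 0;
            while (m.find()) n++;
            if (n > 0) counts.put(e.getKey(), n);
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(scan("Contact jane@example.com, SSN 123-45-6789"));
        // Prints {SSN=1, EMAIL=1}
    }
}
```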
Our encryption engine runs as an automated process (“agent”) for data loaders (FTP, Flume, Sqoop). We also support native field- and row-level encryption inside an HDFS encryption agent. More generically, we provide a JAR for invoking encryption and decryption and have built decryption UDFs for Pig, Hive, and MapReduce.
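The decryption UDFs themselves are proprietary, but as a minimal sketch of the mechanism, a Hive UDF that decrypts a Base64-encoded AES field might look like this (the class name, demo key, and cipher mode are assumptions for illustration; a real deployment would use managed keys and a stronger mode than ECB):

```java
import java.util.Base64;
import javax.crypto.Cipher;
import javax.crypto.spec.SecretKeySpec;
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// Illustrative Hive UDF, not Dataguise's API. Key handling is deliberately
// simplified; a real deployment would fetch keys from a key manager,
// never hard-code them, and would avoid ECB mode.
public class DecryptFieldUDF extends UDF {
    private static final byte[] DEMO_KEY = "0123456789abcdef".getBytes(); // 128-bit demo key

    public Text evaluate(Text encrypted) {
        if (encrypted == null) return null;
        try {
            Cipher cipher = Cipher.getInstance("AES/ECB/PKCS5Padding");
            cipher.init(Cipher.DECRYPT_MODE, new SecretKeySpec(DEMO_KEY, "AES"));
            byte[] plain = cipher.doFinal(Base64.getDecoder().decode(encrypted.toString()));
            return new Text(new String(plain, "UTF-8"));
        } catch (Exception e) {
            return null; // in practice: log and surface the error
        }
    }
}
```

Such a UDF could then be registered and called from HiveQL, e.g., `CREATE TEMPORARY FUNCTION decrypt_field AS 'DecryptFieldUDF';` followed by `SELECT decrypt_field(ssn_enc) FROM customers;` (function and column names hypothetical).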
Dataguise is certified on all three major Hadoop distributions: Cloudera, Hortonworks, and MapR. In addition, downloadable Sandbox trials are available on both the Hortonworks and MapR partner websites. We also have production customers running DgSecure for Hadoop on Apache Hadoop, Amazon Elastic MapReduce, and Pivotal HD.
DgSecure detection helps organizations identify, locate, and classify sensitive data by cataloging and summarizing it. Dataguise customers can determine which sensitive data types exist and where they reside in enterprise data sources including relational databases, file systems, and big data Hadoop platforms. Using purpose-built lightweight software agents that run directly against such data repositories or within data pipelining tools (FTP, Flume, Sqoop), DgSecure detection can provide a comprehensive inventory of sensitive data across the enterprise landscape. The discovery results are presented via intuitive dashboards and reports that detail sensitive data by location and type.
DgSecure detection can scan for sensitive data elements across a wide variety of common enterprise data sources and automatically detect and handle common file formats across data stores, including:
- Relational Databases (RDBMS): Oracle, SQL Server, DB2 (Linux), DB2 (mainframe), Postgres, MySQL, Greenplum, Sybase, Teradata.
- Non-relational big data stores.
- Hadoop: All major distributions (Cloudera, Hortonworks, MapR, Pivotal, IBM BigInsights, Amazon EMR).
- Cloud platforms: AWS (Hadoop on EMR) and Azure (SQL Server).
- Structured file formats: relational database tables, Avro, SequenceFile, RC, ORC, and Hive tables.
- Unstructured/semi-structured file formats: Microsoft Office (DOC, XLS, PPT, DOCX, XLSX, PPTX), plain text formats (TXT, CSV), log formats, Adobe PDF, social media streams (e.g., Twitter feeds), and clickstreams.
In addition, DgSecure can inspect data while it is being moved or transformed across the enterprise using ETL and pipelining tools including Apache Flume, FTP, and Apache Sqoop.
Using DgSecure, enterprises can:
- Uncover, catalog, and summarize sensitive data in unknown or hard-to-measure areas of the enterprise data landscape, such as log files, clickstreams, machine data, and user-driven content (e.g., web, mobile, and office files).
- Assist with forensic investigations and data breach preparedness by cataloging the potential universe of sensitive information exposures and by correlating access patterns to sensitive information with the extent and timing of breach events.
- Help improve compliance with local, national and international privacy regulations in a practical, repeatable, and efficient manner.
- Provide data categorization to help build suitable policies for securing private and sensitive data. This includes helping build the foundation for determining what data needs to be protected, how, and to whom it should be accessible in unprotected form.
Dataguise customers can search through structured, semi-structured, or unstructured content to find a variety of sensitive data elements, including credit cards, Social Security Numbers, names, addresses, medical IDs, ABA bank routing numbers, and financial codes. In addition to pre-defined templates for these sensitive data types, our customers can also extend and build their own custom sensitive data elements through a sophisticated regex builder.
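For example, a custom element for ABA routing numbers could pair a nine-digit pattern with the standard ABA checksum, which eliminates most random nine-digit false positives. This sketch is illustrative only, not output of the regex builder:

```java
import java.util.regex.Pattern;

// Illustrative custom element: nine digits that also pass the ABA checksum
// 3(d1+d4+d7) + 7(d2+d5+d8) + (d3+d6+d9) = 0 (mod 10), which weeds out
// random nine-digit strings far better than the regex alone.
public class AbaRoutingCheck {
    private static final Pattern NINE_DIGITS = Pattern.compile("\\d{9}");

    public static boolean isValidRoutingNumber(String s) {
        if (!NINE_DIGITS.matcher(s).matches()) return false;
        int sum = 0;
        for (int i = 0; i < 9; i += 3) {
            sum += 3 * (s.charAt(i) - '0')
                 + 7 * (s.charAt(i + 1) - '0')
                 +     (s.charAt(i + 2) - '0');
        }
        return sum % 10 == 0;
    }

    public static void main(String[] args) {
        System.out.println(isValidRoutingNumber("011000015")); // true: passes the checksum
        System.out.println(isValidRoutingNumber("123456789")); // false: fails the checksum
    }
}
```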
DgSecure detection is a highly scalable, resilient, fault-tolerant, and customizable enterprise-class service for identifying and summarizing sensitive data at the element level.
DgSecure detection capabilities:
- Handles high volumes of disparate, constantly moving, and changing data.
- Supports a fluid or flexible information governance model that has a mix of highly “invested” (curated) data as well as raw, unexplored (gray) data including IoT (Internet of Things) data, clickstreams, feeds, and logs.
- Handles a variety of data stores such as traditional relational databases and enterprise data warehouses as well as non-relational big data sources (Hadoop) and file repositories (SharePoint and file shares).
- Processes structured, semi-structured, and unstructured or freeform data formats.
- Provides automated detection and processing of a variety of file formats and file/directory structures, leveraging meta-data and schema-on-read where applicable.
- Provides deep content inspection using techniques such as patent-pending neural-like network (NLN) technology, and dictionary-based and weighted keyword matches.
- Employs breakthrough computational and statistical approaches to detect sensitive data more accurately.
Yes, DgSecure detection provides pre-built, out-of-the-box templates for identifying sensitive data regulated by national and international data protection laws and industry standards, including:
- HIPAA
- PCI
- PHI
- PII
- FERPA
- HITECH
In addition, DgSecure detection can be extended or tailored for unique requirements, such as custom data that requires special protection due to intellectual property and/or competitive requirements.
DgSecure detection is designed to maximize ease of use, simplicity, and automation. Configuring and running detection involves the following simple steps (a hypothetical configuration sketch follows the list):
- Defining a policy for the sensitive elements that a user needs to find.
- Selecting top-level locations (e.g., directory or sets of directories) to scan.
- Selecting options such as sample size and blacklists/whitelists of sensitive items to use for scanning.
- Choosing whether to re-execute detection incrementally at pre-set intervals.
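Purely as a hypothetical illustration of what such a scan definition captures (every class and field name below is invented; DgSecure itself is configured through its console, not this API):

```java
import java.util.List;

// Hypothetical value object mirroring the four configuration steps above;
// none of these names are part of the real product API.
public class ScanDefinitionSketch {
    record ScanDefinition(
            String policyName,          // e.g., "PCI" or a custom policy
            List<String> scanPaths,     // top-level locations to scan
            double sampleFraction,      // 1.0 = full scan, <1.0 = sampling
            List<String> whitelist,     // elements to always report
            List<String> blacklist,     // elements to ignore
            String schedule) {}         // e.g., cron spec for incremental re-scans

    public static void main(String[] args) {
        ScanDefinition scan = new ScanDefinition(
                "PCI",
                List.of("/data/landing", "/data/warehouse"),
                0.10,                              // sample 10% of records
                List.of("CREDIT_CARD"),
                List.of("EMAIL"),
                "0 2 * * *");                      // re-run nightly at 02:00
        System.out.println("Scanning " + scan.scanPaths() + " with policy " + scan.policyName());
    }
}
```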
Once the scan is complete, DgSecure detection presents the user with a consolidated view of all the sensitive data elements of interest and their locations. Users can then drill down into specific areas and set up appropriate filtered views of their data. From the same interface, users can apply appropriate data protection to the discovered elements using DgSecure’s masking or encryption services.
Yes, DgSecure uses structural cues (e.g., schemas, indices, and file structures) and contextual cues wherever possible to speed up discovery. For example, it can automatically determine the file type of each file in the scan path and infer file structure using heuristics or a user-provided set of schemas and/or structure definitions, which speeds up matching during the scan. In addition, Dataguise is continually enhancing DgSecure detection to leverage the full context of the data: recognizing sensitive data when only partial information is available, using ontologies to disambiguate data, building Bayesian models that update the certainty of sensitive data found as more data is processed, and processing unstructured text with NLP integrated into the DgSecure detection engine.
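File-type determination typically keys off magic bytes; a minimal sketch of the idea (the detector itself is an assumption, but the magic numbers shown are the published ones for each format):

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

// Sketch of magic-byte sniffing: read the first few bytes of a file and
// classify it before deciding how to parse it.
public class FileTypeSniffer {
    public static String sniff(Path file) throws IOException {
        byte[] head = new byte[4];
        try (InputStream in = Files.newInputStream(file)) {
            if (in.read(head) < 4) return "unknown (too short)";
        }
        if (head[0] == 'O' && head[1] == 'b' && head[2] == 'j' && head[3] == 1) return "Avro";
        if (head[0] == 'S' && head[1] == 'E' && head[2] == 'Q') return "SequenceFile";
        if (head[0] == 'O' && head[1] == 'R' && head[2] == 'C') return "ORC";
        if (head[0] == '%' && head[1] == 'P' && head[2] == 'D' && head[3] == 'F') return "PDF";
        return "unknown (fall back to content heuristics, e.g., CSV delimiter detection)";
    }
}
```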
Yes, DgSecure detection can be deployed in distributed environments. It is designed to use resources optimally in a multi-node, multi-host distributed deployment at scale. For example, DgSecure detection for Hadoop leverages distributed computing through an agent-based architecture that runs all discovery tasks natively as Java MapReduce jobs on the Hadoop cluster. DgSecure detection for databases also uses an agent-based architecture, running as a multi-threaded service across database instances within an enterprise.
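In shape, a native discovery job looks something like the mapper below (not Dataguise's code): each mapper scans its input split and emits per-element counts, and a standard summing reducer aggregates them cluster-wide. It reuses the toy `ElementScanSketch` scanner from the earlier sketch:

```java
import java.io.IOException;
import java.util.Map;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Shape of a discovery-as-MapReduce job: each mapper scans its split for
// sensitive elements and emits (elementType, count); a standard summing
// reducer then produces cluster-wide totals per element.
public class DiscoveryMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // ElementScanSketch.scan is the toy policy scanner shown earlier.
        for (Map.Entry<String, Integer> hit : ElementScanSketch.scan(line.toString()).entrySet()) {
            ctx.write(new Text(hit.getKey()), new IntWritable(hit.getValue()));
        }
    }
}
```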
While discovery can be a resource-intensive operation even in a distributed architecture, DgSecure detection can be tuned to fit within an enterprise's infrastructure constraints. For example, the Hadoop HDFS agent can be throttled by limiting it to a certain number of maps. Our experience with large, global production deployments at customer sites shows that DgSecure detection scales with a low performance overhead of 5-10%. Dataguise is continually working to minimize this overhead and improve discovery performance.
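In Hadoop terms, throttling usually means capping concurrent map tasks or enlarging input splits. The snippet below uses standard MapReduce properties rather than anything Dataguise-specific, and the values are illustrative:

```java
import org.apache.hadoop.conf.Configuration;

// Generic Hadoop knobs for keeping a scan job's footprint small.
public class ThrottledScanConfig {
    public static Configuration build() {
        Configuration conf = new Configuration();
        // Cap how many map tasks run simultaneously (available in Hadoop 2.7+).
        conf.setInt("mapreduce.job.running.map.limit", 20);
        // Larger splits mean fewer maps over the same data (256 MB here).
        conf.setLong("mapreduce.input.fileinputformat.split.minsize", 256L * 1024 * 1024);
        return conf;
    }
}
```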
DgSecure detection uses three different techniques to minimize false positives and false negatives (a toy scoring sketch follows the list):
- First, contextual data is leveraged (e.g., column names, key words, reference data, meta-data, and primary key/foreign key relationships) to more accurately match and disambiguate sensitive data elements.
- Second, for structured data, DgSecure detection can present a confidence score so users can filter on the strength of the match, reducing or removing borderline near-misses.
- Third, Dataguise is continually adding advanced computational and statistical methods to reduce false negatives including NLP, Bayesian inference, domain-based ontologies, and machine learning techniques.
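As a toy illustration of the second technique (all names, weights, and thresholds are invented): a raw pattern match starts at a base confidence, contextual cues such as a matching column name push it up, and results below a user-chosen threshold are filtered out.

```java
import java.util.List;

// Toy confidence scoring, invented for illustration: a regex hit alone is
// ambiguous, so contextual cues (e.g., a column named "ssn") boost the
// score, and the UI filters results below a user-chosen threshold.
public class ConfidenceSketch {
    static double score(boolean patternMatched, String columnName, List<String> contextKeywords) {
        if (!patternMatched) return 0.0;
        double score = 0.5;                              // base: pattern match alone
        String col = columnName == null ? "" : columnName.toLowerCase();
        for (String kw : contextKeywords) {
            if (col.contains(kw)) score += 0.4;          // strong cue: matching column name
        }
        return Math.min(score, 1.0);
    }

    public static void main(String[] args) {
        List<String> ssnCues = List.of("ssn", "social");
        System.out.println(score(true, "cust_ssn", ssnCues));  // 0.9 -> keep
        System.out.println(score(true, "order_id", ssnCues));  // 0.5 -> likely filtered out
    }
}
```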