Why do leading companies rely on Dataguise for data security in Hadoop?

Dataguise delivers a unique approach to protecting sensitive data assets in Hadoop. We combine intelligent data detection with automated, data-centric encryption, masking, and audit tools for Flume, Sqoop, FTP, MapReduce, Hive, and Spark.
What types of organizations work with Dataguise?
Dataguise customers span a broad range of industries — financial services, insurance, healthcare, government, technology, and retail — and include some of the world’s largest, industry-leading companies. We work with organizations that embrace the tremendous potential of big data and are committed to being responsible data stewards.

What business goals does Dataguise help enterprises achieve?

We help our customers achieve two increasingly critical business goals:

1. Reducing breach risk and data loss through sensitive data protection.

2. Solving Hadoop compliance, privacy, and regulatory mandates for PII, PCI, PHI, and HIPAA, as well as data privacy and data residency laws.

When should a company use data-centric security in Hadoop instead of existing security solutions?

Detecting, protecting, and auditing sensitive data gives organizations an additional layer of protection beyond what’s possible with existing access control, authentication, authorization, and data-at-rest encryption available in Hadoop. This protective layer hands organizations the unique power to pinpoint, control, and audit sensitive data.

What organizations need data-centric security to keep their data secure?

Data-centric security is essential for organizations that need to:

  • Share data with “semi-trusted” users either inside the organization or externally.
  • Sell data to third-party partners.
  • Control and monitor internal access to gain better protection against insider risk.
  • Partially reveal data — while preserving data uniqueness for analytics — through intelligent data masking or format-preserving encryption.
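As an illustration of the last point, a deterministic, format-preserving masking scheme can hide digits while keeping a value's shape and uniqueness, so masked values still join and group correctly in analytics. The sketch below is illustrative only and is not Dataguise's actual masking algorithm:

```python
import hmac
import hashlib

SECRET_KEY = b"demo-key"  # illustrative only; real deployments use managed keys

def mask_digits(value: str, key: bytes = SECRET_KEY) -> str:
    """Deterministically replace each digit so the masked value keeps its
    format (digits stay digits, separators survive) and stays unique:
    the same input always yields the same output."""
    digest = hmac.new(key, value.encode(), hashlib.sha256).digest()
    out, i = [], 0
    for ch in value:
        if ch.isdigit():
            out.append(str(digest[i % len(digest)] % 10))
            i += 1
        else:
            out.append(ch)
    return "".join(out)
```

Because the mapping is keyed and deterministic, the same input always masks to the same token, which preserves uniqueness for analytics without revealing the original digits.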

How does Dataguise work within Hadoop distribution security that is already in place?

Most customers in regulated markets already use existing Hadoop security capabilities: authentication and access control (Kerberos), file system and network isolation and segmentation (ACLs and network firewalls), file or volume encryption, and activity monitoring (logging, auditing, and data lineage). Dataguise fits into and enriches these existing systems in a simple, non-blocking manner by operating at the sensitive data element level (e.g., locking down individual IDs or names).

Why is sensitive data detection so critical?

Businesses are rapidly adding new data sources to Hadoop analytics, including logging data, clickstream data, customer feedback, and sentiment data. As a result, all of this new data going into Hadoop is increasingly “gray” — its structure is harder to retain or maintain, it is harder to cleanse, and it is harder to determine where sensitive data is located within it and how much there is. Sensitive data detection helps organizations understand and protect this new, co-mingled, raw, and noisy data inside Hadoop.

How does the Dataguise detection process work?

The Dataguise detection process begins by defining a security policy. Organizations select which sensitive elements they need to detect; the rest of the process is automated. Through agents for data ingest (Flume, Sqoop, FTP) as well as agents for at-rest data (HDFS, Hive, Pig), Dataguise analyzes all data and filters and counts sensitive data elements in .txt, .csv, log, Avro, and SequenceFile formats, as well as common unstructured data formats (Word, Excel, PowerPoint, SMS, email).
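In outline, the filter-and-count step can be pictured as a policy of patterns applied to raw text. The policy names and patterns below are hypothetical placeholders, not Dataguise's detection logic, which goes well beyond plain regular expressions:

```python
import re
from collections import Counter

# Hypothetical policy: the sensitive element types this scan should detect.
POLICY = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def scan_text(text: str) -> Counter:
    """Count occurrences of each sensitive element type in raw text,
    mirroring the filter-and-count step of a detection agent."""
    counts = Counter()
    for name, pattern in POLICY.items():
        counts[name] += len(pattern.findall(text))
    return counts

sample = "alice@example.com,123-45-6789\nbob@example.org,987-65-4321\n"
counts = scan_text(sample)  # two SSNs and two email addresses
```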

What data types can Dataguise encrypt?

Our encryption engine runs as an automated process (“agent”) for data loaders (FTP, Flume, Sqoop). We also support native field- and row-level encryption inside an HDFS encryption agent. More generically, we provide a JAR for invoking encryption and decryption and have built decryption UDFs for Pig, Hive, and MapReduce.

What distributions does Dataguise support?

Dataguise is certified on all three major Hadoop distributions: Cloudera, Hortonworks, and MapR. In addition, downloadable Sandbox trials are available on both the Hortonworks and MapR partner websites. We also have production Hadoop customers using DgSecure for Hadoop with Apache Hadoop, Amazon Elastic MapReduce, and Pivotal HD.

What is DgSecure detection?

DgSecure detection helps organizations identify, locate, and classify sensitive data by cataloging and summarizing it. Dataguise customers can determine which sensitive data types exist and where they reside in enterprise data sources including relational databases, file systems, and big data Hadoop platforms. Using purpose-built lightweight software agents that run directly against such data repositories or within data pipelining tools (FTP, Flume, Sqoop), DgSecure detection can provide a comprehensive inventory of sensitive data across the enterprise landscape. The discovery results are presented via intuitive dashboards and reports that detail sensitive data by location and type.

Which repositories and platforms can DgSecure scan for sensitive data?

DgSecure detection can scan for sensitive data elements across a wide variety of common enterprise data sources and automatically detect and handle common file formats across data stores, including:

  • Relational Databases (RDBMS): Oracle, SQL Server, DB2 (Linux), DB2 (mainframe), Postgres, MySQL, Greenplum, Sybase, Teradata.
  • Non-relational big data stores.
  • Hadoop: All major distributions (Cloudera, Hortonworks, MapR, Pivotal, IBM BigInsights, Amazon EMR).
  • Cloud platforms: AWS (Hadoop on EMR) and Azure (SQL Server).
  • Structured file formats: relational database tables, Avro, SequenceFile, RC, ORC, and Hive tables.
  • Unstructured/semi-structured file formats: Microsoft Office (DOC, XLS, PPT, DOCX, XLSX, PPTX), plain text formats (TXT, CSV), log formats, Adobe PDF, social media streams (e.g., Twitter feeds), and clickstreams.

In addition, DgSecure can inspect data while it is being moved or transformed across the enterprise using ETL and pipelining tools including Apache Flume, FTP, and Apache Sqoop.

How can an enterprise benefit from DgSecure detection?

Using DgSecure, enterprises can:

    • Uncover, catalog, and summarize sensitive data in unknown or hard-to-measure areas of the enterprise data landscape, such as log files, clickstreams, machine data, and user-driven content (web, mobile, and office files).
    • Assist with forensic investigations or data breach preparedness by cataloging the potential universe of sensitive information vulnerabilities and by correlating access patterns to sensitive information with the extent and timing of breach events.
    • Help improve compliance with local, national, and international privacy regulations in a practical, repeatable, and efficient manner.
    • Provide data categorization to help build suitable policies to secure private and sensitive data. This includes helping build the foundation to determine what data “needs” to be protected and how, and to whom it needs to be accessible in a de-protected form.

What types of sensitive data can DgSecure detection find?

Dataguise customers can search through structured, semi-structured, or unstructured content to find a variety of sensitive data elements, including credit cards, Social Security Numbers, names, addresses, medical IDs, ABA bank routing numbers, and financial codes. In addition to pre-defined templates for these sensitive data types, our customers can also extend and build their own custom sensitive data elements through a sophisticated regex builder.
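To illustrate how a custom element might combine a pattern with validation, the sketch below pairs a broad candidate regex with a Luhn checksum to filter out digit runs that merely look like card numbers. It is a toy example, not the DgSecure regex builder:

```python
import re

# Broad candidate pattern: 13-16 digits, optionally separated by spaces or hyphens.
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")

def luhn_ok(number: str) -> bool:
    """Luhn checksum used to weed out random digit runs that merely
    look like card numbers."""
    digits = [int(d) for d in re.sub(r"\D", "", number)]
    checksum = 0
    for i, d in enumerate(reversed(digits)):
        if i % 2 == 1:       # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

def find_cards(text):
    """Return candidate matches that also pass the checksum."""
    return [m.group() for m in CARD_RE.finditer(text) if luhn_ok(m.group())]
```

Layering a checksum (or keyword and context checks) on top of a pattern is one common way to cut false positives for custom elements.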

What is unique about DgSecure detection?

DgSecure detection is a highly scalable, resilient, fault-tolerant, and customizable enterprise-class service for identifying and summarizing sensitive data at the element level.

DgSecure detection capabilities:

      • Handles high volumes of disparate, constantly moving, and changing data.
      • Supports a fluid or flexible information governance model that has a mix of highly “invested” (curated) data as well as raw, unexplored (gray) data including IoT (Internet of Things) data, clickstreams, feeds, and logs.
      • Handles a variety of data stores such as traditional relational databases and enterprise data warehouses as well as non-relational big data sources (Hadoop) and file repositories (SharePoint and file shares).
      • Processes structured, semi-structured, and unstructured or freeform data formats.
      • Provides automated detection and processing of a variety of file formats and file/directory structures, leveraging meta-data and schema-on-read where applicable.
      • Provides deep content inspection using techniques such as patent-pending neural-like network (NLN) technology, and dictionary-based and weighted keyword matches.
      • Employs breakthrough computational and statistical approaches to detect sensitive data more accurately.

Does DgSecure detection support discovering sensitive data for common data protection regulations and standards? If so, which ones?

Yes, DgSecure detection provides out-of-the-box pre-built templates for identifying sensitive data regulated by national and international data protection laws and industry standards including:

      • HIPAA
      • PCI
      • PHI
      • PII
      • FERPA
      • HITECH

In addition, DgSecure detection can be extended or tailored for unique requirements, such as custom data that requires special protection due to intellectual property and/or competitive requirements.

How complex is the setup and operation of DgSecure detection?

DgSecure detection is designed to maximize ease of use, simplicity, and automation. Configuring and running detection involves the following simple steps:

      • Defining a policy for the sensitive elements that a user needs to find.
      • Selecting top-level locations (e.g., directory or sets of directories) to scan.
      • Selecting options such as sample size and black/whitelist of sensitive items to use for scanning.
      • Choosing whether to re-execute the detection incrementally at pre-set intervals.

Once the scan is complete, DgSecure detection presents the user with a consolidated view of all the sensitive data elements of interest and their locations. Users can then drill down into specific areas and set up appropriate filtered views of their data. From the same interface, users can apply appropriate data protection to the discovered elements using DgSecure’s masking or encryption services.
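Concretely, the setup steps above could be captured in a small configuration object. The keys and values here are hypothetical and only mirror the steps described; they are not DgSecure's actual API:

```python
# Hypothetical scan configuration mirroring the setup steps above;
# field names and values are illustrative only.
scan_config = {
    "policy": ["credit_card", "ssn", "email"],    # sensitive elements to find
    "locations": ["/data/raw", "/data/landing"],  # top-level directories to scan
    "sample_size": 0.10,                          # scan 10% of rows per file
    "blacklist": ["/data/raw/tmp"],               # items or paths to skip
    "schedule": {"incremental": True, "interval_hours": 24},
}
```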

Can DgSecure detection leverage external information to look for specific sensitive data elements? If so, how?

Yes, DgSecure uses structural cues (e.g., schemas, indices, and file structures) and contextual cues wherever possible to speed up discovery. For example, it can automatically determine the file type of each file in the scan path and infer file structure using heuristics or a user-provided set of schemas and/or structure definitions, speeding up matching during the scan. In addition, Dataguise is continually enhancing DgSecure detection to leverage the full context of data: recognizing sensitive data when only partial information is available, using ontologies to disambiguate data, building Bayesian models that update the certainty of sensitive data found as more data is processed, and processing unstructured text with NLP integrated into the DgSecure detection engine.
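As a hedged sketch of the Bayesian idea mentioned above, each match outcome can update the belief that a column holds sensitive data. The likelihood values below are invented for illustration and are not Dataguise's actual model:

```python
def update_certainty(prior: float, matched: bool,
                     p_match_if_sensitive: float = 0.9,
                     p_match_if_not: float = 0.05) -> float:
    """One Bayes-rule update of the belief that a column is sensitive,
    given whether the current value matched the pattern."""
    if matched:
        num = p_match_if_sensitive * prior
        den = num + p_match_if_not * (1 - prior)
    else:
        num = (1 - p_match_if_sensitive) * prior
        den = num + (1 - p_match_if_not) * (1 - prior)
    return num / den

# Start undecided, then fold in four per-value match outcomes.
belief = 0.5
for hit in [True, True, False, True]:
    belief = update_certainty(belief, hit)
```

Processing more data moves the belief toward certainty in either direction, which is the behavior the passage describes.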

Does DgSecure detection support a distributed deployment and can it be scaled across distributed architectures?

Yes, DgSecure detection can be deployed in distributed environments. It is designed to leverage resources optimally in a multi-node or multi-host distributed deployment at scale. For example, DgSecure detection for Hadoop leverages distributed computing through an agent-based architecture that runs all discovery tasks natively as Java MapReduce jobs on the Hadoop cluster. DgSecure detection for databases also uses an agent-based architecture to run as a multi-threaded service across database instances within an enterprise.
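The multi-threaded pattern is straightforward to sketch: fan per-table scans out over a bounded worker pool. The table names and hit counts below are fabricated placeholders, not output from a real agent:

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-table scan; a real agent would query the database itself.
def scan_table(table):
    fake_hits = {"customers": 12, "orders": 0, "patients": 7}
    return table, fake_hits.get(table, 0)

def scan_instance(tables, max_workers=4):
    """Fan per-table scans out over a bounded worker pool, the way a
    multi-threaded detection agent can cover a database instance."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(scan_table, tables))
```

Bounding `max_workers` keeps the agent from overwhelming the database while still scanning tables concurrently.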

How does this solution scale out? What are the performance impacts of data-centric discovery?

While discovery can be a resource-intensive operation even in a distributed architecture, DgSecure detection can be tuned to fit within an enterprise’s infrastructure constraints. For example, the Hadoop HDFS agent can be throttled by limiting it to a certain number of maps. Our experience with large, global production deployments at customer sites shows that DgSecure detection scales with a low performance overhead of 5-10%. Dataguise continues to work on reducing this overhead further.
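One generic way to bound discovery cost, independent of DgSecure's own throttling knobs, is to scan only a sample of each input; a fixed seed keeps repeated runs comparable. This is a hedged sketch, not Dataguise's tuning mechanism:

```python
import random

def sample_lines(lines, rate: float, seed: int = 7):
    """Return roughly `rate` of the input lines for scanning, chosen
    reproducibly so repeated runs inspect the same sample."""
    rng = random.Random(seed)
    return [ln for ln in lines if rng.random() < rate]
```

Trading a small amount of coverage for a large reduction in work is the same lever as capping map tasks: both keep discovery inside a fixed resource budget.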

How does DgSecure detection handle false positives and false negatives?

DgSecure detection uses three different techniques to minimize false positives/negatives:

      • First, contextual data is leveraged (e.g., column names, key words, reference data, meta-data, and primary key/foreign key relationships) to more accurately match and disambiguate sensitive data elements.
      • Second, for structured data, DgSecure detection can present a confidence score so users can filter results by match strength, reducing or removing near-miss matches.
      • Third, Dataguise is continually adding advanced computational and statistical methods to reduce false negatives, including NLP, Bayesian inference, domain-based ontologies, and machine learning techniques.
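To make the confidence-score idea concrete, the toy function below blends a pattern match with a column-name cue into a single score; the weights and keywords are invented for illustration and are not DgSecure's model:

```python
import re

SSN_RE = re.compile(r"^\d{3}-\d{2}-\d{4}$")

def ssn_confidence(value, column_name):
    """Blend a pattern match with a contextual cue (the column name)
    into a single score; the weights and keywords are illustrative."""
    score = 0.0
    if SSN_RE.match(value):
        score += 0.6  # the value itself looks like an SSN
    if any(kw in column_name.lower() for kw in ("ssn", "social", "tax_id")):
        score += 0.4  # the column name supports the match
    return score

# Filtering on the score lets users drop near-misses:
hits = [("123-45-6789", "ssn"), ("123-45-6789", "order_ref")]
strong = [h for h in hits if ssn_confidence(*h) >= 0.8]
```

A value that matches the pattern but sits in an unrelated column scores lower, so a threshold filter keeps only the well-supported findings.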