FAQs

Discovery

DgSecure Discovery helps customers identify, locate, and classify sensitive data by cataloging and summarizing it. It helps customers determine which sensitive data types exist and where they reside in enterprise data sources such as relational databases, file systems, and Big Data Hadoop platforms. Using purpose-built, lightweight software agents that run directly against such data repositories or within data pipelining tools (FTP, Flume, Sqoop), DgSecure Discovery can provide a comprehensive inventory of sensitive data across the enterprise landscape. The discovery results are presented via intuitive dashboards and reports that detail sensitive data by location and type.

The DgSecure Discovery service can scan for sensitive data elements across a wide variety of common enterprise data sources and automatically detect and handle common file formats across such data stores:

  • Relational Databases (RDBMS):  Oracle, SQL Server, DB2 (Linux), DB2 (mainframe), Postgres, MySQL, Greenplum, Sybase, Teradata
  • Non-relational Big Data stores
    • Hadoop: All major distributions (Cloudera, Hortonworks, MapR, Pivotal, IBM BigInsights, Amazon EMR)
    • Cloud (e.g. Amazon cloud, Azure): AWS Hadoop (EMR), Azure (SQL Server)
  • Structured file formats: Relational database tables, Avro, SequenceFile, RCFile, ORC, Hive tables
  • Unstructured/semi-structured file formats: Microsoft Office (DOC, XLS, PPT, DOCX, XLSX, PPTX), Plain Text formats (TXT, CSV), log formats, Adobe PDF, Social media streams (for example Twitter feeds) and clickstreams

In addition, DgSecure Discovery can inspect data while it is being moved or transformed across the enterprise using ETL and pipelining tools such as Apache Flume, FTP and Apache Sqoop (Kafka available late Q2 ‘15).

  • Uncover, catalog, and summarize sensitive data in unknown or hard-to-measure areas of the enterprise data landscape residing in log files, clickstreams, machine data, user-driven content (web, mobile, office files, etc.)
  • Assist with forensic investigations or data breach preparedness at an enterprise by cataloging the potential universe of sensitive information vulnerabilities and by correlating access patterns to sensitive information with the extent and timing of the breach events
  • Help to improve compliance with local, national and international privacy regulations in a practical, repeatable, and efficient manner
  • Provide data categorization to help build suitable policies to secure private and sensitive data (i.e. help build the foundation to determine which data “needs” to be protected and how, and to whom it needs to be accessible in de-protected form)
Customers can search through structured, semi-structured, or unstructured content to find a variety of sensitive data elements such as Credit Cards, Social Security Numbers, Names, Addresses, Medical IDs, ABA bank routing numbers, and financial codes. In addition to pre-defined templates for such sensitive data types, customers can also extend and build their own custom sensitive data elements through a sophisticated regex builder.
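
As a rough illustration of what a custom sensitive data element might look like, the sketch below pairs a regular expression with a checksum validator. The pattern names and the `find_sensitive` helper are hypothetical and are not part of DgSecure; the Luhn check is the standard checksum used by payment card numbers and is shown here only as one way to cut down on false matches.

```python
import re

# Hypothetical patterns for illustration; these are not DgSecure's built-in templates.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
# 13 to 16 digits, optionally separated by single spaces or hyphens.
CARD_PATTERN = re.compile(r"\b\d(?:[ -]?\d){12,15}\b")


def luhn_valid(candidate: str) -> bool:
    """Return True if the digits in candidate pass the Luhn checksum."""
    digits = [int(ch) for ch in candidate if ch.isdigit()]
    total = 0
    for index, digit in enumerate(reversed(digits)):
        if index % 2 == 1:  # double every second digit from the right
            digit *= 2
            if digit > 9:
                digit -= 9
        total += digit
    return len(digits) >= 13 and total % 10 == 0


def find_sensitive(text: str) -> list[tuple[str, str]]:
    """Return (element_type, matched_text) pairs found in text."""
    hits = [("SSN", m.group()) for m in SSN_PATTERN.finditer(text)]
    for m in CARD_PATTERN.finditer(text):
        # Keep only digit runs that also pass the checksum.
        if luhn_valid(m.group()):
            hits.append(("CREDIT_CARD", m.group()))
    return hits
```

Calling `find_sensitive` on a line of text returns the SSN-shaped strings plus any card-shaped strings that pass the checksum; digit runs that merely look like card numbers are dropped.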

DgSecure Discovery is a highly scalable, resilient, fault-tolerant, and customizable enterprise-class service for identifying and summarizing sensitive data at the element level. Specifically, DgSecure Discovery:

  • Handles high volumes of disparate, constantly moving, and changing data
  • Supports a fluid or flexible information governance model that has a mix of highly “invested” (curated) data as well as raw unexplored data (gray data) such as IoT (Internet of Things) data, clickstreams, feeds and logs
  • Handles a variety of data stores such as traditional relational databases and enterprise data warehouses, as well as non-relational Big Data sources (Hadoop) and file repositories (SharePoint and file shares)
  • Processes structured, semi-structured, and unstructured or freeform data formats
  • Provides automated detection and processing of a variety of file formats and file/directory structures, leveraging metadata and schema-on-read where applicable
  • Provides deep content inspection using techniques such as patent-pending neural-like network (NLN) technology, dictionary-based and weighted keyword matches
  • Employs breakthrough computational and statistical approaches to detect sensitive data more accurately

Yes. DgSecure Discovery provides out-of-the-box pre-built templates for identifying sensitive data regulated by national and international data protection laws and industry standards including:

  • HIPAA
  • PCI
  • PHI
  • PII
  • FERPA
  • HITECH

In addition, DgSecure Discovery can be extended or tailored for unique requirements (such as custom data that requires special protection due to intellectual property and/or competitive requirements).

DgSecure Discovery is designed for ease of use, simplicity, and automation. Configuring and running is as easy as:

  • Defining a policy that specifies which sensitive elements the user needs to find
  • Selecting top level locations (directory or sets of directories) to scan
  • Selecting options such as sample size and black/whitelist of sensitive items to use for scanning
  • Optionally choosing to re-execute discovery incrementally at pre-set intervals
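
The steps above can be pictured as a small configuration object. The sketch below is purely illustrative: `ScanConfig` and every field name in it are hypothetical and do not reflect DgSecure's actual configuration format, and the reading of the whitelist and blacklist (values to ignore versus values to always report) is an assumption.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ScanConfig:
    """Hypothetical scan settings; names are illustrative, not DgSecure's."""

    policy_elements: list[str]  # sensitive element types to look for
    locations: list[str]        # top-level directories to scan
    sample_size: int = 1000     # records sampled per file or table
    whitelist: set[str] = field(default_factory=set)  # assumed: values to ignore
    blacklist: set[str] = field(default_factory=set)  # assumed: values to always report
    rescan_interval_hours: Optional[int] = None       # None means run once

    def validate(self) -> None:
        """Raise ValueError if the configuration cannot describe a scan."""
        if not self.policy_elements:
            raise ValueError("policy must name at least one sensitive element")
        if not self.locations:
            raise ValueError("at least one location is required")
        if self.sample_size <= 0:
            raise ValueError("sample_size must be positive")
        if self.rescan_interval_hours is not None and self.rescan_interval_hours <= 0:
            raise ValueError("rescan_interval_hours must be positive when set")


config = ScanConfig(
    policy_elements=["SSN", "CREDIT_CARD"],
    locations=["/data/landing"],
    rescan_interval_hours=24,
)
config.validate()
```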

Once the scan is complete, DgSecure Discovery presents the user with a consolidated view of all the sensitive data elements of interest and their locations. Users can now drill down into specific areas and set up appropriate filtered views of the data. From the same interface, users can apply appropriate data protection to the discovered elements using DgSecure’s masking or encryption services.

Yes. DgSecure Discovery uses structural (schemas, indices, file structures, etc.) and contextual cues wherever possible to speed up discovery. For example, it can determine the file type of each file in the scan path automatically and infer the structure of the files by using heuristics or a user-provided set of schemas and/or structure definitions to speed up matching during the scan.
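
One common heuristic for determining a file's type without trusting its extension is to compare its leading bytes against known signatures. The sketch below shows that general technique under stated assumptions; it is not a description of how DgSecure Discovery does it, and `detect_file_type` and the returned labels are hypothetical. The signatures themselves are the published leading bytes of each format.

```python
# (signature, label) pairs checked in order against the start of the file.
MAGIC_SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"PK\x03\x04", "ooxml-or-zip"),                 # DOCX, XLSX, PPTX are ZIP containers
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # legacy DOC, XLS, PPT
    (b"Obj\x01", "avro"),
    (b"ORC", "orc"),
    (b"SEQ", "sequencefile"),
    (b"RCF", "rcfile"),
]


def detect_file_type(header: bytes) -> str:
    """Guess a file type from the first bytes of a file."""
    for signature, label in MAGIC_SIGNATURES:
        if header.startswith(signature):
            return label
    # Fallback: treat cleanly decodable content as plain text (TXT, CSV, logs).
    # Assumption: the header is not cut in the middle of a multi-byte character.
    try:
        header.decode("utf-8")
        return "text"
    except UnicodeDecodeError:
        return "unknown"
```

In practice the header would come from reading the first few bytes of each file in the scan path, for example `open(path, "rb").read(16)`.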

In addition, Dataguise is continually enhancing the capability of DgSecure Discovery to leverage the full context of data. Examples include recognizing sensitive data when only partial information is available, using ontologies to disambiguate data, building Bayesian models to update the certainty of sensitive data found as more data is processed, and processing unstructured text using NLP integrated into the DgSecure Discovery engine.
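
To make the idea of updating certainty as more data is processed concrete, here is a minimal sketch of a Bayesian update for the belief that a column holds a given sensitive element. It is a generic application of Bayes' rule, not Dataguise's model; the function name and both likelihood values are assumptions chosen for illustration.

```python
def update_posterior(
    prior: float,
    matched: bool,
    p_match_if_sensitive: float = 0.9,  # assumed chance a truly sensitive value matches
    p_match_if_not: float = 0.05,       # assumed chance a non-sensitive value matches
) -> float:
    """Return P(column is sensitive) after observing one sampled value."""
    if matched:
        like_sensitive, like_not = p_match_if_sensitive, p_match_if_not
    else:
        like_sensitive, like_not = 1 - p_match_if_sensitive, 1 - p_match_if_not
    numerator = like_sensitive * prior
    return numerator / (numerator + like_not * (1 - prior))


# Start undecided and fold in observations one sampled value at a time.
belief = 0.5
for observed_match in [True, True, False, True]:
    belief = update_posterior(belief, observed_match)
```

Each matching value pushes the belief toward 1 and each non-matching value pulls it back, so a single stray match in an otherwise clean column carries little weight.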

Yes. DgSecure Discovery can be deployed in distributed environments. It is designed to leverage resources optimally in a multi-node or multi-host distributed deployment at scale. For example, DgSecure Discovery for Hadoop leverages distributed computing through an agent-based architecture that runs all discovery tasks natively as Java MapReduce jobs on the Hadoop cluster. DgSecure Discovery for Databases also uses an agent-based architecture to run as a multi-threaded service across database instances within an enterprise.

While discovery can be a resource-intensive operation even in a distributed architecture, customers can tune DgSecure Discovery to fit within their infrastructure constraints. For example, the Hadoop HDFS agent can be throttled by limiting it to a certain number of map tasks. Our experience with large global production deployments at customer sites shows that DgSecure Discovery scales with a low performance overhead of 5-10%. Dataguise is continually working to reduce this overhead and improve discovery performance.

DgSecure Discovery uses three different techniques to minimize false positives and false negatives. First, contextual data (column names, keywords, reference data, metadata, and primary key/foreign key relationships) is leveraged to match and disambiguate sensitive data elements more accurately. Second, for structured data, DgSecure Discovery can present a confidence score so that users can filter by extent of match, reducing or removing near-miss matches. Third, Dataguise is continually adding advanced computational and statistical methods to reduce false negatives, including NLP, Bayesian inference, domain-based ontologies, and machine learning techniques.
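
The first two techniques can be sketched together: a per-column confidence score that combines the share of sampled values matching a pattern with a contextual cue from the column name, followed by a threshold filter. This is an illustrative sketch only. The scoring formula, the `name_boost` weight, the hint list, and the function name are all assumptions, not DgSecure's scoring method.

```python
import re

SSN_VALUE = re.compile(r"^\d{3}-\d{2}-\d{4}$")
NAME_HINTS = ("ssn", "social")  # hypothetical column-name cues


def column_confidence(column_name: str, values: list[str], name_boost: float = 0.2) -> float:
    """Score in [0, 1] for how likely a column is to hold SSNs."""
    if not values:
        return 0.0
    match_ratio = sum(1 for v in values if SSN_VALUE.match(v)) / len(values)
    # Contextual cue: a suggestive column name raises the score,
    # but only when at least some values matched the pattern.
    hinted = any(hint in column_name.lower() for hint in NAME_HINTS)
    boost = name_boost if hinted and match_ratio > 0 else 0.0
    return min(1.0, match_ratio + boost)


samples = {
    "cust_ssn": ["123-45-6789", "n/a", "987-65-4321", "111-22-3333"],
    "order_ref": ["123-45-6789", "A1", "B2", "C3"],
}
scores = {name: column_confidence(name, vals) for name, vals in samples.items()}
# Keep only columns at or above a user-chosen threshold.
flagged = [name for name, score in scores.items() if score >= 0.5]
```

Raising the threshold trades recall for precision: borderline columns such as `order_ref` drop out of the results, while columns supported by both content and context remain.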