What to Include in Your RFP and POC for Personal Information Discovery Technology

As companies look to minimize their risk profile and address privacy requirements, they seek tools that will help them complete the first step in building an effective privacy program: creating (and maintaining) an inventory of the personal information they process across the enterprise.

Many technology vendors offer such solutions, but there is little guidance on which features matter most in these tools. Asking for the key features in your Request-for-Proposal (RFP) and testing those features during your Proof-of-Concept (POC) will save you time and money when selecting a solution. The following is a list of the most important technical capabilities for the discovery of personal information at scale.

How accurate is the discovery?

  • Require the competing technologies to demonstrate whether and how their tools can apply a variety of intelligent methodologies to limit false positives in the discovery process for more accurate results.
  • Compare the technologies’ ability to create custom element definitions that can be applied with low false-positive results. Compare the options each technology provides for creating the custom elements, such as the use of context, validations, inclusion/exclusion lists, and confidence factor variables.
  • Compare the technologies for their appropriate use of machine learning for discovery.
  • Compare the technologies for their ability to discover relevant data elements in documents.
  • Confirm that once you correct the tool for a wrong discovery call on a data element, the technology does not repeat the error in subsequent scans.
  • Compare the technologies for the flexibility they offer in adding the validation logic to custom-defined sensitive elements.
  • Identify whether different technologies can be more efficient in the identification of data subjects (individuals) by utilizing the data element discovery process.
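
To make the custom-element criteria above concrete during a POC, it can help to have a small reference of what "context, validations, exclusion lists, and confidence factors" mean in practice. The following is a minimal sketch, not any vendor's implementation: the `ElementDefinition` class, the `find_candidates` function, the keyword list, the 40-character context window, and the 0.9/0.5 confidence values are all illustrative assumptions. The Luhn checksum is the standard check-digit validation used for payment card numbers.

```python
import re
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Tuple


def luhn_valid(digits: str) -> bool:
    """Standard Luhn checksum: double every second digit from the right."""
    total = 0
    for i, ch in enumerate(reversed(digits)):
        d = int(ch)
        if i % 2 == 1:
            d *= 2
            if d > 9:
                d -= 9
        total += d
    return total % 10 == 0


@dataclass
class ElementDefinition:
    """Hypothetical custom element definition (all field names are assumptions)."""
    name: str
    pattern: "re.Pattern"
    validator: Callable[[str], bool]
    context_keywords: Tuple[str, ...]
    exclusions: FrozenSet[str] = frozenset()
    context_window: int = 40  # characters inspected before each match


def find_candidates(text: str, definition: ElementDefinition) -> List[Tuple[str, float]]:
    """Return (value, confidence) pairs that survive validation and exclusion."""
    results = []
    for match in definition.pattern.finditer(text):
        digits = re.sub(r"\D", "", match.group())
        # Exclusion list and validation both remove false positives outright.
        if digits in definition.exclusions or not definition.validator(digits):
            continue
        start = max(0, match.start() - definition.context_window)
        window = text[start:match.start()].lower()
        has_context = any(word in window for word in definition.context_keywords)
        # Illustrative confidence factors: nearby context raises confidence.
        results.append((digits, 0.9 if has_context else 0.5))
    return results


CARD = ElementDefinition(
    name="payment_card_number",
    pattern=re.compile(r"\b\d(?:[ -]?\d){12,18}\b"),
    validator=luhn_valid,
    context_keywords=("card", "credit", "visa"),
    exclusions=frozenset({"4242424242424242"}),  # a well-known test number
)
```

Running `find_candidates` over a sample that mixes a real-looking card number, an order reference of the same length, and a known test number gives you a baseline to compare against each vendor's results: only the value that passes the checksum and is not excluded should be reported.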

Is the technology a good fit for your IT environment and data protection policies?

  • Compare the technologies for the breadth of platforms, systems, environments and file types where they are able to perform the discovery of personal information.
  • Provide your data classification policy to the competing vendors and compare whether their tools accurately classify the discovered data according to it.
  • Compare the technologies for the granularity, and relevance to your operations, of the data elements they can discover out-of-the-box.
  • Compare the technologies for their ability to apply data minimization techniques to the discovered data (e.g., partial masking of certain data elements), as part of the discovery process, to limit the exposure of sensitive information.
  • Compare the technologies for providing a functional audit trail for all activities and tasks.
  • Compare the technologies for their ability to send alerts based on selected events related to the scanning process and policy violations based on the sensitivity of the file or database object.
  • Compare the technologies for their ability to automatically protect discovered data in accordance with your policy, without impacting downstream applications.
  • Compare the technologies for their ability to identify high-risk combinations of data elements at the table or document level (rather than just individual elements), in accordance with your data classification policy.
  • Compare the technologies for their out-of-the-box reporting capabilities, paying particular attention to how those reports can address your different compliance and risk management policy requirements.
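
Two of the criteria above, partial masking and high-risk combinations of data elements, are easier to evaluate if you write down the behavior you expect before the POC. The sketch below is one way to express that expectation. The element names, the classification labels, and the rules in `HIGH_RISK_COMBINATIONS` are hypothetical placeholders; substitute the ones from your own data classification policy.

```python
from typing import List, Set, Tuple


def mask_partial(value: str, visible: int = 4, mask_char: str = "*") -> str:
    """Hide all but the last `visible` characters of a discovered value."""
    if len(value) <= visible:
        return mask_char * len(value)
    return mask_char * (len(value) - visible) + value[-visible:]


# Hypothetical policy: ordered from most to least sensitive, first match wins.
HIGH_RISK_COMBINATIONS: List[Tuple[Set[str], str]] = [
    ({"full_name", "ssn"}, "restricted"),
    ({"full_name", "date_of_birth", "postal_code"}, "confidential"),
]


def classify_table(elements: Set[str]) -> str:
    """Classify a table or document by the combination of elements found in it."""
    for combination, label in HIGH_RISK_COMBINATIONS:
        if combination <= elements:  # every element of the combination is present
            return label
    # Assumed defaults: any personal element is "internal", none is "public".
    return "internal" if elements else "public"
```

For example, `mask_partial("123-45-6789")` yields `*******6789`, and a table containing `full_name`, `ssn` and `email` is classified as `restricted` even though no single element on its own triggers that label.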

Can the technology accommodate a large-scale environment through the use of efficient processing?

  • Validate claims of automatic discovery, i.e., without human guidance or the need to review system documentation, by observing the discovery process and tracking its duration.
  • Compare the time it takes the different technologies to recognize the schemas, tables, and data elements of interest in several systems.
  • Compare the technologies according to their ability both to conduct a full scan of the data to be discovered and to use sampling techniques for that purpose. Compare and test the diversity of available sampling options.
  • Compare the accuracy and speed by which each technology detects column-level metadata (i.e., what is in this column) and specific data elements (i.e., the different values within the column).
  • When comparing the technologies for their ability to discover personal information in unstructured documents (Word, PDF, PPT, etc.), identify whether they can automatically detect the file type, size, ownership, permissions and modification times, and whether the user can select specific files or file types for scanning to save time (e.g., avoid log files).
  • Compare the technologies for their ability to stop the scan process in mid-scan and then resume the scan from that point, rather than restarting the scan from the beginning.
  • Compare the technologies for their ability to save scanning time by using a distributed architecture for scanning data closer to the target system, leveraging flexible multithreading to utilize multiple CPUs.
  • Compare the technologies for their ability to identify incremental data additions to previously scanned databases and document repositories.
  • Identify whether the technologies can compare scans for differences in data schema and data elements between current and previous scans.
  • Compare the technologies for their ability to maintain the actual sensitive data in the repository it was found in (i.e., rather than copy it to another location) to reduce risk and improve efficiency.
  • Compare the time it takes for the different technologies to identify custom elements as opposed to out-of-the-box elements.
  • Compare the completeness of the APIs provided with each technology and the functions those APIs address as they relate to the intended use of the technology in your organization.
  • Have the different technologies demonstrate their built-in scheduler functionality for handling different task types and running them within a set time window.
  • Have the competing technologies demonstrate how resources on target repositories can be throttled up or down depending upon the time of day to accommodate the needs of other users.
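
The stop-and-resume criterion above is worth testing explicitly, because restarting a multi-day scan from the beginning is costly. As a point of reference, the sketch below shows the general checkpointing idea under simple assumptions: the items to scan are an ordered list, progress is recorded in a small JSON file, and the `stop_after` parameter simulates an interruption. The function name, the parameters, and the checkpoint format are hypothetical and do not describe how any particular product works.

```python
import json
from pathlib import Path
from typing import Callable, Optional, Sequence


def scan(
    items: Sequence[str],
    checkpoint_path: str,
    process: Callable[[str], None],
    stop_after: Optional[int] = None,
) -> bool:
    """Scan items in order, resuming from the checkpoint if one exists.

    Returns True when the scan completes and False when it is interrupted.
    """
    path = Path(checkpoint_path)
    start = json.loads(path.read_text())["next_index"] if path.exists() else 0
    processed = 0
    for index in range(start, len(items)):
        if stop_after is not None and processed >= stop_after:
            return False  # simulated mid-scan stop; the checkpoint stays on disk
        process(items[index])
        # Record progress after each item so a resume skips completed work.
        path.write_text(json.dumps({"next_index": index + 1}))
        processed += 1
    path.unlink(missing_ok=True)  # scan finished, so the checkpoint is cleared
    return True
```

During the POC, the equivalent check is to stop a vendor's scan partway through, resume it, and confirm from the audit trail that already-scanned objects are not processed a second time.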

How easy is it to deploy the discovery technology in your environment?

  • Compare the technologies for their ability to automate an end-to-end workflow without the need for a GUI.
  • Compare the technologies for their deployment options: container, non-container, multi-OS support, hybrid (cloud and on-premises), and deployment across geographies with a central reporting server that does not require transporting actual sensitive data back.
  • Confirm whether the competing technologies can operate without having to be installed on the target production repository.

As you can tell from this list of capabilities, discovery technologies can be compared objectively. Remember to establish objective goals for the different discovery features you intend to compare, and selecting the right technology at the end of the POC will be straightforward.

About the author:

Sagi Leizerov, Ph.D., SVP Enterprise Privacy Solutions at Dataguise

Sagi is a Certified Information Privacy Professional (CIPP/US) with over 20 years of privacy and data governance experience. You can check out his full bio here.