What’s Worse: Underestimating or Overestimating the Size of a Data Breach?

As the amount of data stored and used by organizations around the world steadily increases, so do the risks and costs associated with lost or exposed personal information. While the reported number of data breaches varies by reporting agency, depending on the criteria used, one unmistakable trend is clear: the number of data breaches and the amount of data exposed continue to increase steadily, in step with the total data stored by organizations.1

For example, according to one report from Forbes,2 more than 3,800 data breaches were reported in the first six months of 2019. The number of compromised records in these breaches is estimated at about 4.1 billion, and the average breach cost the affected company $3.92 million.3

While Dataguise always recommends that organizations rethink data management and protection practices to lower the risk of a data breach, the fact remains that breaches are inevitable. Organizations need to operate under the assumption that a breach will happen to them. When it does, the key factors become timing, severity, and the ability to respond proactively to remediate and recover, both financially and reputationally.

Breach reporting requirements

The General Data Protection Regulation (GDPR) requires a data breach to be reported to the Data Protection Authority (DPA) within 72 hours of its discovery. Affected persons must be notified if the breach poses a high risk to their rights and freedoms. The California Consumer Privacy Act (CCPA) does not impose a time-based deadline such as 72 hours, but it requires businesses to notify affected persons as early as possible, and to notify the relevant government agency if the breach affects more than 500 persons. Both regulations require that as much information about the breach as possible be submitted. Under both, one of the critical pieces of information that must be reported is the estimated number of unique accounts or persons affected by a breach.

There are dangers in both under- and over-counting the number of affected people. Under-counting, which is more common, reduces the organization's credibility and opens it up to negligence lawsuits. Over-counting also damages credibility and, in addition, creates unnecessary panic as well as inflated insurance and other costs. Several well-known organizations, such as LinkedIn and Dropbox, have had to issue corrections to their original estimates of affected persons, increasing the original numbers by orders of magnitude. LinkedIn initially announced in June 2012 that 6.5 million users were affected in a breach that year. However, it revised the estimate four years later, to 117 million users, when third parties reported that more credentials were up for sale on the dark web.4 In the Dropbox case, the initial announcement was vague about the number of accounts impacted, but a later correction admitted that 68 million accounts were affected.5

Among the items that must be reported when a breach occurs, the estimated number of affected persons is a particularly challenging and critical element. Undercounting is often the result of the organization leaving out large groups of affected data sets in its initial count. Within the data sets that are counted, however, there is a high likelihood of over-counting unless a brute-force unique count is performed. This is because transactional data systems are likely to contain repeated transactions by the same person. Examples include multiple bookings by the same person on a travel website, or multiple charges to a credit card on a given day. If these duplicates are not weeded out, the reported number of affected persons could be a serious overcount. Such an overcount does not necessarily offset the effect of leaving out large data sets in the first place; the net result can still be an overall undercount of affected persons.
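The gap between counting transactions and counting unique persons can be illustrated with a minimal sketch. The data and field names below are hypothetical, not drawn from any real breach:

```python
# Hypothetical transaction log in which the same person appears more
# than once (e.g., repeat bookings on a travel site). Counting rows
# overstates the number of affected persons; counting unique
# identifiers does not.
transactions = [
    {"txn_id": 1, "email": "alice@example.com"},
    {"txn_id": 2, "email": "bob@example.com"},
    {"txn_id": 3, "email": "alice@example.com"},  # repeat customer
    {"txn_id": 4, "email": "carol@example.com"},
    {"txn_id": 5, "email": "bob@example.com"},    # repeat customer
]

row_count = len(transactions)                           # 5 transactions
unique_count = len({t["email"] for t in transactions})  # 3 persons
```

Reporting the row count here would overstate the impact by two-thirds; at the scale of billions of records, brute-force deduplication like this set-based count becomes the computationally expensive step.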

Improve your response and get your reporting right the first time

In summary, one of an organization's greatest concerns is managing the risks associated with a data breach. These risks come in many varieties: brand/reputation damage, revenue impacts (compensating affected customers, fines, lawsuits), operational disruption, and the risk of being unprepared to handle the aftermath of a breach effectively. As we saw above, accurately estimating the unique number of data subjects or accounts affected by a breach is critical for effective breach reporting. When big data is involved, it is computationally impractical to count all the unique data subjects or accounts across the entirety of the data, especially within tight compliance timeframes. Instead, taking a sample and extrapolating the unique count to a known level of accuracy becomes essential.

Dataguise has developed a patent-pending method involving neural network computation that is able to extrapolate and project the unique counts in a data set to a very high level of accuracy by taking a small sample of data. The method has been tested with real-world customer data and has been found to be more than 90% accurate in its projections. With this breakthrough technology, organizations can lower risk by giving both customers and the data protection authorities a very accurate estimation of the impact of a data breach.
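Dataguise's patent-pending neural-network method is proprietary and is not shown here. As a generic point of comparison only, a standard streaming sketch such as HyperLogLog also estimates unique counts to within a few percent without a brute-force count. The minimal sketch below (parameter choices and identifiers are illustrative) shows the idea:

```python
import hashlib
import math

def _hash64(value: str) -> int:
    # 64-bit hash derived from SHA-256 (standard library only).
    return int.from_bytes(hashlib.sha256(value.encode()).digest()[:8], "big")

class HyperLogLog:
    """Minimal HyperLogLog sketch for approximate unique counting."""

    def __init__(self, p: int = 12):
        self.p = p                # precision: the sketch keeps 2^p registers
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, value: str) -> None:
        h = _hash64(value)
        idx = h >> (64 - self.p)               # top p bits choose a register
        rest = h & ((1 << (64 - self.p)) - 1)  # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in `rest`.
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self) -> float:
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m ** 2 / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Linear-counting correction for small cardinalities.
            return self.m * math.log(self.m / zeros)
        return raw

# Feed 100,000 distinct (hypothetical) identifiers; the sketch stores
# only 4,096 small registers, never the identifiers themselves.
hll = HyperLogLog(p=12)
for i in range(100_000):
    hll.add(f"user-{i}@example.com")
estimate = hll.count()  # close to 100,000, typically within a few percent
```

A sketch like this runs in a single pass over the data in bounded memory; the sample-based projection described above goes further by avoiding even one full pass, which matters under a 72-hour reporting deadline.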

Learn more about the Dataguise Data Discovery + Protection 8.0 release here

About the author:

Subra Ramesh, Senior Vice President, Products and Technology at Dataguise

Subra has 25 years of experience developing and architecting distributed systems and middleware products. You can check out his full bio here.