Discovering dirty data

No, this isn’t a post about a discovery assignment for the adult industry 😉

As a data-centric security provider focused on Hadoop and Big Data security, we see a lot of different types of data: classic data warehouse and archive data, online transaction and online analytic data (OLTP and OLAP), relational data in databases, CSV and Excel files, and increasingly, clickstreams, docs and PDFs, Internet of Things sensor data, and Twitter feeds.

Whenever we work with data, and especially when we are running discovery jobs, we like it clean: well formatted, schema-driven, highly regulated. Of course, our customers need discovery against all their data, and they need it most on their “gray data”. Gray data is not necessarily clean. It’s raw, co-mingled, noisy, machine generated, and occasionally even hypothetical.

Hadoop is increasingly a world of Gray Data

[Image: Dataguise Hadoop Gray Data]

From PwC’s Making Sense of Big Data report

Recently, we ran a POC that had lots of gray data, and some of it was simply “broken.” By broken, I mean that certain fields had been concatenated together. For example, a typical address might look like this:

383 S 22nd Street, San Jose CA 95116

But in our sample, there was a small difference:

383 S 22nd Street, San Jose CA95116

Notice the small change? Because the state and ZIP code were run together, our discovery engine failed to recognize the state, which lowered the confidence of some of the address discoveries. The engine performed as it should have. However, we debated to what degree malformed addresses should still be “discovered” at all.
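To make the effect concrete, here is a minimal Python sketch of component-based scoring. It is not our engine's actual logic: the patterns, the three-state sample list, and the 40/30/30 weights are all assumptions chosen for illustration. It only shows how a missing space can hide the state from a strict, word-boundary pattern while the ZIP code is still found.

```python
import re

# Hypothetical component detectors and weights, for illustration only.
COMPONENTS = {
    "street": (re.compile(r"^\d+\s+[^,]+,"), 40),
    # Tiny sample of state codes; \b requires a word boundary on both sides.
    "state": (re.compile(r"\b(?:CA|NY|TX)\b"), 30),
    # Exactly five digits, not embedded in a longer run of digits.
    "zip": (re.compile(r"(?<!\d)\d{5}(?!\d)"), 30),
}


def address_confidence(text):
    """Return (score out of 100, names of the components that were recognized)."""
    found = [name for name, (pattern, _) in COMPONENTS.items()
             if pattern.search(text)]
    score = sum(COMPONENTS[name][1] for name in found)
    return score, found


clean_score, clean_found = address_confidence("383 S 22nd Street, San Jose CA 95116")
fused_score, fused_found = address_confidence("383 S 22nd Street, San Jose CA95116")

print(clean_score, clean_found)
print(fused_score, fused_found)
```

In this sketch the fused address loses only the state component, because "CA95116" has no word boundary between the letters and the digits, so its score drops while the street and ZIP code still count.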

Fortunately, we won’t have to make these choices entirely on our own. We are building some new “fuzzy logic” into the discovery engine that will allow for a degree of customer tuning, so we can detect moderately malformed data and match “nearby” hits in gray data.
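As a rough idea of what tunable matching could look like, the Python sketch below accepts a state and ZIP code even when the separator is missing, but scores that "nearby" hit lower. The function name, the `fused_penalty` and `min_confidence` knobs, and the three-state sample list are hypothetical stand-ins for whatever settings a customer would actually tune; this is not the shipping implementation.

```python
import re

# Hypothetical pattern: a state code, an optional separator, then a 5-digit ZIP.
STATE_ZIP = re.compile(
    r"\b(?P<state>CA|NY|TX)(?P<sep>[\s,]*)(?P<zip>\d{5})(?!\d)"
)


def discover_state_zip(text, fused_penalty=30, min_confidence=60):
    """Return a hit dict, or None if nothing clears min_confidence.

    fused_penalty and min_confidence are assumed, customer-tunable knobs.
    """
    match = STATE_ZIP.search(text)
    if match is None:
        return None
    # A missing separator is treated as a "nearby" hit and scored lower.
    confidence = 100 if match.group("sep") else 100 - fused_penalty
    if confidence < min_confidence:
        return None
    return {
        "state": match.group("state"),
        "zip": match.group("zip"),
        "confidence": confidence,
    }


lenient = discover_state_zip("383 S 22nd Street, San Jose CA95116")
strict = discover_state_zip("383 S 22nd Street, San Jose CA95116", min_confidence=80)
clean = discover_state_zip("383 S 22nd Street, San Jose CA 95116")
print(lenient, strict, clean)
```

With the default threshold the fused address is still reported, just with reduced confidence; raising `min_confidence` makes the same record drop out, which is the kind of trade-off a customer could decide for their own gray data.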

Of course, some customers would like us to prune, clean, and fix dirty data, but that’s where we draw the line. We’re data “locks,” not data scrubs; we leave (that) dirty work to the MDM providers of the world!