Halloween has come and gone, but big scares lurk in the forecast for security officers, who should be frightened to the bone about the prospect of their Hadoop clusters getting hacked.
Every week seems to bring news of a major data breach. In just the past two months, Home Depot and Staples have suffered data breaches involving credit card data, while JPMorgan Chase was the victim of a cyber attack. Russian criminals with computer skills get the credit (or the blame) in many of these cases. But you can’t rule out disgruntled internal users or corrupt contractors, as was the case with Target’s massive breach disclosed earlier this year.
We’ll probably never know if Hadoop was hacked in any of these cases. Major corporations, as a rule, do not talk about their information security, and they don’t discuss their Hadoop implementations. So when Hadoop and security come together, their lips are doubly sealed.
But based on what we know about Hadoop and its complete lack of default security, the prospect of Hadoop getting hacked is something that chief information security officers (CISOs) should take very seriously.
“Hadoop itself is very weak in security. You can be a Linux user and take all the data from Hadoop,” says Manmeet Singh, co-founder and CEO of Dataguise, a provider of data masking and encryption tools for Hadoop. “The problem is the insider threat. Anybody can walk away with billions of credit card numbers.”
Apache Hadoop itself was developed in the open source arena with little to no regard for security. As Hadoop started to go mainstream and people started to become aware of Hadoop’s security problems, the distributors and the Apache community picked up the pace and started building security add-ons for various aspects of security, such as access control and authentication (Apache Knox), authorization (Apache Sentry), encryption (Cloudera’s Project Rhino), and security policy management and user monitoring (the proposed Apache Argus based on Hortonworks’ XA Secure acquisition).
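To give a sense of how bare the defaults are: out of the box, Hadoop uses "simple" authentication, meaning it trusts whatever username the client presents. Switching on real authentication starts with a couple of properties in core-site.xml, sketched below (standard property names; a full Kerberos setup involves considerably more, including keytabs and principal configuration):

```xml
<!-- core-site.xml: Hadoop defaults to "simple" (trust-the-client) auth.
     Kerberos is the first step toward real authentication. -->
<property>
  <name>hadoop.security.authentication</name>
  <value>kerberos</value>  <!-- default is "simple" -->
</property>
<property>
  <name>hadoop.security.authorization</name>
  <value>true</value>      <!-- enable service-level ACL checks -->
</property>
```

Until something like this is in place, the insider threat Singh describes is not hypothetical: any local user can present any identity to the cluster.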
These open source projects have come a long way toward improving Hadoop’s out-of-the-box security, but they’re generally not mature enough for a corporation to rely on for production workloads. According to Singh, Hadoop distributors will lean on third-party vendors like Dataguise to provide enterprise-class security solutions for the foreseeable future.
“Platform companies want to stay away from security,” Singh says. “If they don’t, once the platform is brought in and security starts looking at it, the sale becomes very difficult for them, so they want partners to provide this solution for them.”
Jim Vogt, the CEO of Hadoop encryption provider Zettaset, agrees with that assessment. “There’s a data at rest encryption module [Hadoop Cryptographic File System] that just came out of the open source community,” Vogt says. “Hortonworks said it’s not ready for prime time, it’s not working right.” Instead, Hortonworks is partnering with Zettaset and offering its encryption-at-rest capabilities to HDP customers.
Encryption is just one of the capabilities covered by Zettaset’s flagship product, Orchestrator, which also provides high availability failover and cluster management capabilities. Concerns about security, availability, and automation tend to crop up when companies move their 10- to 20-node Hadoop proof of concept (POC) cluster into a production cluster with perhaps 100 to 200 nodes.
There’s a lot of low-hanging fruit for companies like Zettaset to grab as the Hadoop adoption curve begins to bend upward and to the right. “The funny thing about this market is everybody thinks this stuff is turnkey and it works–just upload it and use it,” Vogt says. “The real truth is a lot of this stuff is still not very automated.”
The vast majority of Hadoop clusters in the world are POCs at the moment. Recent research out of Wikibon suggests that, of the roughly 6,500 Hadoop clusters around the world, perhaps only 500 are in production with a paid Hadoop license. Zettaset and Dataguise each work with a number of POCs and are angling to get paid when they go live.
“There’s plenty to be done on the security side, quite frankly,” says Vogt, who is on his third security company. Security “is a living organism and continually needs to be adjusted and enhanced. Encryption is a life science unto its own and we think there’s plenty of meat there that’s going to take a lot of time to get into open source.”
Dataguise’s Singh is optimistic too, especially as people become more aware of Hadoop’s security limitations. “The mindset is changing,” he says. “People are giving consideration to security upfront. That was not happening six months back, but now we’re getting called in” at the beginning of the Hadoop planning stage.
Some of the best big data use cases involve companies in the retail, healthcare, and financial services industries. But these are among the most highly regulated areas of business, with regulations like PCI, HIPAA, Basel 3, and Dodd-Frank governing what companies can and can’t do with their data.
The first data security challenge companies face in Hadoop is figuring out what data they need to protect. Next, they need to decide which users should have access to which pieces of data. Lastly, they decide which protection mechanisms to use.
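The first step, discovering sensitive data, can be sketched as a pattern scan over records. The snippet below is a deliberately naive Python illustration (the regex and sample data are invented here; commercial discovery tools like Dataguise add Luhn validation, contextual analysis, and many more data types):

```python
import re

# Naive pattern for 16-digit card-like numbers with optional separators.
# Real discovery tools are far more rigorous than this.
CARD_RE = re.compile(r"\b(?:\d{4}[- ]?){3}\d{4}\b")

def find_sensitive(records):
    """Return (record_index, matched_value) pairs for card-like values."""
    hits = []
    for i, rec in enumerate(records):
        for m in CARD_RE.finditer(rec):
            hits.append((i, m.group()))
    return hits

sample = [
    "order 1189, card 4111-1111-1111-1111, shipped",
    "customer note: call back tomorrow",
]
print(find_sensitive(sample))  # flags only the first record
```

Only once you know where the regulated data lives can you sensibly decide who gets access and how to protect it.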
Singh says Dataguise’s claim to fame is its capability to provide role-based access to protected data within Hadoop, and to protect data using either encryption or masking.
“Companies want to mask or encrypt the data either before the data makes it [into Hadoop], or while it’s in there, and decrypt or unmask it based on user access,” he says. “If User A is looking at it, I want to decrypt it, but if User B is looking at it, he should not have access to it. And if there’s an unauthorized user, he should be pointed at a cluster that has a masked copy of the data.”
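The policy Singh describes can be sketched in a few lines of Python. The role name and masking scheme below are made up for illustration; Dataguise’s actual mechanism is proprietary, and a real system would encrypt rather than merely gate the value:

```python
# Hypothetical role-based reveal: authorized roles see plaintext,
# everyone else gets a masked copy of the value.
AUTHORIZED_ROLES = {"fraud_analyst"}   # assumption: invented role name

def mask_card(card):
    """Keep only the last four digits of a card number."""
    digits = [c for c in card if c.isdigit()]
    return "*" * 12 + "".join(digits[-4:])

def read_card(card, role):
    """Return plaintext for authorized roles, a masked copy otherwise."""
    if role in AUTHORIZED_ROLES:
        return card            # "User A": full access
    return mask_card(card)     # "User B": masked copy only

print(read_card("4111111111111111", "fraud_analyst"))  # full number
print(read_card("4111111111111111", "marketing"))      # ************1111
```

The design point is that the access decision lives next to the data, not in each application: every consumer goes through the same reveal-or-mask gate.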
If Hadoop is going to be the platform of choice for storing big data, then it needs to be secure. While Hadoop is not secure out of the box, there are plenty of tools available to help you achieve data security. The key is to take data security into account before beginning a Hadoop implementation, rather than rushing to implement it after you discover the vulnerability–or worse, after you’ve been breached.