Sharing is Important for Big Data Analytics

Recently I had the pleasure of sitting down with Paul Kent, SAS Vice President of Big Data, to get his thoughts on the vibrant Apache Hadoop ecosystem and discuss how SAS and Cloudera customers are benefiting from our joint engineering work. The following blog includes excerpts from that conversation:

David: When you think about Hadoop, what’s the first thing that pops into your mind?

Paul: It’s a place where unlimited data is available, and people are free to mix and match data as they hunt for patterns that can improve their customer service, catch more bad guys, or innovate products more quickly. It’s a place where data is shared across disparate groups and organizations that benefit from having access to their data at a scale they had only imagined previously.

The companies and people who have implemented Hadoop successfully, and there are a growing number of these, are early pioneers who took a bet on a promising technology platform and started to modernize their analytics environments in that direction. These early winners have become adept at sharing across several dimensions:

  1. Sharing the data
  2. Sharing the processing potential
  3. Sharing the new world with people whose habits are formed in the old one

David: Can you expound a bit more on this idea of sharing?

Paul: Yeah. Different parts of the business had different source systems. That’s what made it hard to unify the data. Enterprise data warehouses became popular because companies would have their sales data in one system and their production data in another; and maybe they acquired another company whose line-of-business production data is in yet another system. So the EDW (yes, another disparate platform) was the platform that unified all that data. The problem, however, was that the EDW was so inflexible and detached that it became yet another silo. And then up comes Hadoop, and it’s more flexible: it can tolerate adding new sources more quickly.

Kudu is another great example of sharing within the Hadoop community. HDFS may be good at A and B but not so good at C, so you will see the community embracing alternatives like Kudu. Sharing is the idea that if a particular file system isn’t cutting it for certain use cases, the open source community innovates and says, let’s try a better one.

David: Is Hadoop the biggest change you’ve seen to analytics over the past five years?

Paul: I think with Hadoop, there are really two things happening, and both are significant from an analytics perspective. First, there’s the replatforming. Five to seven years ago, this was expensive – big iron Unix boxes, pSeries from IBM, or Itanium from HP. And then there was multimillion-dollar SAN/NAS storage, and people were happy to get 500 terabytes of it. Hadoop cuts that cost by a huge factor because now companies don’t need to buy expensive mainframe-class boxes. Instead they just buy lots of little Intel boxes. And they’re not paying through the nose for storage.

The other thing that’s changed is that with cloud and the Internet of Things, so much more is measured these days, and people are starting to get serious about looking for patterns in that data across more and more diverse datasets. So there’s this appetite for doing analytics at a much larger scale. I don’t know which came first, that analysts wanted to or that they could.

David: Yeah. Kind of a chicken and egg scenario. What are some interesting ways SAS customers are using Hadoop today?

Paul: I go back to the sharing idea. At one of our mutual banking customers, the financial crimes unit first had this mandate to collect data from all systems so they could look for bad guys. But the marketing people wanted the same set of data to mine their customers so they could deliver better offers. That’s when departments start to realize: “Oh, we’re looking at the same data. Why not join forces and build one big system instead of two duplicate systems?” Now certainly the financial crimes guys will have data they’re allowed to read that the marketing team cannot.
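The idea that two groups share one platform yet see different slices of the data can be sketched as a simple read ACL. This is only an illustration with hypothetical group and dataset names; production Hadoop deployments typically delegate this to authorization tools such as Apache Sentry or Apache Ranger.

```python
# Minimal sketch of per-group read access on a shared data platform.
# Group and dataset names are hypothetical, for illustration only.
READ_ACL = {
    "transactions": {"financial-crimes", "marketing"},  # shared dataset
    "fraud-cases": {"financial-crimes"},                # crimes unit only
}

def can_read(group: str, dataset: str) -> bool:
    """Return True if the given group is allowed to read the dataset."""
    return group in READ_ACL.get(dataset, set())
```

Both teams query the same `transactions` data, but only the financial crimes unit can read `fraud-cases` — one system, two views.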

We’re helping organizations like these mature in their thinking about how to share the union of these two ideas instead of building them independently and wasting cycles. The ability to do this cost-effectively is still relatively new. As you know, our previous tendency was, “Oh yeah, we have a project. Let’s have a system just for that.”

David: You touched a bit on some of the challenges this shared data leads to, namely how to grant and restrict access to data.

Paul: It’s the “victims of your own success” problem. When you get good at sharing, you’re going to need to be as good or better at dealing with security and access controls. It stands to reason that if you put data in one place and start to share more, you make it more of an attractive target. You do have to be more rigorous about who shares what, and secure and encrypt the things you don’t mean to share.

Related: Learn more about Cloudera’s Information Security solutions for Hadoop

David: Last question. SAS has been working closely with Cloudera and the ecosystem at large to ensure its enterprise solutions are tightly integrated into the Hadoop fabric. Can you talk about some of the progress on that front?

Paul: We want customers who run SAS against more traditional systems to be able to run against Hadoop in much the same way, while taking advantage of many of the efficiencies and performance benefits we’ve already talked about. We already have a number of Hadoop users running our Access Engines, VA, VS, and Data Loader for Hadoop. We recently announced SAS Grid Manager for Hadoop, which uses YARN to mediate access to compute and memory.

Consider a formerly best-in-class analytics platform: high-end UNIX servers and robust SAN/NAS storage. The usage pattern is primarily running many single-process workloads (SAS, R, Python) against a single logical file system (although many SAN and NAS servers will aggregate this storage across many spindles and storage heads).

  • Could you run these same programs, but redirect the storage to HDFS, which has substantial cost advantages over traditional SAN/NAS infrastructure?
  • Could you run these processes on the same commodity Linux/X86 servers typically used to build your Hadoop cluster?
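The first bullet — same program, different storage — can be sketched with a small helper that leaves the analytics code unchanged and only swaps the configured storage URI. The paths and namenode address below are hypothetical, and actually reading from HDFS would go through a client library (for example, pyarrow’s HDFS support); this sketch just shows the redirection idea.

```python
from urllib.parse import urlparse

def resolve_storage(uri: str):
    """Split a storage URI into (scheme, path) so the same analytics
    job can target a local SAN/NAS mount or HDFS interchangeably."""
    parsed = urlparse(uri)
    scheme = parsed.scheme or "file"  # bare paths default to local storage
    return scheme, parsed.path

# The program logic stays identical; only the configured URI changes.
legacy = resolve_storage("/san/analytics/sales.csv")               # ("file", ...)
hadoop = resolve_storage("hdfs://namenode:8020/analytics/sales.csv")  # ("hdfs", ...)
```

The design point is that the workload never hard-codes where its data lives, so migrating from SAN/NAS to HDFS becomes a configuration change rather than a rewrite.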

What better way to bring the folks from the old world into the new than to give them an experience in the new world that approximates their previous generation of tools? They are free to try the new capabilities incrementally without suffering the downtime caused by a complete re-implementation.

SAS Grid Manager for Hadoop is our strategy to let you “teleport” your existing analytics workloads to the cost-friendly dynamics of the Hadoop ecosystem.

David: Thanks Paul. I appreciate your time.

The post Sharing is Important for Big Data Analytics appeared first on Cloudera VISION.
