The recent news from Wall Street about the merger of two of the original “big data” companies, Cloudera and Hortonworks (now called “Cloudera”), made a lot of noise across the Big Tech Finance marketplace, certainly here in Silicon Valley. But the real inflection point was six months ago, with the release of Apache Spark 2.3 and its native Kubernetes support. Thanks to this new release, everyone who is trying to make data make sense, from startups to enterprise companies and data scientists everywhere, has in Kubernetes a potent new tool to get the job done. It means that the cloud-native strategies that have made cloud computing so compelling (elastic capacity, modularized microservices, faster incremental development) now extend, through Kubernetes, to machine learning, data analytics, and the broadest spectrum of data processing, all with first-class support.
To help explain why this is such a big deal, let’s take a look at the background of these two megatrends: data science and Big Data. We’ll start with a look back at the rationale behind Spark and its architecture. Even if you’ve never gotten your hands dirty with Kubernetes, you need to know how these two great tastes go great together. (In another post, we will show you how you can make Kubernetes even more productive thanks to AWS.)
How we got here
You can’t really say that big data is big news. On the one hand, the supply of data keeps growing, whether through cheap and widely available output from mobile devices, RFID readers, and sensor networks, or through the huge pool of logs. On the other hand, the cost of data storage keeps dropping at an accelerating rate. It’s now to the point where there are not enough Greek letters to describe the orders of magnitude of growth in the bottomless oceans of bits.
More importantly, processing that data requires more and more horsepower, with distributed software architectures that run on thousands of servers at the leading edge of the tech arms race. The owners of those gigantic server farms are pretty much a who’s who of mega-tech: Google, Facebook, Apple, Amazon, Microsoft, and anyone who wants to be like them.
Arguably, it started in 2004 when Google published some important computer science papers, including one on a framework called MapReduce. The idea was to process and generate large data sets using a massively parallel distributed algorithm on clustered machines: you split the work up into pieces (the “map”), and then gather the results and reduce them into a successive data set for additional processing.
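
To make the split, gather, and reduce idea concrete, here is a minimal single-machine sketch in plain Python. It is not Google's or Hadoop's implementation; the function names (map_phase, shuffle, reduce_phase, word_count) and the sample text are made up for illustration, and on a real cluster each chunk would be mapped on a different machine.

```python
from collections import defaultdict
from functools import reduce


def map_phase(chunk):
    """Map: emit a (word, 1) pair for every word in one chunk of text."""
    return [(word.lower(), 1) for word in chunk.split()]


def shuffle(pairs):
    """Gather: group all emitted values by their key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups):
    """Reduce: collapse each key's list of values into a single count."""
    return {key: reduce(lambda a, b: a + b, values)
            for key, values in groups.items()}


def word_count(chunks):
    # Here the chunks are processed one after another; a cluster would
    # run map_phase on many machines at once (hypothetical setup).
    mapped = [pair for chunk in chunks for pair in map_phase(chunk)]
    return reduce_phase(shuffle(mapped))


# Illustrative input: two "blocks" of a larger file.
chunks = ["big data is big", "data science needs data"]
counts = word_count(chunks)
print(counts)
```

The output of one such pass is itself a data set, which is what lets you chain map and reduce steps one after another.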
Big data picked up steam when the Apache Software Foundation initiated a project called Hadoop. Hadoop had two parts: HDFS (the Hadoop Distributed File System) and a programming model that could crunch through the two steps of “map” and “reduce” again and again. Files get split into large blocks and distributed to a bunch of worker nodes in a cluster, which process the data in parallel and write the results back into the distributed storage system. It worked great for large batch jobs. But it could only work as fast as the I/O of the storage platform it ran on allowed. No matter how fast reads and writes from disk can go, doing the processing in memory is faster. And it was to that end that the Apache Spark project was created.
Spark has a data engine that is intended to be both general purpose and in-memory, working off a construct called the Resilient Distributed Dataset (RDD), a read-only set of data points spread out across server clusters. The team at UC Berkeley who began developing it in 2009 argued that Spark was two orders of magnitude faster than Hadoop, and was even 10x faster at processing disk-based data than MapReduce. Like all benchmarks, it involved some creativity, but it was still pretty impressive to a lot of developers looking for new, more nimble ways to crack the code on data processing. (And let’s face it, writing and maintaining data processing architecture in Hadoop was a pretty heavy lift, even if running it on JBOD did a lot to lower the cost of storage-heavy applications.)
Rethinking distributed data architecture
Parallel processing is better than serial processing. In MapReduce, distributing data around a cluster of separate machines and crunching through the algorithms by pushing the code to the data instead of vice versa was genuinely a breakthrough idea. Spark takes this concept to the next level with the following components:
- Spark Core:
- This is where Spark handles the basic machinery of distributed data processing: scheduling, task dispatching, and I/O operations, with an API built on the construct of the RDD. Think of Spark Core as a driver: you invoke a variety of operations in parallel, such as a map, a filter, a reduce, or other operations on an RDD. Spark then takes care of getting the work done in parallel across the cluster. The APIs are much more digestible, both because of their native language integration and because you can work in memory interactively, rather than having to push to disk whenever you want to do work. The five natively supported languages (R, SQL, Python, Scala, and Java) give you a very broad palette of tools to work with.
- Spark Streaming:
- If the limitation to big batches was what made Hadoop ungainly, this feature is probably the slickest part of Spark. It ingests data in mini- or even micro-batches, to perform operations and transformations on RDDs. What’s particularly handy is that you can use the same application code for either batch-mode work or continuous streaming analytics. This is what’s known as a Lambda architecture (not to be confused with the AWS serverless code function also called Lambda).
- Spark Machine Learning Library (MLlib):
- Another nice feature of Spark is its enormous selection of pre-built analytics algorithms, which are great to run in parallel. This was a big headache in Hadoop because you usually had to write the algorithms in Java and then execute them in batch mode. Spark libraries offer both statistical transformations (think summary statistics, correlations, stratified sampling, hypothesis testing) and more sophisticated operations typical of machine learning (classification, collaborative filtering techniques, cluster analysis methods, dimensionality reduction techniques, feature extraction, and optimization algorithms).
- Spark SQL:
- This is a component on top of Spark Core. It creates a semantic data abstraction, which Spark calls DataFrames, with support for either structured (e.g., tables) or semi-structured (e.g., JSON) data. You get SQL language support with command-line interfaces and an ODBC/JDBC server, plus a domain-specific language (DSL) for manipulating the DataFrame construct in Scala, Python, or Java.
- GraphX:
- An important complement to classic structured data analytics is graph processing, which is why Spark gives you the GraphX graph analysis engine for those kinds of analytics. As you recall, RDDs are read-only, so GraphX really won’t work for graph analytics on data that’s being updated. There are a couple of separate APIs that you can use for massively parallel graph algorithms (such as the classic PageRank).
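
To illustrate the RDD idea that runs through all of these components (a read-only, partitioned data set on which you invoke map, filter, and reduce), here is a toy single-machine sketch in plain Python. MiniRDD is a hypothetical stand-in invented for this example; it is not Spark's API, and it leaves out everything that makes Spark interesting at scale (lazy evaluation, lineage-based recovery, and actual distribution across a cluster).

```python
from functools import reduce


class MiniRDD:
    """Hypothetical, single-machine stand-in for an RDD (not Spark's API)."""

    def __init__(self, partitions):
        # Tuples keep every partition read-only, mirroring the RDD contract.
        self._partitions = tuple(tuple(p) for p in partitions)

    def map(self, fn):
        # Transformations never modify data in place; they return a new MiniRDD.
        return MiniRDD([fn(x) for x in p] for p in self._partitions)

    def filter(self, pred):
        return MiniRDD([x for x in p if pred(x)] for p in self._partitions)

    def reduce(self, fn):
        # Reduce inside each non-empty partition, then combine the partial
        # results. Raises TypeError if the whole data set is empty.
        partials = [reduce(fn, p) for p in self._partitions if p]
        return reduce(fn, partials)

    def collect(self):
        # Flatten all partitions back into one ordinary list.
        return [x for p in self._partitions for x in p]


# Two partitions, as if the data lived on two different workers.
numbers = MiniRDD([[1, 2, 3], [4, 5, 6]])
squares_of_evens = numbers.filter(lambda x: x % 2 == 0).map(lambda x: x * x)
total = squares_of_evens.reduce(lambda a, b: a + b)
print(total, numbers.collect())
```

Because each partition is reduced independently before the partial results are combined, the per-partition work is the part a real cluster can run in parallel. The original data set is untouched by the transformations, which is also why updating data in place does not fit this model.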
All in all, this is a pretty robust set of cool libraries and resources to unleash on your data. Next, you are going to need some kind of distributed storage system where all that data lives. In addition to HDFS, which Spark is compatible with, you can choose from the MapR file system, Apache Cassandra, OpenStack Swift Object Store, and our favorite, Amazon S3. Cluster management for distributing processing work is also going to need your attention, and in addition to Spark’s own native cluster manager, you could use either Hadoop YARN or Apache Mesos.
Here’s the catch: classically, these kinds of data processing workloads were built out with dedicated setups, as Hadoop did with job management by YARN. The rest of your cloud’s working nodes had to have their own separate control plane. That meant more fragmented management and administration, less flexibility in resource management, and so on. That is, until now.
The power of container orchestration for cloud-based data processing
Though the recent history of cloud architectures spawned a number of competing alternatives, Kubernetes has clearly emerged as the most popular and most preferred container orchestration engine. It builds on the flexibility and resilience of containers for packaging and running your applications in the cloud, and enhances them with smarter, more complete and flexible resource management, multi-tenancy, and so forth. True, it was not impossible to have Kubernetes look after distributed data processing tasks. However, the integration with Apache Spark makes this much, much easier.
For starters, Apache Spark workloads can use Kubernetes clusters directly, along with their services for multi-tenancy: namespaces, quotas, and so on. There are also some significant administrative benefits, such as a modular approach to authorization (“pluggable authorization”) and more manageable logging (so that you can treat the logs from the workload managed by the cluster as a single stream rather than fragmented OS logs from Docker).
Here’s another bonus: this new Spark upgrade doesn’t require that you make significant changes or install new container management infrastructure in your Kubernetes cluster. All you have to do is build a container image, set up access controls for the Spark application, and you’re ready to go.
From the perspective of the Kubernetes cluster, the native Spark application behaves like a custom controller, managing Kubernetes resources based on requests issued from the Spark scheduler. This can be more useful than running Spark in standalone mode on Kubernetes because you benefit from the cluster’s management of resource elasticity, plus its logging and monitoring. Generally, this is a more well-behaved approach to clustered container orchestration. Better yet, it’s consistent with all the rest of your containerized cloud workloads managed by Kubernetes.
It should look something like this:
- Use spark-submit to push a Spark application directly to a Kubernetes cluster.
- This then causes Spark to create a Spark driver that runs within a Kubernetes pod.
- The driver then spins up “executors” as Kubernetes pods; the driver connects to the “executors” and runs the Spark app code.
- When the app is done, the executor pods terminate and get cleaned up. However, the Spark driver pod does not go away. Rather, Kubernetes keeps it, along with its logs and other context, in a “completed” state and addressable through the Kubernetes API until that operational data is collected or otherwise cleaned up.
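
The first step in the list above can be sketched as follows. This small Python helper only assembles the spark-submit argument list; it does not run anything. The flags and configuration keys follow the pattern shown in the Spark 2.3 "Running on Kubernetes" documentation, but the helper name (build_spark_submit), the API server URL, the image name, and the jar path are all placeholders assumed for illustration.

```python
import shlex


def build_spark_submit(master_url, image, app_jar, main_class,
                       name="spark-pi", executors=2, service_account=None):
    """Assemble (but do not run) a cluster-mode spark-submit command."""
    args = [
        "bin/spark-submit",
        "--master", f"k8s://{master_url}",   # the k8s:// prefix selects the Kubernetes backend
        "--deploy-mode", "cluster",          # the driver itself runs in a pod
        "--name", name,
        "--class", main_class,
        "--conf", f"spark.executor.instances={executors}",
        "--conf", f"spark.kubernetes.container.image={image}",
    ]
    if service_account:
        # Optional: the service account the driver pod uses to create executor pods.
        args += ["--conf",
                 "spark.kubernetes.authenticate.driver.serviceAccountName="
                 f"{service_account}"]
    args.append(app_jar)
    return args


# All values below are placeholders, not real endpoints or images.
cmd = build_spark_submit(
    master_url="https://k8s-apiserver.example.com:6443",
    image="registry.example.com/spark:v2.3.0",
    app_jar="local:///opt/spark/examples/jars/spark-examples.jar",
    main_class="org.apache.spark.examples.SparkPi",
)
print(" ".join(shlex.quote(arg) for arg in cmd))
```

The local:// scheme tells Spark that the jar is already present inside the container image, so nothing has to be uploaded at submit time. To actually launch the job you would pass the list to something like subprocess.run from the root of a Spark distribution.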
The way Kubernetes works is that it requires users to supply images to deploy into containers within pods. It’s expected that the images are built to be run in a supported container runtime environment. Spark now ships with a script that can build a Docker image and push it to a registry for use with the Kubernetes backend.
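
As a rough sketch of that image step: in the Spark 2.3 distribution the script lives at bin/docker-image-tool.sh and takes a repository (-r), a tag (-t), and a build or push action. The Python helper below just assembles those two commands; the helper name (build_image_commands), the registry, and the tag are assumptions for illustration, and nothing is executed.

```python
def build_image_commands(repo, tag):
    """Return the build and push invocations of Spark's image tool as argument lists."""
    tool = "./bin/docker-image-tool.sh"  # path relative to the Spark distribution root
    return [
        [tool, "-r", repo, "-t", tag, "build"],  # build the image locally
        [tool, "-r", repo, "-t", tag, "push"],   # push it to the registry
    ]


# Placeholder registry and tag.
commands = build_image_commands("registry.example.com/my-team", "v2.3.0")
for command in commands:
    print(" ".join(command))
```

The image reference that results from the push is what you would then hand to spark-submit through the spark.kubernetes.container.image setting.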
Cloud-native computing and data processing now work hand in glove. Kubernetes already had real standout strength in cluster auto-scaling, scheduling, and cleanup across hybrid environments. Integration of native Spark support makes the two of them even more powerful together. It’s arguably the most powerful and flexible data processing you can get, with all the versatility you could want.
The big picture is that versatility is what really breaks the barrier between the “big” in big data and data science. The interactive, iterative, experimental bias of data science depends on making it easier and faster to adapt. Kubernetes delivers the fastest, best way to do it yet.