Not a day passes without someone tweeting or retweeting a blog on the virtues of Apache Spark. Not a week passes without an analyst examining the implications, or the hype, of Apache Spark in the big data landscape. And not a month passes without hearing effusive praise for Apache Spark from advocates at Meetups.
At a Memorial Day BBQ, an old friend proclaimed:
Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.
Spark, as a distributed data processing and computing platform, offers much of what developers desire, and much more. To the ETL application developer, Spark offers expressive APIs for transforming data and creating data pipelines; to the data scientist, it offers machine learning libraries; and to the data analyst, it offers SQL capabilities for queries.
In this blog, I summarize how you can get started, enjoy Spark’s delights, and embark on a quick journey to Learn, Try, and Do Spark on Open Enterprise Hadoop with a set of tutorials. But first, a peek into the past…
Look to the past
Incidentally, Java’s Dancing Duke just turned 20. I could not help but reflect on similar effusions among advocates and similar skepticism among analysts back then, when I was working at Sun Microsystems during Java’s formative stages.
The delight among Java developers to engage with expressive and extensible Java APIs—for utilities, data structures, threading, networking, and IO—was palpable; the Javadocs were refreshing; and the buzz infectious.
Because the Java language specification, the Java Virtual Machine (JVM), and the Java APIs abstract away the lower-level operating system on a target platform, Java developers worry less about the complexity of execution on that platform and concentrate more on how they operate on and transform data structures through concrete classes and access methods.
Similarly, big data developers seem to embrace Spark with equal passion and verve. They enjoy its expressive APIs; they like its functional programming capabilities; they delight in its simplicity, built on concepts such as RDDs, transformations, and actions, and on the additional components that run atop Spark Core. They expend less energy on low-level Hadoop complexity and spend more on high-level data transformation, through iterative operations on datasets expressed as RDDs.
Apples and Oranges?
True, Java is a programming language while Spark is a distributed data processing and computing engine, so I could be accused of comparing apples and oranges. But I’m not comparing functionality per se, nor core capabilities or design principles in particular. Java is an extensible language for writing web and enterprise applications and distributed systems, whereas Spark is an extensible ecosystem for distributed data processing.
Rather, I’m comparing their allure and attraction among developers; I’m underscoring what motivates developers to embrace a language or platform (or inflames their resistance to it): ease of use; ease of development and deployment; and unified APIs or SDKs in languages familiar to them.
Spark offers all that, and much more. With each rapid release, new features accrue. We will hear much of the road map at Spark Summit 2015 this week.
In the present
Spark on Apache Hadoop YARN enables deep integration with Hadoop and gives developers and data scientists two modes of development and deployment on the Hortonworks Data Platform (HDP).
In local mode, running on a single node on an HDP Sandbox, you can get started with a set of tutorials put together by my colleagues Saptak Sen and Ram Sriharsha.
- Hands on Tour with Apache Spark in Five Minutes. Besides introducing basic Apache Spark concepts, this tutorial demonstrates how to use the Spark shell with Python. Often, simplicity does not preclude profundity: in this simple example, a lot is happening under the hood, hidden from the developer by the interactive Spark shell. If you are a Python developer and have used the Python shell, you’ll appreciate the interactive PySpark shell.
- Interacting with Data on HDP using Scala and Apache Spark. Building on the concepts introduced in the first tutorial, this tutorial explores how to use Spark with a Scala shell to read data from an HDFS file, perform in-memory transformations and iterations on an RDD, iterate over the results, and then display them inside the shell.
- Using Apache Hive with ORC from Apache Spark. While the first two tutorials explore reading data from HDFS and computing in memory, this tutorial shows how to persist data as Apache Hive tables in ORC format and how to use SchemaRDDs and DataFrames. Additionally, it shows how to query Hive tables using Spark SQL.
- Introduction to Data Science and Apache Spark. Data scientists use data exploration and visualization to confirm a hypothesis, and machine learning to derive insights. As the first part of a series, this introductory tutorial shows how to build, configure, and use Apache Zeppelin and Spark on an HDP cluster.
Our commitment to Apache Spark is to ensure it is YARN-enabled and enterprise-ready, with security, governance, and operations, allowing deep integration with Hadoop and other YARN-enabled workloads in the enterprise, all running on the same Hadoop cluster and accessing the same datasets. At Spark Summit 2015 today, Arun Murthy, co-founder of Hortonworks, summed up why Hortonworks loves Spark.
Conclusion
As James Gosling put it 20 years ago in The Feel of Java:
By and large, it [Java] feels like you can just sit down and write code.
I feel, as presumably many others do with Spark, that you can just sit down, fire up a REPL (Scala or PySpark), prototype, interact, experiment, and visualize results quickly.
Spark has that feel to it; its APIs have that expressive and extensible nature; its abstractions and concepts mask the myriad complexities of the underlying execution details. Most importantly, it’s a developers’ delight.
Learn More
- Visit the Apache Spark Page
- Visit our Apache Spark Project Page
- Learn, Try, and Do on developer.hortonworks.com