
Apache Spark on Hadoop: Learn, Try and Do

Not a day passes without someone tweeting or re-tweeting a blog on the virtues of Apache Spark. Not a week passes without an analyst examining the implications, or the hype, of Apache Spark in the big data landscape. And not a month passes without an expressive effusion of advocates’ praise for Apache Spark at Meetups.

At a Memorial Day BBQ, an old friend proclaimed:

Spark is the new rub, just as Java was two decades ago. It’s a developers’ delight.

Spark, as a distributed data processing and computing platform, offers much of what developers desire and delight in—and much more. To the ETL application developer, Spark offers expressive APIs for transforming data and creating data pipelines; to the data scientist, it offers machine learning libraries; and to the data analyst, it offers SQL capabilities for queries.

In this blog, I summarize how you can get started, enjoy Spark’s delight, and embark on a quick journey to Learn, Try, and Do Spark on Open Enterprise Hadoop, with a set of tutorials. But first, a peek into the past…

Look to the past

Incidentally, Java’s Dancing Duke just turned 20. I could not help but recall similar effusions among advocates and similar skepticism among analysts two decades ago, when I was working at Sun Microsystems during Java’s formative stages.

[Image: Duke, the Java mascot]

The delight among Java developers to engage with expressive and extensible Java APIs—for utilities, data structures, threading, networking, and IO—was palpable; the Javadocs were refreshing; and the buzz infectious.

Because the Java language specification, the Java Virtual Machine (JVM), and the Java APIs abstract away the lower-level operating system of a targeted platform, Java developers worry less about the complexity of execution on that platform and concentrate more on how they operate on and transform data structures through concrete classes and access methods.

Similarly, big data developers seem to embrace Spark with equal passion and verve. They enjoy its expressive APIs; they like its functional programming capabilities; they delight in its simplicity—in concepts such as RDDs, transformations, and actions, and in the additional components that run atop the Spark Core. They expend less energy on low-level Hadoop complexity and spend more on high-level data transformation, through iterative operations on datasets expressed as RDDs.
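
To make that vocabulary concrete, here is a minimal sketch using Spark’s Java API (the tutorials below use the Scala and Python shells instead); the sample values are invented purely for illustration.

    import java.util.Arrays;

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class RddBasics {
      public static void main(String[] args) {
        SparkConf conf = new SparkConf().setAppName("RddBasics").setMaster("local[*]");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // A small dataset expressed as an RDD.
        JavaRDD<Integer> readings = sc.parallelize(Arrays.asList(58, 61, 72, 49, 80));

        // Transformations (filter) are lazy: nothing executes yet.
        JavaRDD<Integer> warm = readings.filter(t -> t > 60);

        // Actions (count) trigger execution.
        System.out.println("Readings above 60: " + warm.count());

        sc.stop();
      }
    }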

Apples and Oranges?

True, Java is a programming language and Spark is a distributed data processing and computing engine, so I could be accused of comparing apples and oranges. But I’m not comparing functionality per se, nor core capabilities or design principles in particular. Java is an extensible language for writing web and enterprise applications and distributed systems, whereas Spark is an extensible ecosystem for distributed data processing.

Rather, I’m comparing their allure and attraction among developers; I’m underscoring what motivates developers’ desire and delight (or inflames their resistance and rebuke) to embrace a language or platform: it is ease of use; ease of development and deployment; unified APIs or SDKs, in languages endearing and familiar to them.

Spark offers that—and much more. New features accrue with each rapid release, and we will hear much of the road map at Spark Summit 2015 this week.

In the present

Spark on Apache Hadoop YARN enables deep integration with Hadoop and allows developers and data scientists two modes of development and deployment on Hortonworks Data Platform (HDP).

In the local mode—running on a single node, on an HDP Sandbox—you can get started using a set of tutorials put together by my colleagues Saptak Sen and Ram Sriharsha.

  1. Hands on Tour with Apache Spark in Five Minutes. Besides introducing basic Apache Spark concepts, this tutorial demonstrates how to use the Spark shell with Python. Often, simplicity does not preclude profundity: in this simple example, a lot is happening behind the scenes and under the hood, but it’s hidden from the developer using an interactive Spark shell. If you are a Python developer and have used the Python shell, you’ll appreciate the interactive PySpark shell.
  2. Interacting with Data on HDP using Scala and Apache Spark. Building on the concepts introduced in the first tutorial, this tutorial explores how to use Spark with a Scala shell to read data from an HDFS file, perform in-memory transformations and iterations on an RDD, iterate over results, and then display them inside the shell (a minimal Java-API sketch of the same cycle follows this list).
  3. Using Apache Hive with ORC from Apache Spark. While the first two tutorials explore reading data from HDFS and computing in memory, this tutorial shows how to persist data as Apache Hive tables in ORC format and how to use SchemaRDD and DataFrames. Additionally, it shows how to query Hive tables using Spark SQL.
  4. Introduction to Data Science and Apache Spark. Data scientists use data exploration and visualization to confirm a hypothesis, and machine learning to derive insights. As the first part of a series, this introductory tutorial shows how to build, configure, and use Apache Zeppelin and Spark on an HDP cluster.
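
As a companion to the second tutorial, here is a minimal sketch of the same read-transform-inspect cycle using Spark’s Java API rather than the Scala shell; the HDFS path and the comma-delimited format are placeholders, not the tutorial’s actual dataset.

    import org.apache.spark.SparkConf;
    import org.apache.spark.api.java.JavaRDD;
    import org.apache.spark.api.java.JavaSparkContext;

    public class HdfsQuickLook {
      public static void main(String[] args) {
        // The master is supplied by spark-submit (e.g. on an HDP cluster or Sandbox).
        SparkConf conf = new SparkConf().setAppName("HdfsQuickLook");
        JavaSparkContext sc = new JavaSparkContext(conf);

        // Hypothetical HDFS path; substitute the file used in the tutorial.
        JavaRDD<String> lines = sc.textFile("hdfs:///tmp/littlelog.csv");

        // In-memory transformations on the RDD: drop empty lines, split into fields.
        JavaRDD<String[]> rows = lines
            .filter(line -> !line.isEmpty())
            .map(line -> line.split(","));

        // An action pulls a few results back to the driver for display.
        for (String[] fields : rows.take(5)) {
          System.out.println(String.join(" | ", fields));
        }

        sc.stop();
      }
    }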

Our commitment to Apache Spark is to ensure it’s YARN-enabled and enterprise-ready with security, governance, and operations, allowing deep integration with Hadoop and other YARN-enabled workloads in the enterprise—all running on the same Hadoop cluster, all accessing the same datasets. At Spark Summit 2015 today, Arun Murthy, co-founder of Hortonworks, summed up why Hortonworks loves Spark.

Conclusion

As James Gosling put it 20 years ago in The Feel of Java:

By and large, it [Java] feels like you can just sit down and write code.

I feel, as presumably many others do with Spark, that you can just sit down, fire up a REPL (Scala or PySpark), prototype, interact, experiment, and visualize results quickly.

Spark has that feel to it; its APIs have that expressive and extensible nature; its abstractions and concepts mask the myriad complexities of the underlying execution details. Far more importantly, it’s a developers’ delight.

Learn More


Filed under hadoop, spark

4 Easy Building Blocks for a Big Data Application – Part II

by Jules S. Damji

A few years ago, I was binging on TED Talks when I stumbled upon John Maeda’s talk “Designing for Simplicity.” An artist, a technologist, and an advocate for design in simplicity, he wrote ten laws that govern simplicity. I embraced those laws; I changed my e-mail signature to “The Best Ideas Are Simple”; and my business cards echo the same motto. And to some extent, the Continuuity Reactor’s four building blocks adhere to at least six of the ten laws of simplicity.


In the last blog, I discussed the four simple building blocks for a big data application and their equivalent operations on big data—collect, process, store, and query. In this blog, I dive deep into how I implement a big data application using the Continuuity Reactor SDK. The table below shows the equivalency between big data operations and Continuuity Reactor’s building blocks.

Operations as Logical Building Blocks

Big data operation → Reactor building block
Collect → Streams
Process → Flows & Flowlets (Processors)
Store → Datasets
Query → Procedures

But first, let’s explore the problem we want to solve, and then use the building blocks to build an application. For illustration, I’ve curtailed the problem to a small data set; however, the application can scale equally well to handle large data sets from live streams or log files, arriving at increased frequency.

Problem

A mobile app wants to display the minimum and maximum temperature of a city in California for a given day. Your backend infrastructure captures live temperature readings every half hour from all the cities in California. For this blog, we limit the temperature readings to only seven days. (In real life, this could be for all the cities, in all countries, around the world, every day, every year—that is a lot of data. Additionally, weekly, monthly, and yearly averages and trends could be calculated in real time or in batch mode.) The captured data are stored in and read from log files, although they could just as easily be ingested from a live endpoint such as http://www.weatherchannel.com.
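
Before mapping the problem onto Reactor’s building blocks, here is the core computation in plain Java, independent of any framework. The record shape (city, day, reading) and the class names are assumptions made for illustration; the real app derives this from the ingested log files.

    import java.util.HashMap;
    import java.util.Map;

    // Tracks the minimum and maximum reading seen for each (city, day) pair.
    public class MinMaxTracker {
      static final class MinMax {
        double min = Double.MAX_VALUE;
        double max = -Double.MAX_VALUE;
      }

      private final Map<String, MinMax> byCityAndDay = new HashMap<>();

      public void record(String city, String day, double reading) {
        MinMax mm = byCityAndDay.computeIfAbsent(city + "|" + day, k -> new MinMax());
        mm.min = Math.min(mm.min, reading);
        mm.max = Math.max(mm.max, reading);
      }

      // Returns {min, max} for the city and day, or null if no readings were seen.
      public double[] query(String city, String day) {
        MinMax mm = byCityAndDay.get(city + "|" + day);
        return mm == null ? null : new double[] { mm.min, mm.max };
      }
    }

The rest of this post shows how the same collect, process, store, and query steps become Streams, Flowlets, Datasets, and Procedures.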

The TemperatureApp big data application uses Streams to ingest data, transforms the data in a Flow and its Flowlets, stores the results in Datasets, and responds to mobile queries through Procedures. Below is the application architecture depicting how the various building blocks are interconnected.

[Diagram: TemperatureApp architecture, showing Streams feeding the Flow’s Flowlets, which write to Datasets that Procedures query]

Four Building Blocks = Unified Big Data Application

As I indicated in the previous blog, the application unifies all four building blocks. It’s the mother of everything—the glue that defines, specifies, configures, and binds the four building blocks into a single unit of execution.

In one Java file, you easily and seamlessly define the main application. In my case, I code our big data app in a single file, TemperatureApp.java. You simply implement the Java interface Application and its configure() method, as in the listing below.

[Code listing: TemperatureApp implementing the Application interface and its configure() method]

Ten lines of specification and definition code in the configure() method pretty much define and configure the building blocks of this big data app as a single unit of execution—a builder pattern and its intuitive methods make it simple!
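
Since the original listing is an image, here is a hedged reconstruction of what TemperatureApp’s configure() might look like. The builder-method names follow the Reactor 2.x Java SDK as I recall them, and the stream, dataset, and class names (other than RawFileFlow, which the text names) are assumptions; check the downloadable sources on GitHub for the exact code.

    import com.continuuity.api.Application;
    import com.continuuity.api.ApplicationSpecification;
    import com.continuuity.api.data.dataset.KeyValueTable;
    import com.continuuity.api.data.stream.Stream;

    public class TemperatureApp implements Application {
      @Override
      public ApplicationSpecification configure() {
        // One builder chain binds streams, datasets, flows, and procedures
        // into a single deployable unit of execution (a jar file).
        // Depending on the SDK release, additional no-op calls (e.g. for unused
        // MapReduce jobs or workflows) may also be required in the chain.
        return ApplicationSpecification.Builder.with()
            .setName("TemperatureApp")
            .setDescription("Min/max temperatures per city per day")
            .withStreams().add(new Stream("logStream"))
            .withDataSets().add(new KeyValueTable("temperatures"))
            .withFlows().add(new RawFileFlow())
            .withProcedures().add(new TemperatureProcedure())
            .build();
      }
    }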

Building Block 1: Streams

Think of Streams as entities that allow you to ingest large, raw datasets into your system, in real time or in batch mode. With the Java API, you can define Streams easily—just create them with a single API call.

    import com.continuuity.api.data.stream.Stream;

    Stream logStream = new Stream("logStream");
    Stream sensorStream = new Stream("sensorStream");

Another way to define, specify, and configure Streams is within the main application, as shown above. The argument (a name) passed to the Stream constructor uniquely identifies a Stream within an application. Associated with Streams are events, which are generated as data are ingested into a Stream; Stream events are then consumed by Flowlets within a Flow. To guarantee delivery to their consumers, namely Flowlets, these events are queued and persisted.

Building Block 2: Processors

Think of Flows & Flowlets as a Directed Acyclic Graph (DAG), where each node in the DAG is an individual Flowlet, the logic that will process, access, and store data into Datasets.

In our TemperatureApp.java above, I created a single Flow instance called RawFileFlow(). Within this instance, I created two Flowlets—RawFileFlowlet() and RawTemperatureDataFlowlet()—and connected them to Streams. The code below shows how a Flow is implemented, how Flowlets are created, and how Streams are connected to Flowlets. With just another ten lines of easy-to-read Java code in the configure() method, I defined another building block—again, simple and intuitive!

For brevity, I have included only the RawFileFlowlet() source. You can download the TemperatureApp sources from GitHub and observe how both Flowlets are implemented. Also, note in the source listing below how Java @annotations are used to indicate which underlying building blocks and resources are accessed and used.

[Code listing: the RawFileFlow and its RawFileFlowlet]
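
As with the previous listing, the original is an image, so here is a hedged sketch of the Flow’s likely shape. The FlowSpecification builder methods follow the Reactor SDK as I recall them, and the wiring (stream to first flowlet, first flowlet to second) is an assumption; the actual Flowlet sources are in the downloadable GitHub repository.

    import com.continuuity.api.flow.Flow;
    import com.continuuity.api.flow.FlowSpecification;

    public class RawFileFlow implements Flow {
      @Override
      public FlowSpecification configure() {
        // A small DAG of two flowlets fed by the stream defined in the application.
        return FlowSpecification.Builder.with()
            .setName("RawFileFlow")
            .setDescription("Reads raw temperature files and stores the readings")
            .withFlowlets()
                .add("rawFile", new RawFileFlowlet())
                .add("rawTemperature", new RawTemperatureDataFlowlet())
            .connect()
                .fromStream("logStream").to("rawFile")
                .from("rawFile").to("rawTemperature")
            .build();
      }
    }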

Building Block 3: Datasets

For seamless access to the underlying Hadoop ecosystem’s storage facilities, Datasets provide high-level abstractions over various types of tables. The tables’ Java API provides high-level read-write-update operations without requiring you to know how data is stored or where it’s distributed in the cluster. You simply read, write, and modify datasets with simple access methods. All the heavy lifting is delegated to the Continuuity Reactor application platform, which in turn interacts with the underlying Hadoop ecosystem.

For example, in this application I use a KeyValueTable dataset, which is implemented as an HBase hash map table on top of HDFS. The Reactor Java API abstracts and shields all the underlying HBase and HDFS interactions from you. You simply create a table and then read, write, and update datasets using its instance methods. In the flowlets above, RawFileFlowlet() and RawTemperatureDataFlowlet(), I use these high-level operations to store data. Additionally, these flowlets could store temperature data in a TimeSeriesTable so that daily, weekly, monthly, and yearly average temperatures could be calculated for any city, anywhere in the world, at any time.
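
As a hedged illustration of the dataset API described above, a flowlet fragment might read and write the KeyValueTable along these lines. The dataset name, the assumed "city,day,reading" record format, and the min/max merge left as a comment are all illustrative; the actual flowlet code is in the GitHub sources.

    import com.continuuity.api.annotation.ProcessInput;
    import com.continuuity.api.annotation.UseDataSet;
    import com.continuuity.api.data.dataset.KeyValueTable;
    import com.continuuity.api.flow.flowlet.AbstractFlowlet;

    public class RawTemperatureDataFlowlet extends AbstractFlowlet {
      // @UseDataSet asks the platform to inject the dataset declared in the app.
      @UseDataSet("temperatures")
      private KeyValueTable temperatures;

      // @ProcessInput marks the method that consumes records emitted upstream.
      @ProcessInput
      public void process(String record) {
        // Assumed record format: "city,day,reading", e.g. "Sunnyvale,Mon,72.5".
        String[] fields = record.split(",");
        String key = fields[0] + "|" + fields[1];            // city|day
        byte[] current = temperatures.read(key.getBytes());  // simple read...
        if (current == null) {
          // ...and write; the real flowlet would merge min/max before storing.
          temperatures.write(key.getBytes(), fields[2].getBytes());
        }
      }
    }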

Building Block 4: Procedures

Procedures allow external sources, such as mobile apps or web tools, to access datasets with synchronous calls. Most Procedures access transformed data, after the data have gone through a Flow. A Procedure exposes a method identified by a name and a list of arguments; the arguments are conveyed in a ProcedureRequest, and the response to the query is carried back in a ProcedureResponse.

In our app, the mobile app sends the method name (getTemperature), a day of the week, and a city as procedure arguments. Through the ProcedureResponse, the Procedure returns either the result or an error message.

Just like all other components, Procedures use Java @annotations to access underlying building blocks and resources. The listing below shows how our Procedure is implemented. And as with other building blocks we implemented above, Procedures can be instantiated as multiple instances, where each instance can run anywhere on the cluster—a simple way to scale!

[Code listing: the getTemperature Procedure]
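
Since the Procedure listing is also an image, here is a hedged sketch of its likely shape. The handler signature (a ProcedureRequest in, a responder producing the ProcedureResponse out) and the @Handle and @UseDataSet annotations follow the Reactor SDK as I recall them; the key scheme and the stored payload are assumptions carried over from the earlier sketches.

    import com.continuuity.api.annotation.Handle;
    import com.continuuity.api.annotation.UseDataSet;
    import com.continuuity.api.data.dataset.KeyValueTable;
    import com.continuuity.api.procedure.AbstractProcedure;
    import com.continuuity.api.procedure.ProcedureRequest;
    import com.continuuity.api.procedure.ProcedureResponder;
    import com.continuuity.api.procedure.ProcedureResponse;

    public class TemperatureProcedure extends AbstractProcedure {
      @UseDataSet("temperatures")
      private KeyValueTable temperatures;

      // Invoked when a client calls the method name "getTemperature"
      // with "city" and "day" arguments.
      @Handle("getTemperature")
      public void getTemperature(ProcedureRequest request, ProcedureResponder responder)
          throws Exception {
        String city = request.getArgument("city");
        String day = request.getArgument("day");
        byte[] value = temperatures.read((city + "|" + day).getBytes());
        if (value == null) {
          responder.error(ProcedureResponse.Code.NOT_FOUND,
              "No readings for " + city + " on " + day);
        } else {
          responder.sendJson(new String(value));
        }
      }
    }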

What Now?

Well, first, in order for you to try this app, you’ll need to do the following:

  1. Download the Continuuity Reactor and SDK from here.
  2. Follow instructions and install the local Reactor on your laptop.
  3. Read the Developer’s guide and documentation.
  4. Try and deploy some examples. (The newer version has excellent real life examples.)
  5. Download TemperatureApp from github.
  6. cd to <git directory>/examples/java/temperatures
  7. ant

Second,

  1. Follow the instructions and guidelines in the Continuuity Reactor Quickstart guide for deploying apps.
  2. Deploy the TemperatureApp.jar into the local Reactor.
  3. Inject the directory path to the temperature data files.
  4. Run or start the Flow.
  5. Make a query.

What’s Next?

The Hadoop 2.x ecosystem with the YARN framework, available from Hortonworks and Cloudera, has revolutionized and simplified building distributed applications. One is no longer limited to the MapReduce framework and paradigm; one can write any type of distributed application that takes advantage of the underlying Hadoop ecosystem.

I’ll explore some fine features in the next series.

As John Maeda said, “Simplicity is about living life with more enjoyment and less pain.” I programmed on the Continuuity Reactor platform with more enjoyment and less pain or frustration—and I hope you will too, if you take it for a spin.

Resources

http://www.continuuity.com

http://www.continuuity.com/developers/

http://ubm.io/1jSBiuS

http://www.continuuity.com/developers/programming


Filed under Big Data, Programming, Technology

4 Easy Building Blocks for a Big Data Application – Part I

by Jules S. Damji

Engineering Mandate in the Old Hadoop World

Two years ago, before I moved to Washington, DC, to pursue a postgraduate degree in Strategic Communication at Johns Hopkins University, I worked as a principal software engineer for a publishing software company, Ebrary (now part of ProQuest). We digitized books for academic libraries and provided a platform as a service to university libraries so that students and faculty could borrow, read, and annotate books on their mobile devices or laptops.

My director of engineering called an engineering strategy meeting, and our small group of five senior engineers huddled around a small table in a fishbowl conference room.

Animated, he said, “We have a new mandate, folks. We need to scale our book digitization ten-fold. We need to parallelize our digitization efforts so that we can digitize not 50 books a week, but five hundred or five thousand. As many as they [publishers] send us. We must revamp our infrastructure and adopt Hadoop clusters!”

And, of course, adapt to a distributed, scalable, and reliable programming paradigm—do all the digitization in parallel, as MapReduce jobs, on a 50-node cluster.

The effort to install, administer, monitor, and manage an Apache Hadoop cluster back then was not only a herculean task but also a stupendously steep learning curve. None of us had the operational experience; none of us, even remotely, had the MapReduce parallel programming experience of the folks at Yahoo, Facebook, or Google, who had pioneered and heralded an enormously powerful and highly scalable programming model that accounted for distributed storage, distributed computation, distributed large data sets, and distributed scheduling and coordination. Our discovery process was disruptive, instructive, and illuminative.

We abandoned our efforts and settled for managed AWS EC2 as our underlying infrastructure. Not because the Apache Hadoop ecosystem did not live up to its promise. (In fact, other companies with large engineering staffs, such as Netflix, were the showcase and social proof of Hadoop’s promise.) But because it was too complex to manage, and we were short on operational experience.

The New World of Hadoop and Big Data

But I wish I could relive and re-tackle the same problem today. After two years away, I’m back in Silicon Valley, and the Hadoop ecosystem has moved and improved at a lightning pace. Now we have “Big Data” companies that offer Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Cloud as a Service (CaaS). We have companies that have become the Red Hats of the Hadoop ecosystem—to name a notable few, Cloudera, Amazon, Rackspace, MapR, and Hortonworks. They have mitigated the pain, burden, and headache (for you as a developer, an administrator, a dev-ops engineer, or a startup founder) of managing a Hadoop cluster and infrastructure, and they have innovated additional components that complement the Hadoop ecosystem and render it a powerful distributed execution environment to handle, store, compute, aggregate, and process large datasets. In short, they have invented an aspirin for the IT administrator’s headaches.

My former big boss at Loudcloud (later Opsware), Marc Andreessen, once said at an all-hands meeting, “We want to be the operating system of the data center.” In a similar way, these Big Data infrastructure companies have successfully become the operating systems and execution environments of highly complex, immensely powerful, and incredibly fault-tolerant clusters of thousands of nodes.

So what’s the X-factor in the New Big Data World?

Yet one thing is missing: a developer-focused application platform that shields the complexity of the underlying Hadoop ecosystem, that offers an intuitive application programming interface with building blocks for distributed tasks, and that allows a developer to quickly and easily develop, deploy, and debug big data applications. What’s missing is a big data application server on which big data applications are deployed as a single unit of execution—a jar file, analogous to a war file in a J2EE application server.

Until now.

The new big data startup Continuuity, Inc. delivers all of the above. Its Reactor application platform shields the underlying Hadoop ecosystem and runs as middleware on top of an existing Hadoop deployment. Its Reactor Java SDK abstracts complex big data operations into programmable building blocks. And its local Reactor allows developers to quickly develop, debug, and deploy their big data apps as jar files.

It’s the X-factor in the new Big Data world, offering developer tools to build big data applications. The Continuuity CEO and cofounder Jonathan Gray seems to think so: “We’ve built middleware, or an application server, on top of Hadoop.”

“We’re a Hadoop-focused company. What makes us different from everyone else is that we’re a developer-focused company,” said Gray.

The Big Data Operations and the Four Reactor Building Blocks

Invariably, engineers perform four distinct operations on data, traditionally known as extract, transform, and load (ETL): they collect, process, store, and query it. Most programmers implement these ETL operations as batch jobs that run overnight and massage the data. That is no different today, except that the scale is enormous and the need is immediate (in real time).

The Continuuity Reactor SDK offers these operations as programmable building blocks. They translate easily into comprehensive Java interfaces and high-level abstract classes, allowing the developer to extend them, implement new methods, or override existing ones. Collectively, these classes can be used to implement the traditional ETL operations as scalable, distributed, real-time tasks, without you as a Java developer needing intimate knowledge of the underlying Hadoop ecosystem.

The goal is to empower the developer and to accelerate rapid development. A programmer focuses more on developing big data apps and less on administering the cluster of nodes. While the Continuuity application platform, interacting with the underlying Hadoop ecosystem, takes care of all the resource allocation and life-cycle execution, the software engineer focuses on programming the ETL operations on large data sets.

The table below shows the equivalency between big data operations and Reactor’s building blocks.

Operations as Logical Building Blocks

Big data operation → Reactor building block
Collect → Streams
Process → Flows & Flowlets (Processors)
Store → Datasets
Query → Procedures

Building Block 1: Streams

Think of Streams as entities that allow you to ingest large, raw datasets into your system, in real time or batch mode. With the Java API, you can define Streams. With the REST API, you can send data to a Stream. And Streams are then attached to another building block called Processors, which transform data.

Building Block 2: Processors

Processors are logical components composed of Flows, which in turn are made up of Flowlets. A Flow represents the journey of your ingested data as it is transformed within an application. Think of Flows & Flowlets as a Directed Acyclic Graph (DAG), where each node in the DAG is an individual Flowlet, the code and logic that processes, accesses, and stores data into Datasets.

Building Block 3: Datasets

For seamless access to the underlying Hadoop ecosystem’s storage facilities, Datasets provide high-level abstractions over various types of tables. The tables’ Java API provides high-level read-write-update operations without requiring you to know how data is stored or where it’s distributed in the cluster. You simply read, write, and modify datasets with simple access methods. All the heavy lifting is delegated to the Continuuity Reactor application platform, which in turn interacts with the underlying Hadoop ecosystem.

Building Block 4: Procedures

Naturally, after you have transformed and stored your data, you will want to query it from external sources. This is where Procedures, through their Java interfaces and classes, provide the capability to query your data, to scale your queries over multiple instances of Procedures, and to gather datasets and combine them into a single dataset. Procedures, like Flowlets, can access Datasets using the same access methods.

Four Building Blocks = Unified Big Data Application

The application is the mother of everything—it’s the glue that configures, binds, and combines all four building blocks into a single unit of execution.

Conceptually, the anatomy of a big data application, from the Continuuity Reactor’s point of view, is simple: it’s made up of the four building blocks, implemented as Java classes and deployed as a single unit of execution, a jar file. And this notion of simplicity pervades and prevails throughout the Reactor’s Java classes and interfaces.

An Anatomy of a Big Data Application

[Diagram: anatomy of a Reactor big data application, showing Streams feeding Flows & Flowlets, which access Datasets that Procedures query]

The diagram above depicts the logical composition of a Reactor big data application. Data are injected from external sources into Streams, which are connected to Processors—made up of Flows & Flowlets—which in turn access Datasets. Procedures satisfy external queries. Note that each building block, once implemented as Java classes, can have multiple instances, and each instance can run on a different node in the cluster, offering massive scalability.

So What Now?

In this blog, I introduced high-level concepts pertaining to Continuuity Reactor in particular and related them to big data operations in general. In the next blog, I’ll explore the mechanics of writing a big data application using the four Continuuity Reactor building blocks and their associated Java interfaces and classes.

Meanwhile, if you’re curious, peruse the Continuuity developer website, download the SDK, deploy the sample apps, read the well-written developer’s guide, and feel the joy of its simplicity, without worrying about the underlying Hadoop complexity.

It may surprise you, even delight you, as it did me!

Resources

  1. http://www.continuuity.com
  2. http://www.continuuity.com/developers/
  3. http://ubm.io/1jSBiuS
  4. http://www.continuuity.com/developers/programming


Filed under Big Data, Programming, Technology