
4 Easy Building Blocks for a Big Data Application – Part II

by

Jules S. Damji

A few years ago, I was binging on TED Talks when I stumbled upon John Maeda’s talk “Designing for Simplicity.” An artist, a technologist, and an advocate for simplicity in design, he wrote ten laws that govern simplicity. I embraced those laws; I changed my e-mail signature to “The Best Ideas Are Simple”; and my business cards echo the same motto. And to some extent, the Continuuity Reactor’s four building blocks adhere to at least six of the ten laws of simplicity.


In the last blog, I discussed the four simple building blocks for a big data application and what their equivalent operations are on big data—collect, process, store, and query. In this blog, I dive into how I implement a big data application using the Continuuity Reactor SDK. The table below shows the equivalency between big data operations and Continuuity Reactor’s building blocks.

Operations as Logical Building Blocks

Collect → Streams
Process → Flows & Flowlets
Store → Datasets
Query → Procedures

But first, let’s explore the problem we want to solve, and then use the building blocks to build an application. For illustration, I’ve curtailed the problem to a small data set; however, the application can equally scale to handle large data sets from live streams or log files, and with increased frequency.

Problem

A mobile app wants to display the minimum and maximum temperature of a city in California for a given day. Your backend infrastructure captures live temperature readings every half hour from all the cities in California. For this blog, we limit the temperature readings to only seven days. (In real life, this could be for all the cities, in all countries, around the world, every day, every year—that is a lot of data. Additionally, weekly, monthly, and yearly averages and trends could be calculated in real time or in batch mode.) The captured data are stored in and read from log files. However, they could just as easily be read or ingested from a live endpoint, such as http://www.weatherchannel.com.

The TemperatureApp big data application uses Streams to ingest data, transforms data in the Flow & Flowlets, stores data into Datasets, and responds to mobile queries through Procedures. Below is the application architecture, depicting how the various building blocks are interconnected.

[Diagram: TemperatureApp architecture—Streams feed the Flow & Flowlets, which store data into Datasets that Procedures query for the mobile app.]

Four Building Blocks = Unified Big Data Application

As I indicated in the previous blog, the application unifies all four building blocks. It’s the mother of everything—it’s the glue that defines, specifies, configures, and binds the four building blocks into a single unit of execution.

In one Java file, you easily and seamlessly define the main application. In my case, I code the main app in a single file, TemperatureApp.java. You simply implement the Java interface Application and its configure() method. For example, in the listing below I implement the interface.

[Code listing: TemperatureApp.java implementing the Application interface and its configure() method.]
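The original listing survives only as a screenshot, so here is a minimal sketch of what TemperatureApp.java’s configure() method might look like, assuming the builder-style ApplicationSpecification API of the 2014-era Reactor SDK. The import paths and builder method names are my best guesses and may differ from the actual SDK; RawFileFlow and TemperatureProcedure simply mirror the components described later in this post.

import com.continuuity.api.Application;
import com.continuuity.api.ApplicationSpecification;
import com.continuuity.api.data.dataset.KeyValueTable;
import com.continuuity.api.data.stream.Stream;

public class TemperatureApp implements Application {
  @Override
  public ApplicationSpecification configure() {
    // One builder chain declares and binds all four building blocks
    // into a single unit of execution (names are illustrative).
    return ApplicationSpecification.Builder.with()
      .setName("TemperatureApp")
      .setDescription("Minimum and maximum temperatures per city per day")
      .withStreams().add(new Stream("logStream"))
      .withDataSets().add(new KeyValueTable("temperatures"))
      .withFlows().add(new RawFileFlow())
      .withProcedures().add(new TemperatureProcedure())
      .build();
  }
}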

Ten Java lines of specification and definition code in the configure() method pretty much define and configure the building blocks of this big data app as a single unit of execution—a builder pattern and its intuitive methods make it simple!

Building Block 1: Streams

Think of Streams as entities that allow you to ingest large, raw datasets into your system, in real time or batch mode. With the Java API, you can define Streams easily. Just create them with a single API call.

import com.continuuity.api.data.stream.Stream;

Stream logStream = new Stream("logStream");
Stream sensorStream = new Stream("sensorStream");

Another way to define, specify, and configure Streams is within the main application, as shown above. The argument (a name) passed to the Stream constructor uniquely identifies the Stream within an application. Associated with Streams are events, which are generated as data are ingested into a Stream; Stream events are then consumed by Flowlets within a Flow. To ensure guaranteed delivery to their consumers, namely Flowlets, these events are queued and persisted.

Building Block 2: Processors

Think of Flows & Flowlets as a Directed Acyclic Graph (DAG), where each node in the DAG is an individual Flowlet, the logic that will process, access, and store data into Datasets.

In our TemperatureApp.java above, I created a single Flow called RawFileFlow(). Within this Flow, I created two Flowlets—RawFileFlowlet() and RawTemperatureDataFlowlet()—and connected them to Streams. The code below shows how a Flow is implemented, how Flowlets are created, and how Streams are connected to Flowlets. And with just another ten lines of easy-to-read Java code in the configure() method, I defined another building block—again, simple and intuitive!
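Since the Flow’s own code appears only in the screenshots, here is a rough sketch of RawFileFlow’s configure() method, assuming the FlowSpecification builder of the same SDK. The flowlet identifiers and the exact stream-to-flowlet wiring are guesses based on the description above; consult the GitHub sources for the real topology.

import com.continuuity.api.flow.Flow;
import com.continuuity.api.flow.FlowSpecification;

public class RawFileFlow implements Flow {
  @Override
  public FlowSpecification configure() {
    // Declare the two Flowlets and connect the Stream to the first one,
    // forming a small DAG: logStream -> rawFileFlowlet -> rawTemperatureDataFlowlet.
    return FlowSpecification.Builder.with()
      .setName("RawFileFlow")
      .setDescription("Parses raw temperature readings and stores them")
      .withFlowlets()
        .add("rawFileFlowlet", new RawFileFlowlet())
        .add("rawTemperatureDataFlowlet", new RawTemperatureDataFlowlet())
      .connect()
        .fromStream("logStream").to("rawFileFlowlet")
        .from("rawFileFlowlet").to("rawTemperatureDataFlowlet")
      .build();
  }
}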

For brevity, I have included only the RawFileFlowlet() source. You can download the TemperatureApp sources from GitHub and observe how both Flowlets are implemented. Also, note in the source listing below how Java @annotations are used to indicate which underlying building blocks and resources are accessed and used.

[Code listing: RawFileFlowlet.java, showing the @annotations that declare the building blocks and resources it uses.]
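As a stand-in for the screenshot, here is a sketch of what a Flowlet such as RawFileFlowlet might look like, assuming the AbstractFlowlet base class and the @UseDataSet and @ProcessInput annotations of the Reactor SDK. The record format ("city,day,temperature") and the key layout are invented purely for illustration; the real sources on GitHub will differ.

import java.nio.charset.StandardCharsets;

import com.continuuity.api.annotation.ProcessInput;
import com.continuuity.api.annotation.UseDataSet;
import com.continuuity.api.data.dataset.KeyValueTable;
import com.continuuity.api.flow.flowlet.AbstractFlowlet;
import com.continuuity.api.flow.flowlet.StreamEvent;

public class RawFileFlowlet extends AbstractFlowlet {

  // @UseDataSet tells the Reactor which Dataset this Flowlet needs;
  // the platform injects the instance at runtime.
  @UseDataSet("temperatures")
  private KeyValueTable temperatures;

  // Invoked for every event queued on the connected Stream.
  @ProcessInput
  public void process(StreamEvent event) {
    // Hypothetical record format: "city,day,temperature"
    String record = StandardCharsets.UTF_8.decode(event.getBody()).toString();
    String[] fields = record.split(",");
    String key = fields[0] + ":" + fields[1];   // e.g. "Sunnyvale:Monday"
    temperatures.write(key.getBytes(StandardCharsets.UTF_8),
                       fields[2].getBytes(StandardCharsets.UTF_8));
  }
}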

Building Block 3: Datasets

For seamless access to the underlying Hadoop ecosystem’s storage facilities, Datasets provide high-level abstractions over various types of tables. Each table’s Java API provides high-level read, write, and update operations, without your needing to know how data is stored or where it’s distributed in the cluster. You simply read, write, and modify datasets with simple access methods. All the heavy lifting is delegated to the Continuuity Reactor application platform, which in turn interacts with the underlying Hadoop ecosystem.

For example, in this application, I use a KeyValueTable dataset, which is implemented as an HBase hash-map-style table on top of HDFS. The Reactor Java API abstracts and shields all the underlying HBase and HDFS interactions from you. You simply create a table and then read, write, and update it using the instance methods. In the flowlets above, RawFileFlowlet() and RawTemperatureDataFlowlet(), I use these high-level operations to store data. Additionally, these flowlets could also store temperature data in a TimeSeriesTable so that daily, weekly, monthly, and yearly average temperatures could be calculated for any city anywhere in the world at any time.
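To make those instance methods concrete, here is a tiny, self-contained helper showing just the read and write calls against a KeyValueTable, assuming its byte-array read()/write() methods; the composite "city:day" key and the helper class itself are purely illustrative.

import java.nio.charset.StandardCharsets;

import com.continuuity.api.data.dataset.KeyValueTable;

public class TemperatureStore {
  private final KeyValueTable temperatures;

  public TemperatureStore(KeyValueTable temperatures) {
    this.temperatures = temperatures;
  }

  // Store a reading under a hypothetical composite "city:day" key.
  public void store(String city, String day, String reading) {
    byte[] key = (city + ":" + day).getBytes(StandardCharsets.UTF_8);
    temperatures.write(key, reading.getBytes(StandardCharsets.UTF_8));
  }

  // Read a reading back; returns null if the key has never been written.
  public String lookup(String city, String day) {
    byte[] key = (city + ":" + day).getBytes(StandardCharsets.UTF_8);
    byte[] value = temperatures.read(key);
    return value == null ? null : new String(value, StandardCharsets.UTF_8);
  }
}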

Building Block 4: Procedures

Procedures allow external sources, such as mobile apps or web tools, to access datasets with synchronous calls. Most Procedures access transformed data, after it has gone through the Flow. A Procedure exposes a named method with a list of arguments; the arguments are conveyed in a ProcedureRequest, and the response to the query is returned in a ProcedureResponse.

In our app, the mobile client sends the method name (getTemperature), a day of the week, and a city as procedure arguments. Through the ProcedureResponse, the Procedure returns either the result or an error message.

Just like all other components, Procedures use Java @annotations to access underlying building blocks and resources. The listing below shows how our Procedure is implemented. And as with other building blocks we implemented above, Procedures can be instantiated as multiple instances, where each instance can run anywhere on the cluster—a simple way to scale!

[Code listing: the Procedure that handles the getTemperature query.]
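In place of the screenshot, here is a hedged sketch of how such a Procedure might be written, assuming the AbstractProcedure base class, the @Handle annotation, and the ProcedureRequest/ProcedureResponder pair of the Reactor SDK. The class name TemperatureProcedure, the argument names, and the key format are illustrative.

import java.nio.charset.StandardCharsets;

import com.continuuity.api.annotation.Handle;
import com.continuuity.api.annotation.UseDataSet;
import com.continuuity.api.data.dataset.KeyValueTable;
import com.continuuity.api.procedure.AbstractProcedure;
import com.continuuity.api.procedure.ProcedureRequest;
import com.continuuity.api.procedure.ProcedureResponder;
import com.continuuity.api.procedure.ProcedureResponse;

public class TemperatureProcedure extends AbstractProcedure {

  // The same Dataset the Flowlets write to.
  @UseDataSet("temperatures")
  private KeyValueTable temperatures;

  // Handles queries such as getTemperature(day=Monday, city=Sunnyvale).
  @Handle("getTemperature")
  public void getTemperature(ProcedureRequest request, ProcedureResponder responder) throws Exception {
    String city = request.getArgument("city");
    String day = request.getArgument("day");
    byte[] value = temperatures.read((city + ":" + day).getBytes(StandardCharsets.UTF_8));
    if (value == null) {
      responder.error(ProcedureResponse.Code.FAILURE, "No readings for " + city + " on " + day);
      return;
    }
    responder.sendJson(new String(value, StandardCharsets.UTF_8));
  }
}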

What Now?

Well, first, in order for you to try this app, you’ll need to do the following:

  1. Download the Continuuity Reactor and SDK from here.
  2. Follow instructions and install the local Reactor on your laptop.
  3. Read the Developer’s guide and documentation.
  4. Try and deploy some examples. (The newer version has excellent real-life examples.)
  5. Download TemperatureApp from github.
  6. cd to <git directory>/examples/java/temperatures
  7. ant

Second,

  1. Follow the instructions and guidelines in the Continuuity Reactor Quickstart guide for deploying apps.
  2. Deploy the TemperatureApp.jar into the local Reactor.
  3. Inject the directory path to the temperature data files.
  4. Run or start the Flow.
  5. Make a query.

What’s Next?

The Hadoop 2.x ecosystem, with its YARN framework, available from distributions such as Hortonworks and Cloudera, has revolutionized and simplified building distributed applications. One is no longer limited to the MapReduce framework and paradigm; one can write any type of distributed application that takes advantage of the underlying Hadoop ecosystem.

I’ll explore some fine features in the next series.

As John Maeda said, “Simplicity is about living life with more enjoyment and less pain.” I programmed on the Continuuity Reactor platform with more enjoyment and less pain or frustration—and I hope you will too, if you take it for a spin.

Resources

http://www.continuuity.com

http://www.continuuity.com/developers/

http://ubm.io/1jSBiuS

http://www.continuuity.com/developers/programming


Filed under Big Data, Programming, Technology

4 Easy Building Blocks for a Big Data Application – Part I

by

Jules S. Damji

Engineering Mandate in the Old Hadoop World

Two years ago, before I moved to Washington, DC, to pursue a postgraduate degree in Strategic Communication at Johns Hopkins University, I worked for a publishing software company, Ebrary (now part of ProQuest), as a principal software engineer. We digitized books for academic libraries and provided a platform as a service to university libraries so students and faculty could borrow, read, and annotate books on their mobile devices or laptops.

My director of engineering called an engineering strategy meeting. The small group of five senior engineers huddled around a small table in a fishbowl conference room.

Animated, he said, “We have a new mandate, folks. We need to scale our book digitization by ten-fold. We need to parallelize our digitization efforts so that we can digitize not 50 books a week, but five hundred or five thousand. As many as they [publishers] send us. We must revamp our infrastructure and adopt Hadoop clusters!”

And, of course, adapt to a distributed, scalable, and reliable programming paradigm—do all the digitization in parallel, as MapReduce jobs, on a 50-node cluster.

The effort to install, administer, monitor, and manage an Apache Hadoop cluster, back then, was not only a herculean task but also a stupendously steep learning curve. None of us had the operational experience; none of us, even remotely, had the MapReduce parallel programming experience of the folks at Yahoo, Facebook, or Google, who had pioneered and heralded an enormously powerful and highly scalable programming model that accounted for distributed storage, distributed computation, distributed large data sets, and distributed scheduling and coordination. Our discovery process was disruptive, instructive, and illuminative.

We abandoned our efforts and settled on managed AWS and EC2 as our underlying infrastructure. Not because the Apache Hadoop ecosystem did not live up to its promise (in fact, companies with large engineering staffs, such as Netflix, were the showcase and social proof of Hadoop’s promise), but because it was too complex to manage and we were short on operational experience.

The New World of Hadoop and Big Data

But I wish I could relive and retackle the same problem today. After two years away, I’m back in Silicon Valley, and the world of the Hadoop ecosystem has moved and improved at a lightning pace. Now we have “Big Data” companies that offer Infrastructure as a Service (IaaS), Platform as a Service (PaaS), and Cloud as a Service (CaaS). We have companies that have become the Red Hats of the Hadoop ecosystem. To name a notable few: Cloudera, Amazon, Rackspace, MapR, and Hortonworks. They have mitigated the pain, burden, and headache (for you as a developer, an administrator, a dev-ops engineer, or a startup founder) of managing a Hadoop cluster and its infrastructure; they have innovated additional components that complement the Hadoop ecosystem and render it a powerful distributed execution environment to handle, store, compute, aggregate, and process large datasets. In short, they have invented an aspirin for the IT administrator’s headaches.

My former big boss at LoudCloud (Opsware), Marc Andreessen, once said at our all-hands meeting, “We want to be the operating system of the Data Center.” In a similar way, these Big Data infrastructure companies have successfully become the operating systems and execution environments of highly complex, immensely powerful, and incredibly fault-tolerant clusters of thousands of nodes.

So what’s the X-factor in the New Big Data World?

Yet one thing is missing: a developer-focused application platform that shields the complexity of the underlying Hadoop ecosystem, that offers an intuitive application programming interface as building blocks for distributed tasks, and that allows a developer to quickly and easily develop, deploy, and debug big data applications. What’s missing is a big data application server on which big data applications are deployed as a single unit of execution—a jar file, analogous to a war file in a J2EE application server.

Until now.

The new big data startup Continuuity, Inc. satisfies all of the above desirable features. Its Reactor application platform shields you from the underlying Hadoop ecosystem. It runs as middleware on top of an existing Hadoop ecosystem. Its Reactor Java SDK abstracts complex big data operations as programmable building blocks. And its local Reactor allows developers to quickly develop, debug, and deploy their big data apps as jar files.

It’s the X-factor in the new Big Data world, offering developer tools to build big data applications. The Continuuity CEO and cofounder Jonathan Gray seems to think so: “We’ve built middleware, or an application server, on top of Hadoop.”

“We’re a Hadoop-focused company. What makes us different from everyone else is that we’re a developer-focused company,” said Gray.

The Big Data Operations and the Four Reactor Building Blocks

Invariably, engineers perform four distinct operations on data—collect, process, store, and query—traditionally grouped under extract, transform, and load (ETL). Most programmers implement these ETL operations as batch jobs, which run overnight and massage the data. Today is no different, except that the scale is enormous and the need is immediate (in real time).

The Continuuity Reactor SDK offers these operations as programmable building blocks. They translate easily into comprehensive Java interfaces and high-level Java abstract classes, allowing the developer to extend them, implement new methods, or override existing ones. These classes can be used collectively to implement the traditional ETL operations as scalable, distributed, real-time tasks, without you as a Java developer needing intimate knowledge of the underlying Hadoop ecosystem.

The goal is to empower the developer and to accelerate rapid development. A programmer focuses more on developing big data apps and less on administering the cluster of nodes. While the Continuuity application platform, interacting with the underlying Hadoop ecosystem, takes care of all the resource allocation and life-cycle execution, the software engineer focuses on programming the ETL operations on large data sets.

The table below shows the equivalency between big data operations and Reactor’s building blocks.

Operations as Logical Building Blocks

Collect → Streams
Process → Flows & Flowlets
Store → Datasets
Query → Procedures

Building Block 1: Streams

Think of Streams as entities that allow you to ingest large, raw datasets into your system, in real time or batch mode. With the Java API, you can define Streams. With the REST API, you can send data to a Stream. And Streams are then attached to another building block called Processors, which transform data.

Building Block 2: Processors

Processors are logical components composed of Flows, which in turn are composed of Flowlets. A Flow represents the journey of how your ingested data are transformed within an application. Think of Flows & Flowlets as a Directed Acyclic Graph (DAG), where each node in the DAG is an individual Flowlet, the code and logic that will process, access, and store data into Datasets.

Building Block 3: Datasets

For seamless access to the underlying Hadoop ecosystem’s storage facilities, Datasets provide high-level abstractions over various types of tables. Each table’s Java API provides high-level read, write, and update operations, without your needing to know how data is stored or where it’s distributed in the cluster. You simply read, write, and modify datasets with simple access methods. All the heavy lifting is delegated to the Continuuity Reactor application platform, which in turn interacts with the underlying Hadoop ecosystem.

Building Block 4: Procedures

Naturally, after you have transformed and stored your data, you will want to query it from external sources. This is where Procedures, through their Java interfaces and classes, provide you the capability to query your data, to scale your queries over multiple Procedure instances, and to gather datasets and combine them into a single result. Procedures, like Flowlets, can access Datasets using the same access methods.

Four Building Blocks = Unified Big Data Application

The application is the mother of everything—it’s the glue that configures, binds, and combines all four building blocks into a single unit of execution.

Conceptually, the anatomy of a big data application, from the Continuuity Reactor’s point of view, is simple—it’s made up of the four building blocks, implemented as Java classes and deployed as a single unit of execution, a jar file. And this notion of simplicity pervades and prevails throughout the Reactor’s Java classes and interfaces.

An Anatomy of a Big Data Application

[Diagram: the anatomy of a Reactor big data application—Streams connected to Processors (Flows & Flowlets), which access Datasets; Procedures serve external queries.]

The diagram above depicts a logical composite of a Reactor big data application. Data are injected from external sources into Streams, which are then connected to Processors—made up of Flows & Flowlets—which access Datasets. Procedures satisfy external queries. Note that each building block, once implemented as a Java class, can have multiple instances, and each instance can then run on a different node in the cluster, offering massive scalability.

So What Now?

In this blog, I introduced high-level concepts pertaining to Continuuity Reactor (in particular) and related them to big data operations (in general). In the next blog, I’ll explore the mechanics of writing a big data application using the four Continuuity Reactor building blocks and their associated Java interfaces and classes.

Meanwhile, if you’re curious, peruse the Continuuity developer website, download the SDK, deploy the sample apps, read the well-written developer’s guide, and feel the joy of its simplicity, without worrying about the underlying Hadoop complexity.

It may surprise you, even delight you, as it did me!

Resources

  1. http://www.continuuity.com
  2. http://www.continuuity.com/developers/
  3. http://ubm.io/1jSBiuS
  4. http://www.continuuity.com/developers/programming


Filed under Big Data, Programming, Technology