by
Jules S. Damji
A few years ago, I was binging on TED Talks when I stumbled upon John Maeda’s talk “Designing for Simplicity.” An artist, a technologist, and an advocate for simplicity in design, Maeda wrote ten laws that govern simplicity. I embraced those laws; I changed my e-mail signature to “The Best Ideas Are Simple”; and my business cards echo the same motto. And to some extent, the Continuuity Reactor’s four building blocks adhere to at least six of the ten laws of simplicity.
In the last blog, I discussed the four simple building blocks of a big data application and their equivalent operations on big data—collect, process, store, and query. In this blog, I dive deep into how I implement a big data application using the Continuuity Reactor SDK. The table below shows the equivalence between big data operations and the Continuuity Reactor’s building blocks.
Operations as Logical Building Blocks

Collect → Streams
Process → Flows & Flowlets
Store → Datasets
Query → Procedures
But first, let’s explore the problem we want to solve, and then use the building blocks to build an application. For illustration, I’ve curtailed the problem to a small data set; however, the application can scale equally well to large data sets arriving from live streams or log files, and at higher frequency.
Problem
A mobile app wants to display the minimum and maximum temperature of a city in California for a given day. Your backend infrastructure captures live temperature readings every half hour from all the cities in California. For this blog, we limit the temperature readings to only seven days. (In real life, this could be for all the cities, in all countries, around the world, every day, every year—that is a lot of data. Additionally, weekly, monthly, and yearly averages and trends could be calculated in real time or in batch mode.) The captured data are stored in and read from log files. However, they could just as easily be read or ingested from a live endpoint, such as http://www.weatherchannel.com.
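Before any Reactor building blocks enter the picture, the core transformation is easy to state in plain Java: fold the half-hourly readings for a (city, day) pair into a running minimum and maximum. Here is a minimal, SDK-independent sketch; the class and method names are mine, not part of the Reactor API:

```java
import java.util.HashMap;
import java.util.Map;

public class TemperatureStats {

    // Running min/max for one (city, day) bucket.
    public static class MinMax {
        public double min = Double.POSITIVE_INFINITY;
        public double max = Double.NEGATIVE_INFINITY;

        void update(double reading) {
            if (reading < min) min = reading;
            if (reading > max) max = reading;
        }
    }

    private final Map<String, MinMax> buckets = new HashMap<>();

    // Record one half-hourly reading under a "city:day" key.
    public void record(String city, String day, double tempF) {
        buckets.computeIfAbsent(city + ":" + day, k -> new MinMax()).update(tempF);
    }

    public MinMax lookup(String city, String day) {
        return buckets.get(city + ":" + day);
    }

    public static void main(String[] args) {
        TemperatureStats stats = new TemperatureStats();
        stats.record("San Jose", "Mon", 58.3);
        stats.record("San Jose", "Mon", 71.2);
        stats.record("San Jose", "Mon", 64.0);
        MinMax mm = stats.lookup("San Jose", "Mon");
        System.out.println("min=" + mm.min + " max=" + mm.max);
    }
}
```

Everything the app does beyond this is plumbing: getting readings in, running this fold at scale, persisting the buckets, and answering lookups.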
The TemperatureApp big data application uses Streams to ingest data, transforms data in Flows & Flowlets, stores data in Datasets, and responds to mobile queries through Procedures. Below is the application architecture depicting the various building blocks interconnected.
Four Building Blocks = Unified Big Data Application
As I indicated in the previous blog, the application unifies all four building blocks. It’s the mother of everything—the glue that defines, specifies, configures, and binds the four building blocks into a single unit of execution.
In one Java file, you easily and seamlessly define the main application. In my case, I code our big data main app in a single file, TemperatureApp.java. You simply implement the Java interface Application and its configure() method, as I do in this listing.
Ten lines of Java specification and definition code in the configure() method pretty much define and configure the building blocks of this big data app as a single unit of execution—a builder pattern and its intuitive methods make it simple!
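To see why a builder pattern reads so well for this kind of specification, here is a toy builder in the same spirit. It is purely illustrative: the class and method names below are invented for this sketch and are not the Reactor's actual ApplicationSpecification API.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Toy application specification assembled through a fluent builder,
// in the spirit of (but not identical to) the Reactor's builder.
public class AppSpec {
    public final String name;
    public final List<String> streams;
    public final List<String> flows;
    public final List<String> datasets;
    public final List<String> procedures;

    private AppSpec(Builder b) {
        this.name = b.name;
        this.streams = Collections.unmodifiableList(b.streams);
        this.flows = Collections.unmodifiableList(b.flows);
        this.datasets = Collections.unmodifiableList(b.datasets);
        this.procedures = Collections.unmodifiableList(b.procedures);
    }

    public static class Builder {
        private String name = "";
        private final List<String> streams = new ArrayList<>();
        private final List<String> flows = new ArrayList<>();
        private final List<String> datasets = new ArrayList<>();
        private final List<String> procedures = new ArrayList<>();

        public Builder named(String n) { name = n; return this; }
        public Builder withStream(String s) { streams.add(s); return this; }
        public Builder withFlow(String f) { flows.add(f); return this; }
        public Builder withDataset(String d) { datasets.add(d); return this; }
        public Builder withProcedure(String p) { procedures.add(p); return this; }
        public AppSpec build() { return new AppSpec(this); }
    }

    public static void main(String[] args) {
        // Reads almost like the prose description of the app.
        AppSpec spec = new AppSpec.Builder()
            .named("TemperatureApp")
            .withStream("logStream")
            .withFlow("RawFileFlow")
            .withDataset("temperatures")
            .withProcedure("TemperatureProcedure")
            .build();
        System.out.println(spec.name + ": " + spec.streams + " " + spec.flows);
    }
}
```

Each fluent call adds one building block, and build() freezes the whole thing into a single immutable unit, which is exactly the property that makes the real configure() method so compact.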
Building Block 1: Streams
Think of Streams as entities that allow you to ingest large, raw datasets into your system, in real time or in batch mode. With the Java API, you can define Streams easily: just create them with a single API call.
import com.continuuity.api.data.stream.Stream;

Stream logStream = new Stream("logStream");
Stream sensorStream = new Stream("sensorStream");
Another way to define, specify, and configure Streams is within the main application, as shown above. The name passed to the Stream constructor uniquely identifies a Stream within an application. Associated with Streams are events, which are generated as data are ingested into a Stream; Stream events are then consumed by Flowlets within a Flow. To guarantee delivery to their consumers, namely Flowlets, these events are queued and persisted.
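The queue-and-consume contract between a Stream and its Flowlets can be modeled in a few lines of plain Java. This sketch only illustrates the ordering contract (events come out in the order they were ingested); it says nothing about how the Reactor actually persists events, and the class name is invented:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayDeque;
import java.util.Queue;

// Conceptual model of a Stream: ingested events are queued in order
// and handed to a consumer (a Flowlet, in Reactor terms).
public class ToyStream {
    private final String name;
    private final Queue<byte[]> events = new ArrayDeque<>();

    public ToyStream(String name) { this.name = name; }

    // Ingest one raw event body.
    public void ingest(String body) {
        events.add(body.getBytes(StandardCharsets.UTF_8));
    }

    // Consumer pulls the next event; null when the queue is drained.
    public String next() {
        byte[] e = events.poll();
        return e == null ? null : new String(e, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyStream logStream = new ToyStream("logStream");
        logStream.ingest("San Jose,Mon,58.3");
        logStream.ingest("San Jose,Mon,71.2");
        String event;
        while ((event = logStream.next()) != null) {
            System.out.println(event);
        }
    }
}
```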
Building Block 2: Processors
Think of Flows & Flowlets as a directed acyclic graph (DAG), where each node in the DAG is an individual Flowlet: the logic that processes incoming data and accesses and stores it in Datasets.
In our TemperatureApp.java above, I created a single Flow instance called RawFileFlow(). Within this instance, I created two Flowlets—RawFileFlowlet() and RawTemperatureDataFlowlet()—and connected them to Streams. The code below shows how a Flow is implemented, how Flowlets are created, and how Streams are connected to Flowlets. And with just another ten lines of easy-to-read Java code in the configure() method, I defined another building block—again, simple and intuitive!
For brevity, I have included only the RawFileFlowlet() source. You can download the TemperatureApp sources from GitHub and observe how both Flowlets are implemented. Also, note in the source listing below how Java @annotations indicate which underlying building blocks and resources are accessed and used.
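Stripped of the SDK, a two-stage Flow is just one processing step emitting records to the next. The sketch below mimics a parse-then-aggregate hand-off, like the one between the two Flowlets above, with direct method calls; in the real Flow the connection is declared in configure(), and all class names here are illustrative:

```java
import java.util.HashMap;
import java.util.Map;

// Two chained "flowlet"-style stages, sketched without the SDK.
public class ToyFlow {

    // Stage 1: parse a raw "city,day,temp" line and pass it on.
    public static class ParseFlowlet {
        private final AggregateFlowlet next;
        public ParseFlowlet(AggregateFlowlet next) { this.next = next; }

        public void process(String rawLine) {
            String[] parts = rawLine.split(",");
            next.process(parts[0], parts[1], Double.parseDouble(parts[2]));
        }
    }

    // Stage 2: fold readings into a per-(city, day) {min, max} pair.
    public static class AggregateFlowlet {
        public final Map<String, double[]> minMax = new HashMap<>();

        public void process(String city, String day, double temp) {
            minMax.merge(city + ":" + day, new double[]{temp, temp},
                (old, cur) -> new double[]{Math.min(old[0], cur[0]),
                                           Math.max(old[1], cur[1])});
        }
    }

    public static void main(String[] args) {
        AggregateFlowlet agg = new AggregateFlowlet();
        ParseFlowlet parse = new ParseFlowlet(agg);
        parse.process("San Jose,Mon,58.3");
        parse.process("San Jose,Mon,71.2");
        double[] mm = agg.minMax.get("San Jose:Mon");
        System.out.println("min=" + mm[0] + " max=" + mm[1]);
    }
}
```

Because each stage only knows the next stage's input contract, the platform is free to run multiple instances of either stage on different nodes, which is what makes the DAG model scale.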
Building Block 3: Datasets
For seamless access to the underlying Hadoop ecosystem’s storage facilities, Datasets provide high-level abstractions over various types of tables. Each table’s Java API provides high-level read, write, and update operations, without requiring you to know how data is stored or where it’s distributed in the cluster. You simply read, write, and modify datasets with simple access methods. All the heavy lifting is delegated to the Continuuity Reactor application platform, which in turn interacts with the underlying Hadoop ecosystem.
For example, in this application I use a KeyValueTable dataset, which is implemented as an HBase hash map table on top of HDFS. The Reactor Java API abstracts and shields all the underlying complex HBase and HDFS interactions from you. You simply create a table and then read, write, and update datasets using its instance methods. In the above flowlets, RawFileFlowlet() and RawTemperatureDataFlowlet(), I use these high-level operations to store data. Additionally, these flowlets could also store temperature data in a TimeSeriesTable so that daily, weekly, monthly, and yearly average temperatures could be calculated for any city, anywhere in the world, at any time.
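The level of abstraction a key-value Dataset offers can be captured with a toy stand-in: the caller sees only read and write, and where the bytes actually live (HBase over HDFS in the real platform, a HashMap here) is hidden behind the interface. The class below is a sketch for illustration, not the Reactor's KeyValueTable:

```java
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Map;

// Toy key-value table: same read/write surface a flowlet would use,
// with the storage backend swapped for an in-memory map.
public class ToyKeyValueTable {
    private final Map<String, byte[]> rows = new HashMap<>();

    public void write(byte[] key, byte[] value) {
        rows.put(new String(key, StandardCharsets.UTF_8), value);
    }

    public byte[] read(byte[] key) {
        return rows.get(new String(key, StandardCharsets.UTF_8));
    }

    // String convenience wrappers, mirroring how a flowlet might
    // store a "city:day" -> "min,max" row.
    public void write(String key, String value) {
        write(key.getBytes(StandardCharsets.UTF_8),
              value.getBytes(StandardCharsets.UTF_8));
    }

    public String readString(String key) {
        byte[] v = read(key.getBytes(StandardCharsets.UTF_8));
        return v == null ? null : new String(v, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        ToyKeyValueTable table = new ToyKeyValueTable();
        table.write("San Jose:Mon", "58.3,71.2");
        System.out.println(table.readString("San Jose:Mon"));
    }
}
```

Swapping the HashMap for a distributed store changes nothing for the caller, which is precisely the point of the Dataset abstraction.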
Building Block 4: Procedures
Procedures allow external sources, such as mobile apps or web tools, to access datasets with synchronous calls. Most Procedures access transformed data, after the data have gone through the Flow. A Procedure defines a handler identified by a method name and a list of arguments. The arguments are conveyed in a ProcedureRequest, and the response to the query is returned in a ProcedureResponse.
In our app, the mobile client sends the method name (getTemperature), a day of the week, and a city as procedure arguments. Through the ProcedureResponse, the Procedure returns either the result or an error message.
Just like all other components, Procedures use Java @annotations to access underlying building blocks and resources. The listing below shows how our Procedure is implemented. And as with other building blocks we implemented above, Procedures can be instantiated as multiple instances, where each instance can run anywhere on the cluster—a simple way to scale!
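The request/response shape of a Procedure can be sketched without the SDK: a method name and named arguments go in, and either a payload or an error comes back. The class and request type below are invented for illustration; they are not the Reactor's ProcedureRequest/ProcedureResponse API:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of a Procedure handler's request/response contract.
public class ToyTemperatureProcedure {

    // What a request carries, reduced to this app's needs.
    public static class ToyRequest {
        public final String method, city, day;
        public ToyRequest(String method, String city, String day) {
            this.method = method;
            this.city = city;
            this.day = day;
        }
    }

    // Stand-in for the transformed dataset the Flow populated.
    private final Map<String, String> table = new HashMap<>();

    public void store(String city, String day, String minMax) {
        table.put(city + ":" + day, minMax);
    }

    // Dispatch on the method name; return the payload or an error.
    public String handle(ToyRequest req) {
        if (!"getTemperature".equals(req.method)) {
            return "error: unknown method " + req.method;
        }
        String result = table.get(req.city + ":" + req.day);
        return result == null
            ? "error: no data for " + req.city + " on " + req.day
            : result;
    }

    public static void main(String[] args) {
        ToyTemperatureProcedure proc = new ToyTemperatureProcedure();
        proc.store("San Jose", "Mon", "min=58.3,max=71.2");
        ToyRequest query = new ToyRequest("getTemperature", "San Jose", "Mon");
        System.out.println(proc.handle(query));
    }
}
```

Because the handler is stateless apart from the dataset it reads, any number of instances can answer queries in parallel, which is the scaling property noted above.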
What Now?
Well, first, in order for you to try this app, you’ll need to do the following:
- Download the Continuuity Reactor and SDK from here.
- Follow instructions and install the local Reactor on your laptop.
- Read the Developer’s guide and documentation.
- Try and deploy some examples. (The newer version has excellent real-life examples.)
- Download TemperatureApp from github.
- cd to <git directory>/examples/java/temperatures
- ant
Second,
- Follow the instructions and guidelines in the Continuuity Reactor Quickstart guide for deploying apps.
- Deploy the TemperatureApp.jar into the local Reactor.
- Inject the directory path to the temperature data files.
- Run or start the Flow.
- Make a query.
What’s Next?
The Hadoop 2.x ecosystem, with its YARN framework, available in distributions from Hortonworks and Cloudera, has revolutionized and simplified building distributed applications. You are no longer limited to the MapReduce framework and paradigm; you can write any type of distributed application that takes advantage of the underlying Hadoop ecosystem.
I’ll explore some fine features in the next series.
As John Maeda said, “Simplicity is about living life with more enjoyment and less pain.” I programmed on the Continuuity Reactor platform with more enjoyment and less pain or frustration—and I hope that you will, too, if you take it for a spin.
Resources