Post by Damian Knopp
I recently had the chance to work with Apache Storm and have learned a great deal in the past month. I hope to share some of that with you in this introductory post.
Apache Storm is a distributed stream-processing framework. The development model is fairly simple to pick up, and I have some sample code on GitHub.
Here is the conceptual model:
1. A spout reads data from a streaming data source.
2. A bolt processes the data.
3. Bolts can be chained together, in much the same way as a chained Cascading or Hadoop MapReduce job.
4. Tuples are the data passed between spouts and bolts; a tuple is a set of named, typed fields.
5. Tuples are serialized in transit; Kryo is the default serializer.
6. Processing of a tuple is “acknowledged” as it flows through the system. This action is called an “ack”.
7. Vanilla Storm uses “at least once” semantics for processing its data, which gives it fault tolerance. Essentially this means that if a machine dies and no acks are sent for certain tuples, Storm will reschedule the work for those tuples.
8. Storm also comes with the Trident API for more transactional options, including “exactly once” semantics; I did not use this API.
9. Storm partitions work across workers, executors, and tasks.
10. A worker is a JVM process, an executor is a thread within a worker, and a task is a unit of work executed on an executor thread.
11. Nimbus is to Apache Storm what the JobTracker is to Apache Hadoop.
12. Workers and executors are balanced across machines by a Supervisor process, analogous to a TaskTracker in Hadoop.
13. Bolts and spouts have a grouping phase whereby tuples are routed further up the processing chain. Different options are available for this phase, including direct grouping to a specific bolt; shuffle grouping, which distributes tuples evenly and randomly across all bolts; fields (key) grouping, which routes by the value of a field; and local grouping, whereby data is aggregated locally and then sent upstream. In my mind this last one is similar to a combiner phase in Hadoop MapReduce.
14. Storm works well with Kafka.
15. Kafka has topics, which are like JMS queues; partitions, which are like shards that parallelize reads and writes; and offsets, which are pointers you increment and rewind as you read, or fail to read, records.
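To make items 1 through 4 concrete, here is a toy sketch in plain Java — no Storm dependency, and `spout`, `splitBolt`, and `countBolt` are illustrative names of my own, not Storm API. It mimics a spout feeding two chained bolts, with tuples represented as maps of named fields:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// A toy simulation of the spout -> bolt -> bolt chain described above.
// "Tuples" are maps of named fields; each bolt transforms and forwards them.
public class MiniTopology {

    // Spout: emits tuples from an in-memory "stream" (a real spout would
    // read from Kafka, a socket, etc.).
    static List<Map<String, Object>> spout(List<String> source) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (String line : source) {
            Map<String, Object> tuple = new LinkedHashMap<>();
            tuple.put("line", line);          // a named, typed field
            out.add(tuple);
        }
        return out;
    }

    // First bolt: split each line into word tuples.
    static List<Map<String, Object>> splitBolt(List<Map<String, Object>> in) {
        List<Map<String, Object>> out = new ArrayList<>();
        for (Map<String, Object> t : in) {
            for (String w : ((String) t.get("line")).split("\\s+")) {
                Map<String, Object> tuple = new LinkedHashMap<>();
                tuple.put("word", w);
                out.add(tuple);
            }
        }
        return out;
    }

    // Second bolt, chained after the first: count words.
    static Map<String, Integer> countBolt(List<Map<String, Object>> in) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        for (Map<String, Object> t : in) {
            counts.merge((String) t.get("word"), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> stream = List.of("to be", "or not to be");
        System.out.println(countBolt(splitBolt(spout(stream))));
        // prints {to=2, be=2, or=1, not=1}
    }
}
```

In real Storm the bolts run concurrently on executor threads and tuples are pushed one at a time, but the data flow is the same shape.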
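The “at least once” behavior in items 6 and 7 can also be sketched without Storm. In this toy model (my own, not Storm code) the spout keeps each tuple pending until it is acked; a tuple whose ack never arrives is replayed, which is why at-least-once delivery can produce duplicates:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Toy model of at-least-once delivery: every emitted tuple stays "pending"
// until the downstream bolt acks it; unacked tuples are replayed.
public class AckReplayDemo {

    // Process a batch; the bolt "crashes" once on the named tuple and fails
    // to ack it. Returns the tuples actually processed, possibly with
    // duplicates, because replay means at-least-once, not exactly-once.
    static List<String> run(List<String> tuples, String crashOnFirstTry) {
        Deque<String> pending = new ArrayDeque<>(tuples);
        List<String> processed = new ArrayList<>();
        boolean crashed = false;
        while (!pending.isEmpty()) {
            String t = pending.poll();
            processed.add(t);               // the bolt does its work...
            if (t.equals(crashOnFirstTry) && !crashed) {
                crashed = true;             // ...but dies before acking,
                pending.add(t);             // so the tuple is replayed
            }
            // otherwise: ack, and the tuple leaves the pending set
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(run(List.of("a", "b", "c"), "b"));
        // prints [a, b, c, b] -- "b" was processed twice
    }
}
```

Trident's “exactly once” options exist precisely to hide these replay duplicates from downstream state.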
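The fields/key grouping mentioned in item 13 boils down to hashing one field of the tuple to pick a task. A minimal sketch (the class and method names are mine, for illustration):

```java
// Toy sketch of fields (key) grouping: hash one field of the tuple to pick
// a task index, so every tuple carrying the same key value always lands on
// the same bolt task -- which is what makes per-key aggregation possible.
public class FieldGrouping {
    public static int taskFor(String fieldValue, int numTasks) {
        return Math.abs(fieldValue.hashCode() % numTasks);
    }

    public static void main(String[] args) {
        System.out.println(taskFor("user-42", 4)); // some task in [0, 4)
        System.out.println(taskFor("user-42", 4)); // the same task again
    }
}
```

Shuffle grouping, by contrast, would pick the task at random to balance load, with no key affinity.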
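Finally, the partition/offset idea in item 15 can be modeled as an append-only log plus a movable read pointer. This is a conceptual sketch of my own, not the Kafka client API:

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of one Kafka partition: an append-only log read through a
// consumer-held offset that advances on success and rewinds on failure.
public class PartitionLog {
    private final List<String> log = new ArrayList<>();
    private int offset = 0;                     // next record to read

    public void append(String record) { log.add(record); }

    public String read() { return log.get(offset); }

    public void commit() { offset++; }          // processed ok: advance

    public void rewind(int to) { offset = to; } // failed: re-read from 'to'

    public static void main(String[] args) {
        PartitionLog p = new PartitionLog();
        p.append("m0"); p.append("m1"); p.append("m2");
        System.out.println(p.read()); // m0
        p.commit();
        System.out.println(p.read()); // m1
        p.rewind(0);                  // e.g. downstream processing failed
        System.out.println(p.read()); // m0 again
    }
}
```

Because the consumer, not the broker, owns the offset, a Storm Kafka spout can rewind and replay on failure, which lines up with the at-least-once semantics above.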
So here is some sample code; notice that it runs Storm in local mode and does not need a cluster or a Storm installation to run:
I would like to wrap up with a few notable points:
While Apache Storm is easy to get started with, like many parallel-processing systems it can become difficult to debug quickly should things not work as expected, as was the case for our group. Still, Apache Storm is a leading tool and is seen as one of the measuring sticks for tools that ingest and analyze terabytes of data in near real time. Additionally, I would like to point out that one can write Scala on a project in a sane, maintainable way. Lastly, I would note that Apache Storm proved to play nicely in the Mesos resource-sharing environment.