For Machine Learning Classifiers, A Grade of ‘A’ Isn’t Enough

Post by Dean Wetherby

It’s the end of the school year. After the professor applies a curve, you find out that you got an A in a pretty tough engineering class. It doesn’t matter if you got 100% or just barely eked out a 90%. For classification tasks, however, an A is pretty good but not nearly good enough.

Lately I’ve been writing a few binary classifiers for aerial imaging applications. One example of a classifier is deciding whether a frame of video was taken above clouds that obscure the ground view. So I built a dataset consisting of cloudy and non-cloudy images, appropriately split into training, testing and validation sets. I trained my classifier, which yields 95% accuracy on the test data. That’s pretty good, right? As you can probably guess from the preceding paragraph, it really isn’t.

 

Let’s say we’re processing an aerial video that’s 10 minutes long and has a frame rate of 30 frames per second. That’s 10 min * 60 sec/min * 30 frames/sec, or 18,000 frames total. When we run our classifier on the video to detect cloudy and not-cloudy video frames, the classifier is going to categorize frames incorrectly approximately five percent of the time: some non-cloudy images will be predicted to be cloudy, and some cloudy images will be predicted to be not cloudy. 5% of 18,000 frames is 900 frames! That means we are misclassifying a whole lot of frames with an ‘A’ grade classifier.

Although there are numerous approaches to increasing the accuracy, precision and recall of your model, I bring up this issue in order to manage machine learning expectations. Even if the classifier were somehow adjusted to be 99% accurate, we would still be incorrectly classifying 180 frames in our 10-minute video. Missing this many frames could make users frustrated with the classifier. So why not train the model to be 100% accurate? At some point, labeling the training images as cloudy and non-cloudy becomes subjective. In the case of the cloud predictor, what do we do with partially cloudy images? How much of the frame should be covered in clouds before it is labeled cloudy? It’s this decision boundary that the model has difficulty with. If your classifier has a similarly challenging task, sometimes an ‘A’ is the best you can do.

Getting Started With Docker

Post by Michael Smolyak

Running Pre-built Docker Images

Running Docker containers requires the Docker runtime to be installed on the target machine. Docker runs natively on Linux, but can be installed on Mac OS and Windows using the VirtualBox and Boot2Docker tools.

Docker consists of the Docker daemon and the Docker client, which may be installed on the same machine or on different machines. In the latter case, the Docker client connects to the daemon remotely. The Docker client talks to the Docker daemon, which does the heavy lifting of building, running, and distributing your Docker containers.

The simplest way to use Docker is to connect to a Docker registry (e.g., Docker Hub), download an existing Docker image (e.g., MongoDB) and run it using the docker run command.

The first time you try to run an image, Docker will download it from the Docker registry. Subsequent startups will use the image cached locally. When instructed to run an image, Docker creates a container from that image and runs it. Docker images are read-only; a Docker container is a runtime instance of a corresponding image, which adds an execution environment to the image.

A container can be started in the foreground (interactive mode) or in the background. The former type of startup will typically give you a Linux command-line prompt for interacting with the running container. You can connect to a container running in background mode (a more common scenario) using the docker exec command. The docker run command has a large number of options for customizing the way the container interacts with its host environment.
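As a quick sketch of this workflow, the MongoDB image mentioned above could be started in the background and then attached to like this (the container name my-mongo is just an example):

    # First run: Docker pulls the mongo image from Docker Hub; later runs reuse the local cache.
    # -d starts the container in the background.
    docker run -d --name my-mongo mongo

    # Open a shell inside the running container.
    docker exec -it my-mongo bash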

Docker also provides commands for listing running containers (ps), listing installed images (images), inspecting running containers (inspect) and stopping a running container (stop or kill).
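In practice those management commands look like this (my-mongo is the hypothetical container from the previous example):

    docker ps                  # list running containers
    docker images              # list locally cached images
    docker inspect my-mongo    # show a container's detailed configuration
    docker stop my-mongo       # stop it gracefully (docker kill to force)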

This diagram describes Docker’s major moving parts.

Creating Your Own Images

To take full advantage of the power and flexibility of Docker, you have to build custom images. Here are a few basic rules for Docker image creation (a minimal example Dockerfile follows the list):

  • A new image is typically created from an existing Docker image (e.g., a CentOS or Ubuntu image); you can create an image from scratch, but there is rarely a need for that.
  • Docker uses scripts consisting of instructions from a small set of commands to build an image.
  • The script for image creation is placed in a file called Dockerfile.
  • The commands inside a Dockerfile allow you to:
    • Define the source image (e.g., some flavor of Linux) (FROM)
    • Run a Linux command in the image (e.g., yum install) (RUN)
    • Copy files from the host machine to the image (ADD or COPY)
    • Specify a startup command for the image (e.g., start the server) (CMD)
    • Map a directory inside the running container to a directory on the host (VOLUME)
    • Expose network ports from the running container (EXPOSE)
    • Label the image with metadata (LABEL)
  • Use the docker build command to create an image based on the Dockerfile.
  • The Docker image can be added to a Docker repository with the docker push command and downloaded from the repository with the docker pull command.
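To make these rules concrete, here is a minimal, hypothetical Dockerfile sketch; the Java application, file paths, port and registry name are assumptions for illustration, not details from the project:

    # Dockerfile (hypothetical example)
    # Define the source image
    FROM centos:7
    # Label the image with metadata
    LABEL maintainer="team@example.com"
    # Run a command in the image, cleaning up in the same layer
    RUN yum install -y java-1.8.0-openjdk && yum clean all
    # Copy the application from the host into the image
    COPY app.jar /opt/app/app.jar
    # Declare a mount point for data that should live on the host
    VOLUME /opt/app/data
    # Expose the port the service listens on
    EXPOSE 8080
    # Specify the startup command
    CMD ["java", "-jar", "/opt/app/app.jar"]

Such an image would then be built with docker build -t registry.example.com/myapp:1.0 . and published with docker push registry.example.com/myapp:1.0.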

I went through the steps necessary to configure the Docker runtime, connect to the PartShop Docker registry, build Docker images for the Squawker project and integrate Docker with Jenkins (to automate image creation).

Starting Apache Storm

Post by Damian Knopp

I recently had the chance to work with Apache Storm and have learned a great deal in the past month.  I hope to share some of that with you in this introductory post.

Apache Storm is a distributed stream-processing framework.  The development model is pretty simple to pick up, and I have some sample code on GitHub.

Here is the conceptual model:

0. An Apache Storm (http://storm.apache.org/) processing graph is called a “Topology”, similar to an assembly in Cascading (http://www.cascading.org/).

1. A spout reads data from a streaming data source.

2. A bolt processes the data.

3. Bolts can be chained together, in much the same way as a chained Cascading or Hadoop MapReduce job.

4. Tuples are the data passed between spouts and bolts; tuples consist of named, typed fields.

5. Tuples are serialized in transit; Kryo is the default serializer.

6. Processing of tuples is “acknowledged” as they flow through the system.  This action is called an “ack”.

7. Vanilla Storm uses “at least once” semantics for processing its data, which gives it fault tolerance.  Essentially this means that if a machine dies and no acks are sent for certain tuples, Storm will reschedule the work for those tuples.

8. Storm comes with the Trident API for more transactional options, including “exactly once” semantics; I did not use this API.

9. Storm partitions work across workers, executors and tasks.

10. A worker is a JVM, an executor is a thread, and a task is a unit of work executed on an executor thread.

11. Nimbus is to Apache Storm what the JobTracker is to Apache Hadoop.

12. Workers and executors are balanced across machines by a Supervisor process, analogous to a TaskTracker in Hadoop.

13. Bolts and spouts have a grouping phase whereby tuples are sent further up the processing chain.  Different options are available for this grouping phase, including direct grouping to specific bolts, shuffle grouping that distributes tuples randomly and evenly across all bolts, fields grouping that routes tuples by key, and local grouping whereby data is aggregated locally and then sent upstream.  In my mind this is similar to a combiner phase in Hadoop MapReduce.

14. Storm works well with Kafka.

15. Kafka has topics, which are like JMS queues; partitions, which are like shards that parallelize reads and writes; and offsets, which are pointers you can increment and rewind as you read or fail to read.

So here is some sample code; notice that it runs Storm in local mode and does not need a cluster or a Storm installation to run:

https://github.com/damianknopp/dmk-kafka-storm

I would like to wrap up with a few notable points:

While Apache Storm is easy to get started with, like many parallel processing systems it can become difficult to debug quickly should things not work as expected, as was the case for our group.  Still, Apache Storm is a leading tool and is seen as one of the measuring sticks for tools that ingest and analyze terabytes of data in near real time.  Additionally, I would like to point out that one can write Scala in a sane, maintainable way on projects.  Lastly, I would like to note that Apache Storm proved to play nicely in the Mesos resource-sharing environment.

 

Using Merge Requests to Improve Code Quality and Team Communication

Post by Michael Smolyak

 

I have long been aware of the use of Git merge requests as part of the development process. Last week I started using the technique myself and wanted to share the positive experience with my Next Century colleagues.

For those of you who are not well versed in this tool, a merge request is a Git technique for combining the code from a feature branch with the trunk. On many open-source projects it has to be used, since not everyone has commit privileges for the master branch. If you are not a committer, the only way for you to contribute to the project is to commit code to your feature branch and then request that the committers merge your code into the trunk.

This is not the case on my current project. All the team members can commit to any branch, so the use of merge requests is purely optional. Here is how it works: after I complete a feature, I create a merge request based on the feature branch and assign it to a co-worker, typically one familiar with the feature. He, in turn, gets a notification about the merge request assigned to him. In effect, I ask him to inspect the code in the feature branch and, when he is satisfied, merge it into the trunk.
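For reference, the command-line side of this flow is just ordinary feature-branch work; here is a minimal sketch with hypothetical branch and commit names (the merge request itself is created, reviewed and merged in GitLab, as described below):

    # Create a feature branch and do the work there
    git checkout -b feature/my-feature
    git add .
    git commit -m "Implement my feature"

    # Push the branch to GitLab; the merge request is then created from it and assigned to a reviewer
    git push -u origin feature/my-feature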

GitLab, the tool we use on the current project, has solid support for merge requests. Creating a request takes a single button click (GitLab shows you your feature branches and asks you whether you want to create merge requests from them). The GitLab pane displaying information about the merge request gives the reviewer information on the commits that are part of the merge request and, more to the point, shows all the changes made to the code on a per-file basis. The color-coded display shows all the relevant code snippets where lines were removed, added or modified.

Another very useful feature of the merge request screen is the ability to leave inline code comments. This allows the reviewer to attach comments to individual code lines, which are then immediately sent to the code author, who may respond to the comments on the same screen. Once the author makes the necessary code changes and pushes them to the feature branch, the merge request screen will reflect the updates and allow the reviewer to approve them.

The last step in this process is for the reviewer to merge the branch, optionally deleting the feature branch. The reason I find this back-and-forth between the reviewer and the developer so useful is that the incentives and tools of merge requests are well aligned to encourage good development practices such as code review, small feature branches and, most crucially, communication among the team members.

Code review is an integral part of the branch merging process – you as the reviewer do it not because there is a rule that says your team shall do code review, but because you are responsible for merging the feature branch into the trunk. By clicking the Merge button, you are certifying that you’ve reviewed the code and are satisfied with it. Merge requests encourage relatively small and short-lived feature branches, since creating a feature branch with hundreds of lines of changes means asking your co-worker to stop what he or she is doing and spend hours reviewing your code. And accomplishing the merge involves many informational exchanges between the author and the reviewer, fostering the dissemination of application knowledge among the team members.

If your project does not yet use merge requests, I would urge you to give the feature a try – you’ll be happy you did.

Next Century Engineers Mentor UMBC Computer Science Students

Post by Michael Smolyak

 

Working at Next Century afforded me a unique opportunity to interact with the Computer Science students studying Software Engineering at the University of Maryland, Baltimore County (UMBC). For a number of years my company has sponsored a 400-level class at UMBC giving its employees a chance to mentor the students. The goal of the partnership is to infuse students’ academic experience with the sense of a real-life software engineering project with its risks, uncertainties, deadlines, technology choices, demanding customers, stubborn bugs, crashing servers and other joys all too familiar to those of us in the trenches.

This semester, just like in years past, three Next Century employees worked closely with 16 students taking CMSC-447 (Software Engineering I), wearing the hats of customers and mentors. In addition, several Next Century engineers delivered presentations to the class on subjects of their respective expertise, including Agile Development, Web Development, User Experience, Testing and Security.

My role (as well as that of my colleague Josh Williams) was to interact with two teams of four students each: presenting the students with the software project to be completed over the course of the semester, approving intermediate deliverables dealing with product requirements, design and implementation, and providing feedback along the way, while at the same time trying to share with the students some of the experience and knowledge of web application development, risk mitigation, performance tuning, UI design and customer interaction strategies. In addition, Laurian Vega (a Senior User Experience Engineer) served as a UX mentor for all the teams.

The Open Baltimore data portal visualizing crimes from 2014

The class project Josh and I chose for the CMSC-447 students was to use the open Baltimore Crime Dataset to implement a Web application capable of assisting the police in visualizing and analyzing crime data. As proxy customers, we asked for three views of the crime data: a Map View overlaying the crime information on a map of Baltimore, a Table View with pagination, filters and sortable columns, and a Chart View displaying pie, bar and line charts for analyzing Baltimore crime statistics. Shown to the left are different visualizations that Open Baltimore already provides. The students had to review what was there and build their own implementation.


In mid-January, John McBeth (the President and CEO of Next Century), Josh Williams and I introduced ourselves and Next Century to the group of 16 UMBC juniors and seniors taking CMSC-447 and their professor, Dr. Karuna Joshi. John told the students about the company, its mission, its people and its services, while Josh and I described our concept for the application intended to help the Baltimore police visualize crime statistics. By that time the students had divided themselves into groups of four. Josh and I randomly picked two groups each.

The groups I supervised named themselves Next Millennium (apparently a play on our company name) and Big Bytes (likely emphasizing their appetite for knowledge). We interacted on a weekly basis. After the first introductory session over Google Hangouts, where we talked about the application they had to build, we spent several weeks exchanging documents. The students were tasked with producing a requirements document and a design document. During the early weeks of the class, the teams would send me their drafts and I would respond in a day or two with my critique of the structure or content of the documents. The students’ corrections would follow until the “customer” was satisfied with the requirements and design.

The culmination of the students’ design effort was the midterm team presentation of their application designs to Next Century at our headquarters. Josh and Laurian Vega (who added a UX voice missing from our team of mentors in prior years) travelled to UMBC a week before the presentation to answer students’ questions about application features and user interfaces. On the day of the presentation, 16 students dressed in business attire (many of them in suits) filed into the Next Century lobby, led by Dr. Joshi. The 90-minute presentation featured four distinct designs for the Baltimore crime data visualizer, with names like Baltimore Crimalyzer, Klystron-911 (a reference to the Klystron9 weather visualizer) and CRAB (Crime Rate Analysis in Baltimore). Every team received feedback from their professor and the Next Century mentors.

A group presenting their final project.

The weeks that followed the midterm presentation were dedicated to the implementation of the respective systems by the four teams. Each team chose the technologies for their application (JavaScript front ends, with back ends using MongoDB and MySQL databases and a variety of server-side technologies including Node.js, Python, PHP and Java). I had weekly interactions with my two teams over Google Hangouts, advising them on the use of technologies, UI features and performance considerations for their systems.

A week before the final presentation, Laurian, Josh and I again attended Dr. Joshi’s class in person to answer students’ final questions regarding their projects. Josh and I worked with our respective teams, while Laurian advised all four groups of students, helping them improve the user experience of their programs.

May 3rd was chosen as the date for the class to present their final projects to their professor and to Next Century management and mentors. The presentation was also attended by Deputy Chief Paul Dillon of the UMBC police. One of the teams had contacted Officer Dillon to solicit his feedback on their project, and upon learning about it I invited him to the final presentation to see the fruits of the students’ efforts.

All four teams made a good impression on their “customer”. The projects, different in visual design and overall feature set, all had the core functionality implemented according to the customer specifications. The apps loaded the Baltimore crime data from its online source into a local database server and created user interfaces allowing the customer to perform sophisticated data filtering by date range, crime type, crime weapon or location of the crime. The three views presenting the filtered data (as a heat map, as a data table and as a series of charts) were also implemented. At the midterm presentation John McBeth threw the teams a curve ball by asking them to implement date filtering using a histogram slider based on a custom component created on one of Next Century’s projects. All the teams came through and created some version of a date slider accompanied by a histogram showing the total number of crimes for every day of the selected period.

A group presenting their final project

Deputy Chief Dillon commented during one of the presentations that many of the products built by software development companies for police departments that he had seen in the past cost millions of dollars and did not have many of the features present in the students’ Web sites.

Klystron-911, one of the final projects, demonstrates its top filters, a time-based histogram, and a heatmap with clusters of all crime data
Baltimore Crimalizer, one of the final projects, demonstrates its first tab of data showing a heatmap and histogram.
Baltimore Crimalizer, one of the final projects, demonstrates its last tab showing a chart of crime type by time.

At the end of each team’s presentation, Marco DePalma (Next Century’s Chief Operating Officer) asked the team members what they had learned in the course of their Software Engineering class. Among the common themes of what the students learned were:

  • How to handle a situation where the technology stack for the assignment is not preselected by the professor, but has to be chosen by the team out of a large number of options
  • How to work as part of a team of developers, where each member works on a small part of the application, and how to integrate individual contributions into a single code repository
  • How to interact with a customer and incorporate customer feedback into the product being developed
  • How to trust a UX expert even if her advice contradicts one’s gut instinct

I enjoyed the experience of working with the college students. The best part of it was seeing the transformation of a group of strangers into a cohesive team of software developers and that of a mere idea of an application into a quartet of well-conceived and well-implemented web sites serving as a testament to students’ competence, creativity and perseverance.

Building Docker Images

Post by Damian Knopp

 

I recently had to navigate a data center that ran exclusively on Docker images with the Apache Mesos (http://mesos.apache.org/) and Marathon resource schedulers.  Doing so required me to become familiar with Docker quickly.  I appreciate the conversations on this blog about Docker, since I believe they helped me even if I did not know they would when I read the posts.  Through my travels I have been surprised by how much material has already been imported, but I do believe there is a learning curve.  Here is my attempt to keep the conversation going.

If you are just starting to learn to use Docker images, here is a cheat sheet listing a common Docker workflow: https://gist.github.com/damianknopp/9cf55959a4f403cfc314

However, in this post I really wanted to talk about building Docker images and some common practices.  I have some bare-bones Dockerfile scripts posted here: https://github.com/damianknopp/docker-build

If you look through them, here are a few tips on building Docker images (a short example Dockerfile pulling these tips together follows the list):

1. Use a minimal base to keep images small.  For example: FROM centos:latest

2. Dockerfile commands, for example COPY and RUN (https://docs.docker.com/engine/reference/builder/), each create a filesystem cache layer.  Cache layers are invalidated when you modify your Dockerfile at and below your point of change, so try to put commands that will not change often at the top of the file.  This will speed up your development.

3. Put multiple commands on one line and clean up on that same line, again to help the filesystem cache.  Notice I even clean the yum package cache after installing.  For example: RUN yum install -y java-1.8.0-openjdk wget gzip python python-setuptools && yum clean all

4. Docker containers do not use the init.d, systemd, or upstart initialization systems by default.  Some people use the Python supervisord in their place; I found it to be pretty handy to use.  Supervisord will restart processes if they are killed.

5. Docker commands, and logging in with docker exec bash, run as root.  Again, supervisord may help you run processes as a different user, but I didn’t set that up, and it is not as common a practice as you may think.  Browse Docker Hub for yourself to verify this assertion.

6. Ports are not exposed unless you use the EXPOSE command in your Dockerfile (https://docs.docker.com/engine/reference/builder/#expose) and/or use the -P/-p options at docker run time.
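Here is a small, hypothetical Dockerfile sketch that pulls these tips together; the Java application, the port, and the supervisord setup via easy_install are assumptions for illustration, not taken from my repository:

    # Dockerfile (hypothetical sketch)
    # Tip 1: start from a small, well-known base
    FROM centos:latest
    # Tips 2 and 3: keep the stable install step near the top, combine commands on one
    # line, and clean the yum cache in the same layer
    RUN yum install -y java-1.8.0-openjdk python python-setuptools && \
        easy_install supervisor && \
        yum clean all
    # Tip 4: run the application under supervisord instead of an init system;
    # supervisord.conf would define a [program] entry for the app with autorestart=true
    COPY supervisord.conf /etc/supervisord.conf
    COPY app.jar /opt/app/app.jar
    # Tip 6: document the service port; it still has to be published at run time
    EXPOSE 8080
    CMD ["/usr/bin/supervisord", "-n", "-c", "/etc/supervisord.conf"]

It would then be built and run with something like docker build -t docker-build-example . followed by docker run -d -p 8080:8080 docker-build-example, which publishes the exposed port on the host.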

That’s all for now, thanks for reading.