The Race Is On
The HiRISE project has developed a fairly significant amount of software. I’ve been privileged to play a part in that development, which continues even as we get deeper into the primary mission. So, rather than space science or operations, this post will discuss one of the nittier, grittier aspects of our work.
The processing pipelines have been introduced in earlier entries. Thanks to the efforts of HiRISE developers (mostly before my time with the project) these have provided a very solid foundation for our automated ground data system. There has been very little need for trouble-shooting or fine-tuning of the core software.
One issue that did come up earlier in PSP, however, was a strange failure that happened periodically, though not predictably. If you are a programmer, there is nothing so dreadful as a bizarre, non-repeatable bug… not counting Monday morning meetings, of course.
This particular bug happened when a new observation came down. Each processing pipeline needs to make a new directory, a new folder in the file system where it can store a log of its work. Starting with the transition phase, most of our pipelines have two or more instances running in parallel on a small Linux cluster, each working on part of the observation.
The bug was that, occasionally, the directory creation would fail. Someone from downlink would have to investigate and take corrective action or, if the way forward was unclear, call in somebody from the software development crew.
In this case, the problems seemed to happen downstream from “my” pipelines: I was blissfully unaware that the bug could affect HiDog and EDRgen as well.
Then on Thursday, Rich emailed to say that the workarounds put in place by two other developers seemed to have helped their pipelines; could I add them to my code as well?
A little shocked, I replied (no doubt too haughtily) that I’d rather understand the problem than apply some hasty workarounds. It had all the classic hallmarks of a multi-threaded problem: notoriously hard to repeat and difficult to debug, but always showing up when two processes were going after the same thing.
The particular bit of code (a subroutine shared by each pipeline to create a log directory) looked absolutely fine on first inspection. It is written in Perl and invokes a function called mkdir. That really should be no problem, I thought, unless perhaps there were a permissions problem. But this succeeded more often than not.
So that evening, I decided to try a little test: a small program that I’d run simultaneously in two windows. Each one would try to make the same directory, whose name would contain the current time in seconds. Since the mkdir operation should take much less than a second, there should be frequent “collisions”—if that was really what the problem was.
As a developer, you grow accustomed to the fact that almost nothing works right on the first try; you have to take small steps and constantly check yourself. So I first tried a test in just a single window. It looked like this:
% while 1
? perl -e 'print "mkdir failed!" unless(mkdir $ARGV[0]);' `date '+%S'`
? end
That should succeed every time, I thought, then I’ll move on to trying it in two terminal windows at the same time. But lo and behold, I pressed return and started getting many failure messages!
In a blinding flash of the obvious, I realized what I should have known: that mkdir returns false, indicating it failed to make the directory, when the directory has already been created. My test script was interfering with itself, succeeding when it was a new second and failing every other time.
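That behavior is easy to confirm in isolation. The following is a minimal sketch using the shell's mkdir command rather than Perl's mkdir function (both report failure for an existing directory); the scratch directory and variable names are made up for this illustration.

```shell
#!/bin/sh
# Scratch area for the demonstration (hypothetical location, not a HiRISE path).
work=$(mktemp -d)
demo_dir="$work/log"

first=ok
second=ok
mkdir "$demo_dir" || first=failed                # directory is new: succeeds
mkdir "$demo_dir" 2>/dev/null || second=failed   # directory exists: fails
echo "first: $first, second: $second"            # first: ok, second: failed

rm -rf "$work"
```

The second call fails even though nothing is actually wrong: the directory is there, which is what the caller wanted all along.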
This explained the bug. It was what is called a race condition. The condition in this case was the creation of a directory. Two pipelines were racing to create the same directory. Occasionally they were close enough that both were in the same subroutine at once, and both got past its initial check for an existing directory because neither had created it yet. One got to the mkdir step a fraction of a second ahead of the other, probably due to a slightly heavier load on the machine, or other variables, so the other one's mkdir failed.
Compounding the problem, the workarounds that had been added in several previous versions (retrying up to five times with random delays) failed to check properly. A single check after mkdir, to see whether the directory now exists, would remove the race condition. Testing this (in parallel) worked as expected.
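The check-after-mkdir idea can be sketched roughly as follows. This is not the HiRISE subroutine (which is Perl); make_log_dir is a hypothetical shell helper, and the scratch directory is invented for the demonstration. The point is only the ordering: attempt the mkdir first, and if it fails, count an existing directory as success.

```shell
#!/bin/sh
# make_log_dir is a hypothetical helper, not the project's actual code.
# Try to create the directory; if that fails, succeed anyway as long as
# the directory now exists (for example, another process created it).
make_log_dir() {
    mkdir "$1" 2>/dev/null || [ -d "$1" ]
}

work=$(mktemp -d)    # scratch area for the demonstration

# Two background jobs race to create the same directory.
make_log_dir "$work/log" & pid1=$!
make_log_dir "$work/log" & pid2=$!
wait "$pid1"; status1=$?
wait "$pid2"; status2=$?
echo "statuses: $status1 $status2"

rm -rf "$work"
```

Whichever job loses the race sees mkdir fail, then finds the directory in place and reports success, so neither caller needs retries or random delays.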
Or, more simply, call the Linux mkdir command with an option so that it does what I (and perhaps the other developers) thought it should: return true (success) if the directory exists or was created.
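The post does not name the option; my assumption is that it is -p, which is the standard mkdir flag with this behavior. A minimal sketch, again with an invented scratch directory:

```shell
#!/bin/sh
work=$(mktemp -d)    # scratch area for the demonstration

mkdir -p "$work/log"; created=$?     # directory is new
mkdir -p "$work/log"; existing=$?    # directory already exists
echo "created: $created, existing: $existing"

rm -rf "$work"
```

Both calls exit with status 0. Note that -p also creates any missing parent directories, and it still fails if the path exists but is not a directory.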
Tracking down a mystery like this is a fun part of the job. But only if it happens infrequently.

