HiRISE: High Resolution Imaging Science Experiment
The University of Arizona

Improvements to Daily Data Monitoring

All downlink tasks I perform follow a particular development path: (1) I practice and jot down manual procedures; (2) over time I attempt to automate subtasks using Perl or other tools as best I can (I am not a software developer); and (3) our talented software developers write software that automates the task completely, or at the very least speeds it up considerably. Of course, this is always just in time for me to be assigned new tasks!

One of my daily tasks is monitoring data quality and paying attention to missing or gapped raw image channel files (up to 28 channels per observation). My tools: our internal reporting website HiReport, Terminal, and Microsoft Excel. I look through a list of observations in my web browser, click on those that appear to be missing channels or are flagged as “INCOMPLETE”, and copy and paste information about the problematic channels into Excel. I then add some additional notes and take any required actions.

A few days ago I realized that after more than a year, I was still in the manual stage of monitoring data quality and not making good use of existing tools to help streamline the process. All that cutting and pasting was beginning to get ridiculous, even more so during a period of high data rates and an increasing number of observations.

When we receive a product with gaps, our automated processing pipelines usually process it just fine. Sometimes, however, there are pipeline failures. I keep a list of these failures and the actions I have taken to try to correct the problem. This generally means I have to manually process the file or repair its metadata header. Once the channel is repaired and/or recovered, I can feed it back into the remaining processing pipelines.

These gapped products include “_G” in their file name. In a spectacular “Duh!” moment, I realized that rather than hunting in my browser through a list of observations that might have gapped product names, and then copying and pasting any I find into Excel, I could instead just perform a command line search in Terminal for all raw data products with “_G” in their file name.

     % ls -1 */*_G*.DAT

Why didn’t I do this over a year ago!? I am embarrassed to say I have no idea.
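For anyone who would rather script the search than type it, here is a minimal Python sketch of the same idea. It assumes the raw products sit one directory below the current data directory, as the `*/` in the command above implies. The function name `find_gapped_products` is an illustrative assumption and not part of any HiRISE tool.

```python
from pathlib import Path


def find_gapped_products(root):
    """Return sorted paths of .DAT files one directory below `root`
    whose file names contain "_G" (the gapped-product marker)."""
    # Same pattern as the shell command: any subdirectory, then any
    # file name containing "_G" and ending in ".DAT".
    return sorted(Path(root).glob("*/*_G*.DAT"))


if __name__ == "__main__":
    for path in find_gapped_products("."):
        print(path)
```

Run from the same directory, this should list essentially the same files as the `ls` command, one per line.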

This search provides only a list of products with gaps. Missing channel products usually arrive as gapped channel products 48 hours later (due to the nature of the automated processes that handle new MRO data at JPL). In my spreadsheet, I scan through the missing channels and determine whether a gapped file has in fact arrived. If it has been longer than 48 hours, I consider the data lost for good and I “force” whatever observation data we have received through the pipelines. Of course, sometimes lost data is found or reprocessed at JPL long after the 48 hours have expired, and very rarely these new data will arrive. This requires reprocessing the entire observation so that the newly found data can be added in.
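The 48-hour rule described above can be written down as a small decision function. This is only a sketch under stated assumptions: the post does not say exactly when the clock starts, so the code assumes it starts when the channel is first noted as missing. The names `next_action` and `WAIT_WINDOW`, the three action labels, and the example date are all hypothetical.

```python
from datetime import datetime, timedelta

# The 48-hour window comes from the post; everything else is an assumption.
WAIT_WINDOW = timedelta(hours=48)


def next_action(first_noted_missing, now, gapped_file_arrived):
    """Suggest what to do with a channel that was noted as missing."""
    if gapped_file_arrived:
        # The channel showed up after all, as a gapped product.
        return "follow up on gapped file"
    if now - first_noted_missing > WAIT_WINDOW:
        # Longer than 48 hours: treat the data as lost and push through
        # whatever was received for the observation.
        return "force observation through pipelines"
    return "keep waiting"


# Made-up timestamp, for illustration only.
noted = datetime(2008, 1, 1, 9, 0)
print(next_action(noted, noted + timedelta(hours=50), False))
```

The rare case of data turning up long after the window has closed is not modeled here; as noted above, that case requires reprocessing the entire observation.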

A long time ago, our database specialist wrote a Perl script to create a daily list of missing and partial observations. For whatever reason, I stopped using this tool and then promptly forgot about it. Instead I started looking through the list of observations in my browser, clicking on those that were not complete, and manually figuring out which channels were missing. Making use of this missing EDRs (Experiment Data Records) tool is so much easier and faster. Again, “Duh!”

The output from these quick searches for gapped and missing products requires a little bit of tweaking to make it look nice in Excel, but then a quick sort merges the two lists into observation ID order. In no time at all, I have my list of observation products to follow up on for the day. My copying, pasting, and mouse clicking madness has been vastly reduced.
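The merge-and-sort step can also be sketched in a few lines of Python. The observation IDs, channel names, and the `merge_followups` function below are made-up placeholders; the real lists would come from the gapped-file search and the missing-observations script. The sketch also assumes observation IDs are fixed-width strings, so that plain text sorting puts them in the right order.

```python
def merge_followups(gapped, missing):
    """Combine two lists of (observation_id, channel) pairs into one list
    of (observation_id, channel, status) rows sorted by observation ID."""
    rows = [(obs, chan, "gapped") for obs, chan in gapped]
    rows += [(obs, chan, "missing") for obs, chan in missing]
    # Tuples compare element by element, so this orders rows by
    # observation ID first, then channel, then status.
    return sorted(rows)


# Placeholder data, for illustration only.
gapped = [("OBS_0003", "RED4_1"), ("OBS_0001", "RED2_0")]
missing = [("OBS_0002", "IR10_1"), ("OBS_0001", "BG12_0")]

for row in merge_followups(gapped, missing):
    # Tab-separated text generally pastes into spreadsheet columns.
    print("\t".join(row))
```

Keeping a status column in each row preserves which list a product came from after the two lists are interleaved.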

Every day I see the very worst data, generally caused by data transmission problems, but these data make up only a small percentage of all HiRISE data. To try to quantify this, I counted the number of channels I have reported with problems (they have gaps, were missing, were somehow corrupted, etc.) and divided this number by the total number of raw data files we have received (these are transformed into EDRs after being downloaded to Tucson). As of this morning, I have listed 5,392 channels with problems, and we have received 160,858 raw channel files. This is roughly 3% of our data, although most of these problem channels still contain some useful data. Even an observation with gaps or missing a channel or two is of potential use to a scientist. If too much data has been lost, then our targeting specialists might command HiRISE to try again in a later orbit.
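The arithmetic behind that figure is a single division, shown here as a short Python sketch using the two counts quoted above. The function name `problem_percentage` is just an illustrative choice.

```python
def problem_percentage(problem_channels, received_channels):
    """Percentage of received raw channel files reported with problems."""
    return 100.0 * problem_channels / received_channels


# Counts quoted in the post.
pct = problem_percentage(5392, 160858)
print(f"{pct:.2f}%")  # 3.35%
```

So "roughly 3%" is, a little more precisely, about 3.4%.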

Over time, we have developed many procedures for dealing with all sorts of problems. Now that I have sped up my daily data quality monitoring, I will have more time to improve these procedures, partially automate as much as I can, and provide suggestions to the software developers about tools that would make my job ever more efficient.

Note: My calculation of problematic data above is not rigorous and only a rough estimate. One flaw: I am counting missing channels in the numerator but not in the denominator.
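One way to see the effect of that flaw is to add the channels that never arrived to the denominator as well. The sketch below does this. It assumes every never-received channel was already counted as a problem in the numerator, and the value of 1,000 never-received channels is a made-up number used only to show the direction of the change, since the real count is not given here.

```python
def adjusted_problem_percentage(problem_channels, received_channels, never_received):
    """Problem percentage measured against all expected channels,
    i.e. received files plus channels that never arrived."""
    expected_channels = received_channels + never_received
    return 100.0 * problem_channels / expected_channels


# 5392 and 160858 are the counts from the post; 1000 is hypothetical.
unadjusted = adjusted_problem_percentage(5392, 160858, 0)
adjusted = adjusted_problem_percentage(5392, 160858, 1000)
print(f"unadjusted: {unadjusted:.2f}%  adjusted: {adjusted:.2f}%")
```

Because the denominator can only grow, the adjusted percentage is never larger than the unadjusted one, so the rough estimate slightly overstates the problem rate.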
