How do we ensure our exploration robots keep working after software updates?
Traditionally, space missions solve this problem by:
- Avoiding advanced software techniques that are seen as risky. Once upon a time, "advanced software techniques" included compilers. These days, the team might enforce coding conventions like not throwing exceptions, so you have to write explicit handlers to recover from every fault.
- Thorough code review by the mission's software engineering team. Changes are vetted by a change control board that decides whether the new functionality is worth the risk of making a change that could break the system.
- Automated software verification. This is everything from lint to special-purpose tools that will do things like detect possible race conditions in multi-threaded software.
- Testing in simulation. Sometimes these simulations are very elaborate. It could be a video game-like simulation of a detailed planetary surface environment for testing rover software, or a simulation of a system of valves and motors that injects faults like jammed valves or bad sensor data.
- Building a twin of the flight hardware that stays behind on the ground to support testing software updates in a realistic hardware environment.
These are all great techniques, but sometimes you don't have the resources to do all the testing you want. That's the situation we're facing on the NASA Planetary Lake Lander Project. Our robotic probe is a buoy sitting in a high-altitude lake near Santiago, Chile. It collects water quality, weather, and health data and beams daily updates back to us via a BGAN satellite link.
The lake is pretty remote. During the Southern Hemisphere summer we can send an engineer to service the probe about once a month, but during the winter the lake is snowed in. If we lose contact with the probe we could end up losing months' worth of data.
One of PLL's research goals is to use the probe to conduct adaptive science, where it intelligently chooses what measurements to take and what data to beam back based on scientific relevance. Ideally, we'll build the adaptive science system over the course of this year and deploy incremental updates to the probe as we bring new features online. But if any of the updates break our core software services that schedule activities and connect to the satellite network, there would be no way for us to remotely contact the probe to fix it.
We've hit on a simple technique we hope will help us recover from that problem. When we install a new version of the software, we keep the older versions around as well. Instead of switching over to the new version for good, we set up a rotation that alternates between versions every few days. That way, if the new version is broken and won't let us make contact with the probe, the old version will eventually rotate back in and we'll regain contact. On the other hand, if the new version works out well, we can manually step in and disable the rotation before it switches back.
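As a rough illustration of the idea, the active version could be chosen purely from the calendar, so that even if the new software wedges completely, the next boot lands on the old version on schedule. This is a minimal sketch, not our actual flight code; the function name, the version list, and the three-day cadence are all assumptions for the example.

```python
import datetime

ROTATION_PERIOD_DAYS = 3  # assumed cadence; we only say "every few days"

def select_version(versions, today, epoch):
    """Pick which installed version should run today.

    versions: installed versions, oldest to newest, e.g. ["v1.2", "v1.3"]
    today, epoch: datetime.date; epoch is when the rotation started

    The choice depends only on the date, so a crashed or hung new
    version can't prevent the old one from rotating back in.
    """
    days_elapsed = (today - epoch).days
    slot = (days_elapsed // ROTATION_PERIOD_DAYS) % len(versions)
    return versions[slot]
```

For example, with a rotation that started on January 1, days 0-2 run the first version, days 3-5 the second, and so on; disabling the rotation is just a matter of pinning the returned version.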
Some key considerations for this strategy:
- First, do no harm. The world is full of "safety systems" that actually make things less reliable. We put lots of sanity checks into our rotation software, and if it finds any surprises, it gives up and doesn't change anything.
- Rotation makes the most sense when you have a critical need (in our case, eventually making contact), so you want to try every option, and when none of the options is likely to cause permanent damage to the system. Luckily, our probe is moored in place on the lake, so it's hard for it to get into trouble.
- There is no silver bullet. Even with rotation in place we plan to carefully verify each new software version to the extent that we can.
- It's important to keep the software that does the version rotations simple and independent from the rest of the software. Don't rotate the rotation script!
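One way to honor the "do no harm" rule in code is to run every sanity check before touching anything, and to make the switch itself atomic so a failure partway through can't leave the system in a half-changed state. The sketch below assumes versions live in directories selected by an "active" symlink; that layout, and the specific checks, are illustrative assumptions, not a description of our actual implementation.

```python
import os

def safe_switch(active_link, target_dir):
    """Repoint the active-version symlink, but only if everything
    looks sane. On any surprise, change nothing and report failure.
    """
    # Check everything up front; bail out before making any change.
    if not os.path.isdir(target_dir):
        return False  # target version isn't installed: do nothing
    if not os.path.islink(active_link):
        return False  # expected a symlink; something unexpected is there

    # Build the new link beside the old one, then swap in one step.
    # os.replace is an atomic rename on POSIX, so the link is never
    # missing or half-written, even if we lose power mid-switch.
    tmp_link = active_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)  # clear debris from an earlier attempt
    os.symlink(target_dir, tmp_link)
    os.replace(tmp_link, active_link)
    return True
```

Keeping this script small and self-contained also makes the last point easy to enforce: it lives outside the rotated directories, so no software update can break the thing that recovers from broken updates.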
Time will tell if this is a good strategy. It would be interesting to hear who else has hit on the same idea.