Robert Walker

Well, the software is very thoroughly tested, so it doesn’t have bugs in the normal sense. But it did have what you could call a bug in design rather than a mistake of programming.

This showed up on sol 200, so 200 Martian days after the landing. The main computer ran into issues with its memory, reporting dozens of memory glitches. Soon after that, it refused to go to sleep, and also refused to do almost any of the work it was scheduled to do for the day. It wouldn’t do anything that would lead it to write to memory.

When things go wrong with the main computer, Curiosity should automatically activate its “co-pilot” - a second computer that is all set to take over if the main one glitches. But with this glitch, it didn’t do that.

So, would a simple reboot fix it, or was it safe to leave it alone? As you can imagine, the software engineers were tearing their hair out, and they found that there just wasn’t any way to find out, by interrogating it, whether a reboot was a safe thing to do.

Meanwhile, two of the engineers managed to replicate the behaviour by damaging the memory in their backup computer. They called up with a report:

“We’re able to reproduce the situation, but you’re not going to like it. The next time the rover tries to communicate, it will probably hang up and turn the radio off. The fault protection never trips and does not try to fix the problem.”

So in other words, just a few hours from then, it was about to just switch itself off, and turn into an inert lump of metal on Mars that they’d never be able to communicate with again.

They tried to solve it and they came to the conclusion there was only one way out of this, to try to kill the computer. If they did that, hopefully Curiosity’s backup computer would take over. This would leave Curiosity with only one computer, and if that one didn’t work, that would be the end of the mission.

So anyway they did that - they found a way to basically kill the main computer, so that it would be forced to reboot using its backup computer. See: Mars Rover Curiosity

Luckily the backup computer did wake up and take over, and that fixed the problem. But it was a close call. They have now repaired the primary computer, which can be used as a backup if the same thing happens to the “co-pilot” in the future. See: The Software Bug That Almost Killed Curiosity Just Six Months In

It shows that with software like that you have to think not just about how it handles normal situations, but also about how it copes with memory failures and other anomalies.
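To give a flavour of what “coping with memory failures” can mean in code, here is a minimal Python sketch. It is not how Curiosity’s flight software works. The function names (pack, unpack, read_with_fallback) and the idea of keeping two copies of each record are assumptions made up for the illustration. Each record is stored with a CRC-32 checksum, and a reader that finds a damaged copy falls back to the second copy instead of trusting bad data.

```python
import zlib
from typing import Optional


def pack(payload: bytes) -> bytes:
    """Append a 4-byte CRC-32 checksum to the payload."""
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def unpack(record: bytes) -> Optional[bytes]:
    """Return the payload if its checksum matches, otherwise None."""
    if len(record) < 4:
        return None
    payload = record[:-4]
    stored_crc = int.from_bytes(record[-4:], "big")
    return payload if zlib.crc32(payload) == stored_crc else None


def read_with_fallback(primary: bytes, backup: bytes) -> bytes:
    """Try the primary copy first, then the backup (hypothetical layout)."""
    for copy in (primary, backup):
        payload = unpack(copy)
        if payload is not None:
            return payload
    # Both copies are damaged: report it loudly rather than carry on silently.
    raise RuntimeError("both copies failed their checksum")
```

The point of the last line is the same as the point of the story: when the data can’t be trusted, the failure has to be visible to whatever is supposed to react to it.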

They did do a lot of conventional bug fixing and testing, and they also wrote the code in such a way that it could be checked with automatic bug detection tools, a method used a fair bit now for the most critical code. It makes the code much slower to write, but more reliable, which is the priority. They ran GrammaTech’s CodeSonar every night while working on the code. They also used Coverity.

But that’s not enough by itself. It can’t catch conceptual problems, where the program works as designed and has no software bugs in it, but nobody asked a question like this one: “what happens if the memory is corrupted in a particular way?”

This automatic bug detection can’t even catch the mismatch of imperial and metric units that crashed the Mars Climate Orbiter, or the mistake Schiaparelli made of cutting off its rocket engines too soon. That happened because the spacecraft spun too quickly, which confused it so that it thought it had suddenly jumped from miles high to below ground level in seconds.
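The units mix-up is a good example of something that has to be designed in rather than detected afterwards. In the Mars Climate Orbiter case, one piece of software produced impulse values in pound-force seconds while another expected newton seconds. One possible defence, sketched here in Python, is never to pass a bare number around, but to wrap it in a type that knows its unit. The Impulse class and its method names are hypothetical, invented for this illustration.

```python
from dataclasses import dataclass

# Conversion factor: 1 pound-force second expressed in newton seconds.
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """An impulse that is always stored internally in newton seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # The caller has to state the unit, so the conversion can't be forgotten.
        return cls(value * LBF_S_TO_N_S)

    def __add__(self, other: object) -> "Impulse":
        if not isinstance(other, Impulse):
            # Adding a bare number is refused; Python turns this into a TypeError.
            return NotImplemented
        return Impulse(self.newton_seconds + other.newton_seconds)


total = Impulse.from_newton_seconds(10.0) + Impulse.from_pound_force_seconds(1.0)
```

This only helps if everyone agrees to use the wrapper at the boundaries between teams and programs, which is again a matter of specification rather than of bug hunting.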

PARALLEL CASE OF SCHIAPARELLI

The Schiaparelli crash is rather similar in some ways - not an exact analogy, more a parallel. It wasn’t a memory corruption; instead the spacecraft was spinning too quickly, probably because of “uneven disintegration of its thermal blankets and associated hardware” (6.2.3 of the Inquiry). Basically it probably had some stuff stuck to it which got caught in the wind of re-entry and set it spinning, something that also happened during the Curiosity landing.

This confused its software, which was written assuming that its spin rate wouldn’t go above a maximum value. The software was written correctly, was bug free, and behaved exactly as it was meant to. It’s the specification for the software that was the problem.
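As a rough illustration of what a more robust specification could look like, here is a Python sketch that adds up a spin rate over time but also tracks how long the sensor has been pinned at its maximum reading. The limit values, the function name and the idea of a “trustworthy” flag are all assumptions for the example, not details taken from the Inquiry or from the real flight software.

```python
from typing import Iterable, Tuple

GYRO_MAX_RATE_DEG_S = 180.0  # hypothetical largest rate the sensor can report


def integrate_attitude(
    rates_deg_s: Iterable[float],
    dt_s: float,
    max_saturated_s: float = 0.015,  # hypothetical tolerance for a pinned sensor
) -> Tuple[float, bool]:
    """Integrate rate samples into an angle and say whether to trust it."""
    angle_deg = 0.0
    saturated_for_s = 0.0
    trustworthy = True
    for rate in rates_deg_s:
        if abs(rate) >= GYRO_MAX_RATE_DEG_S:
            # The true rate may be higher than the sensor is able to report.
            saturated_for_s += dt_s
            if saturated_for_s > max_saturated_s:
                trustworthy = False
        else:
            saturated_for_s = 0.0
        angle_deg += rate * dt_s
    return angle_deg, trustworthy
```

Code downstream can then decide what to do with an untrustworthy angle, for example by leaning on other sensors, instead of treating it as fact.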

As David Parker, ESA head of robotic exploration, said in an interview:

“The software behaved the way it was supposed to. It should have been anticipated that the rotation could reach the maximum. The software could have been more robust had it been more cleverly designed.”

Schiaparelli landing investigation completed

That’s not dissimilar - the software worked just fine in normal circumstances. But when something strange happened, as with this anomalously fast spin, it needed an extra level of attention to make sure that there was a “sanity check”. Their investigation concluded that with something like that it would probably have landed successfully. The “sanity check” software would have just said “this reading must be wrong, I can’t suddenly be below ground level after being miles high moments earlier”, continued with the mission as if it was nominal, and used other methods to double check its altitude.

Its behaviour was quite strange if you think of it in human terms, though of course computers are not human. It detected that it was below ground level, so it fired its retrothrusters for the minimum possible length of time. Then, having “landed” as it thought, from its detected position under the surface of Mars, it started its post landing sequence - all of this while falling towards the ground at great speed, undetected.

A basic sanity check would include rules such as

  • It can’t change from 3.7 km above the ground to below ground level in one second.
  • It can’t be below ground level.

It would also check the acceleration and check with the pre-flight timetable for basic sanity. See recommendation 05 of the Inquiry.
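The two rules above can be sketched in a few lines of Python. This is only an illustration of the idea, not the check the Inquiry specified. The names AltitudeReading, is_plausible and accept_plausible, and the 500 m/s limit on descent rate, are assumptions made up for the example.

```python
from dataclasses import dataclass
from typing import List

MAX_DESCENT_RATE_M_S = 500.0  # hypothetical upper bound on how fast altitude can fall


@dataclass
class AltitudeReading:
    time_s: float
    altitude_m: float


def is_plausible(previous: AltitudeReading, current: AltitudeReading) -> bool:
    """Apply the two sanity rules to a new altitude reading."""
    # Rule: it can't be below ground level.
    if current.altitude_m < 0.0:
        return False
    elapsed_s = current.time_s - previous.time_s
    if elapsed_s <= 0.0:
        return False
    # Rule: it can't lose altitude faster than is physically believable.
    drop_m = previous.altitude_m - current.altitude_m
    return drop_m <= MAX_DESCENT_RATE_M_S * elapsed_s


def accept_plausible(readings: List[AltitudeReading]) -> List[AltitudeReading]:
    """Keep only readings consistent with the last accepted one."""
    accepted: List[AltitudeReading] = []
    for reading in readings:
        if not accepted:
            if reading.altitude_m >= 0.0:
                accepted.append(reading)
        elif is_plausible(accepted[-1], reading):
            accepted.append(reading)
    return accepted
```

With these numbers, a reading of 3.7 km followed one second later by a reading below ground level is simply rejected, and the next believable reading is compared against the last accepted one.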

They recommended collaborating in future with NASA, who have more experience of landing on Mars, and asking them to validate their models. They also recommended that a third party check over the software for issues like this.

The report is here: ExoMars 2016 - Schiaparelli Anomaly Inquiry

So, now we have these tools that can help us to produce virtually bug free code - at great expense but worth it for critical missions like that. But they still don’t get rid of the possibility of conceptual bugs like that, where the specification itself doesn’t take account of all the things that could go wrong, and doesn’t have code in place to notice those situations and deal with them if they arise.

We can do that too, but it is harder and more intuitive. It’s based on understanding and experience, and I’m not sure how easily it can be automated. I can’t see computers doing that any time soon, as at some level you need to have some understanding of the physics of the situation: what’s actually happening in the real world, and not just what is specified in the coding. Even expert systems are trained by humans who understand what is really going on. Deep Blue has no idea what a chess set is, AlphaGo has no idea what Go is, and the self driving cars have no idea what a car or a road is. In all of our clever programming to date, impressive as it is, even with all the “deep learning” and everything, there’s nothing there that truly “understands” in the way a human does.

About the Author

Robert Walker

Writer of articles on Mars and Space issues - Software Developer of Tune Smithy, Bounce Metronome etc.
Studied at Wolfson College, Oxford
Lives in Isle of Mull
Top Writer: 2017, 2016, and 2015
Published Writer: HuffPost, Slate, and 4 more