Apr 14 2007, 05:44 PM
I totally agree, Mike. One of the questions I wasn't able to get an answer to, which I would have liked to include in the article, was: how many times did MGS encounter a fault, enter safe mode, and recover successfully because its fault protection worked? Its 10 years were made possible by lots of "lessons learned" from previous missions, and its demise, though sad, does give designers insight into a whole 'nother set of potential faults that they can now plan for, and help make sure it never happens to another mission.
Until robots really do become intelligent, I fear it's much more likely for a long-lived mission to fail unexpectedly due to some bizarre chain of unforeseen events that human programmers just didn't plan for, than for the mission to fail for purely mechanical reasons. It seems to me that we now make plans to end missions before they fail for mechanical reasons, and deorbit them or take some other such protective action. But you just can't plan for every possible human error. You just have to try to plan for everything that's remotely likely. They just didn't plan for this particular bizarre string of events.
Apr 14 2007, 08:52 PM
QUOTE (elakdawalla @ Apr 14 2007, 10:44 AM)
One of the questions I wasn't able to get an answer to, which I would have liked to include in the article, was: how many times did MGS encounter a fault, enter safe mode, and recover successfully because its fault protection worked?
You can read through the status reports at http://mars.jpl.nasa.gov/mgs/status/reports/msop-mgs.html looking for "safe mode", "contingency mode", and "c-mode".
Apr 15 2007, 12:36 AM
Emily, "bizarre strings of events" are almost always how mishaps occur in aviation & probably in every other field of endeavour as well. Good systems engineering strives to minimize design features that might induce single-point and at least some chained failures, but ultimately in the real world external systemic influences add many layers of complexity (and often thousands of variables) that can never be completely controlled. This is concisely and quite accurately summarized in pop culture as "**** happens", of course...
I am convinced that this is a fundamental heuristic of the Universe, and unfortunately probability implies that the most unlikely chain of events will someday occur to induce an uncontrollable amount of entropy into any given system, thus making its future behavior impossible to predict with accuracy. The MGS ground team did nothing fundamentally wrong; in fact, despite the prima facie tone of my previous post, I meant no criticism of them at all. Lessons learned to realize small single-point improvements is all we can do; entropy will always win in the end, despite our best efforts.
Apr 15 2007, 02:54 AM
"I am convinced that this is a fundamental heuristic of the Universe, and unfortunately probability implies that the most unlikely chain of events will someday occur to induce an uncontrollable amount of entropy into any given system, thus making its future behavior impossible to predict with accuracy. The MGS ground team did nothing fundamentally wrong; in fact, despite the prima facie tone of my previous post, I meant no criticism of them at all. Lessons learned to realize small single-point improvements is all we can do; entropy will always win in the end, despite our best efforts."
I have an abiding and long-lived personal fascination with the concept of entropy, because I've been friends with the Los Angeles punk band by that name.
Humans trying to define laws of nature become enforcers of those laws. "The system as defined is now perfect, and it will work indefinitely into the future" is a frustrating mental block in the human effort to define the universe and reform it in our own image.
If entropy killed MGS, it was just tiny particles of entropy. Entropy on a grander scale would turn MGS, Mars and the entire human endeavour into some kind of mush, would it not?
Apr 15 2007, 03:15 AM
QUOTE (brellis @ Apr 14 2007, 06:54 PM)
Entropy on a grander scale would turn MGS, Mars and the entire human endeavour into some kind of mush, would it not?
Actually, a very cold and very diffuse gas of dissociated ions would be a more accurate description of the craft's fate if entropy were taken to its extreme.
Mush has way too much energy and molecular organization.
Apr 15 2007, 03:39 AM
To continue the nihilism: a cold vacuum of slowly decaying protons amidst a sea of barely energized leptons several trillion years from now...
No matter. We all do the best we can, and the MGS team performed WAY beyond any initial expectations. (How inspiring, how refreshing, to see such magnificent dedication, brilliance, and innovation to make this mission last so long, yes? This is the spirit of humanity at its very best.) Each and every one of them should get a medal as far as I'm concerned for making truly significant contributions to human knowledge and exploration. I envy them the private satisfaction they each must feel for doing something that really meant a great deal, not just now but a thousand years from now...
Apr 15 2007, 05:43 AM
QUOTE (nprev @ Apr 14 2007, 08:39 PM)
We all do the best we can, and the MGS team performed WAY beyond any initial expectations. (How inspiring, how refreshing, to see such magnificent dedication, brilliance, and innovation to make this mission last so long, yes? This is the spirit of humanity at its very best.) Each and every one of them should get a medal as far as I'm concerned for making truly significant contributions to human knowledge and exploration. I envy them the private satisfaction they each must feel for doing something that really meant a great deal, not just now but a thousand years from now...
I wholeheartedly agree. Think of how far humanity could go if we all achieved on this level.
I still wonder about a simple Disk Repair program.
Apr 16 2007, 08:48 AM
I read the anomaly report that NASA put out, and I can follow what happened based upon similar experience. I've screwed up enough while at the console to understand what happened.
From some of the comments here, people may have a very different picture of how the usual (old) spacecraft console software works as compared to the reality. I'll try to make a few points, and maybe people can tell me if I'm missing anything.
In my experience, the old console software is not very high tech. The projects are always run by hardware guys, who don't know much about software. And the pressures in the programs, before launch, are almost always hardware driven. So, you end up with software which is not exactly state of the art, being used to control hardware which often IS state of the art.
Most of the old ground software I've used is very manual. So for instance, I can understand exactly how the report's errors occurred. There is a command prompt that asks you for the value of the parameter you want to change, and the memory address in RAM. You type it in. You don't change the redundant side's values at the same time, until you know it worked on the primary. So later, another guy repeats it for the redundant side. He types it in. But he types it in differently than the first guy did. Error 1.
Later, you do a memory dump. These were generally crude tools that spit out pages and pages of hard copy, in hex, with very little technology to help you make sense of it. It is fingers moving over the page, finding values in two places that match (or don't). But as usual, they found the problem. Good job, team.
Now they do it all again. Run the memory update program, enter the addresses of the parameters by hand, then enter the correct value by hand. This time the parameter value was entered correctly, but they typed in the address wrong. Error 2. I've done that. And it isn't pretty. Generally, the old console software won't catch it. It will do whatever you tell it to do, and put whatever you want in any location. There are no limit checks, no graphical displays to show you in what location these parameters are actually going. There's nothing to back you up. So you should have people double check what you're doing. But into your fourth mission extension, it may not seem that important. Your computers have been shoved into some corner to make way for more important things, you have people working part time whose real focus is on other things, none of your managers are paying attention to the mission anymore, so nothing you do on it seems like it is going to exactly help your career. The edge, you could say, is missing, and inevitably things happen. Usually recoverable, sometimes not.
You are not working with changes in operating systems. No one changes the operating system once it is launched, unless you do a patch to fix a serious problem. You do everything in your power to forget you have an operating system. You just work with parameters, whenever possible. And you change parameters by making direct writes of numbers to specified memory locations, all of which are entered manually at a prompt. Type in either a bad address, or a bad parameter value, and if it goes through unnoticed, you have a time bomb in your RAM. A memory location where the parameter should be between X and Y, and you just put in a value of Z = 3Y.
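To make the "no limit checks" point concrete, here is a minimal Python sketch of the kind of check a console could run before sending a manual memory write. Everything in it is an assumption made for illustration: the parameter name, the address, the limits, and the functions are hypothetical and are not taken from the MGS ground system or the anomaly report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParamSpec:
    """Hypothetical description of one tunable parameter in spacecraft RAM."""
    name: str
    address: int   # where the parameter is supposed to live
    low: float     # the "X" above: smallest acceptable value
    high: float    # the "Y" above: largest acceptable value


class UnsafeWrite(Exception):
    """Raised instead of sending a write that fails a check."""


def check_write(table, name, address, value):
    """Return (address, value) only if the write agrees with the parameter table."""
    spec = table.get(name)
    if spec is None:
        raise UnsafeWrite(f"unknown parameter {name!r}")
    if address != spec.address:
        # Catches the "right value, wrong address" kind of typo.
        raise UnsafeWrite(
            f"{name}: address 0x{address:04X} does not match table 0x{spec.address:04X}"
        )
    if not (spec.low <= value <= spec.high):
        # Catches the "Z = 3Y" kind of typo.
        raise UnsafeWrite(f"{name}: value {value} outside [{spec.low}, {spec.high}]")
    return address, value


# Made-up table entry, for illustration only.
TABLE = {
    "example_limit_deg": ParamSpec("example_limit_deg", address=0x1A40, low=0.0, high=90.0),
}
```

With a table like this, a call such as check_write(TABLE, "example_limit_deg", 0x1A44, 45.0) would be refused before anything is radiated, because the typed address does not match the table. The hard part in practice is presumably keeping such a table correct and current, which is a budget question as much as a software one.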
A lot of these problems with the ground systems are now fixed, with missions starting off with much better ground systems than the older missions had. MGS launched in 1996; the ground system software was locked down at least six months or a year before that. The software was probably based on designs from the early nineties. As with everything else, the ground software has changed a lot between 1992 and 2007. And it has changed because of exactly the kinds of errors that got made on MGS. But since no program ever spends budget to improve the working software of old missions, things like this can happen.
As for safe modes, keep in mind that any spacecraft safe mode is designed to handle a single fault. No one even attempts two-fault solutions, because anything beyond single-fault planning gives you an almost infinite number of possibilities to plan against, which cannot be done on the budget you have. And when you enter safe mode, the flight code uses defined parameters in RAM. You can have a perfectly lovely safe mode definition, but if the parameters have been corrupted, all bets are off; anything can happen. If you think that using some sort of safe mode that is absolutely hard coded would be safer, I would disagree. Things are learned after launch, often very very disturbing things. Having the flexibility to alter the parameters is much safer than not. And this flexibility allows you to tailor the safe mode to things like failed hardware, which cannot be planned for in advance.
There is talk about how the lessons learned from this will include periodic end-to-end reviews, looking into how the manned program does things, and ways to keep the operators fresh and enthused. Well, end-to-end reviews that will actually be detailed enough to catch parameter discrepancies are long, detailed, require experts who are working on current programs with tight deadlines and budgets, and require money to fund them. The human spaceflight side has a lot more money for these things, because lives are at stake. Unmanned missions get their fourth extension based on the fact that they promise to spend almost no money at all, otherwise the spacecraft would have been shut down and hurtled into the planet. These are the sorts of things managers say at times like this, but when it comes down to funding them, count me as quite sceptical. New missions will take priority for the cash. And sometimes, that is the right decision.
There are people out there who know a lot more about the MGS specifics than I do. If I'm way off, let me know. But this was my take on the whole thing, for what it's worth.
Apr 16 2007, 02:56 PM
Interesting and valuable insight from someone who's been there, cndwrld...thank you!
There's a lot of trade space between flexibility & foolproofing in human/machine interface, but in your general examples it sure sounds like the bias is sometimes set too far to the former. Setting up a table of parameters in MS Access or something for each of the redundant databases & then continuously comparing them for equality (and flagging fields that don't match) doesn't seem too hard or expensive to build.
Foolproof? No, nothing really is. I'm sure that many if not most SV operators do something exactly like this, and bad things still happen.
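The table comparison suggested above can be sketched in a few lines of Python rather than in a database. This is only an illustration under assumed inputs: the two dictionaries stand in for parameter tables read back from the primary and redundant sides, and the parameter names and values are invented.

```python
def compare_sides(primary, redundant):
    """Return a sorted list of (name, primary_value, redundant_value) mismatches.

    A parameter missing from one side is reported with None for that side.
    """
    mismatches = []
    for name in sorted(set(primary) | set(redundant)):
        a = primary.get(name)
        b = redundant.get(name)
        if a != b:
            mismatches.append((name, a, b))
    return mismatches


# Invented values, for illustration only.
primary_side = {"param_a": 10, "param_b": 20, "param_c": 5}
redundant_side = {"param_a": 10, "param_b": 25}

for name, a, b in compare_sides(primary_side, redundant_side):
    print(f"MISMATCH {name}: primary={a} redundant={b}")
```

A check like this only says that the two sides disagree. It cannot say which side is right, and it says nothing when both sides carry the same wrong value, so it complements a second pair of eyes without replacing it.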
Apr 16 2007, 03:37 PM
That was an excellent summary, Don -- it demonstrates what I've been saying all along, that the limitations on most every human endeavor have more to do with financial and schedule pressures than they do with the limits of our technology or imagination.
Now, as we all know, there are a lot of ways to automate the processes you discuss. Heck, back in Gemini days, more than 40 years ago, command loads to the Agena target vehicles were sent up pretty much exactly as you describe here. But even back then, they had an automatic comparator that would check the command load as sent against the command load as received by the Agena. Only when that comparator failed did they end up digging through printouts of the command loads to verify that the load was properly received.
Now, that's not exactly the same as comparing an actual command load to a desired command load, but it's similar in process. And thus, the technology to error-check a lot of this stuff has been around for a long time.
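As a rough idea of what a sent-versus-received comparator does, here is a minimal Python sketch. It assumes the load as sent and the load as echoed back are both available as byte strings; the function names and the sample bytes are hypothetical and do not represent the Gemini/Agena system or any real ground software.

```python
def differing_offsets(sent: bytes, echoed: bytes):
    """Byte offsets where the echoed load differs from the load as sent."""
    diffs = [i for i, (a, b) in enumerate(zip(sent, echoed)) if a != b]
    # A length difference is reported as the offsets past the shorter load.
    shorter, longer = sorted((len(sent), len(echoed)))
    diffs.extend(range(shorter, longer))
    return diffs


def loads_match(sent: bytes, echoed: bytes) -> bool:
    """True only when the echoed load is byte-for-byte what was sent."""
    return not differing_offsets(sent, echoed)


# Invented sample data: one byte corrupted at offset 1.
sent_load = bytes([0x10, 0x20, 0x30, 0x40])
echoed_load = bytes([0x10, 0x21, 0x30, 0x40])

if not loads_match(sent_load, echoed_load):
    for offset in differing_offsets(sent_load, echoed_load):
        print(f"offset {offset}: load differs")
```

The list of offsets tells a reviewer where to look, so the pages of hex only come out when the automatic check has already flagged something.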
As you have so effectively pointed out, the ground support stuff is usually designed (or used off-the-shelf) to do its job, bare-bones, no extras. Error trapping is almost non-existent.
And lest anyone think that this is just an issue with ESA's efforts, recall that an average command load to the MERs requires most of an individual's workday to prepare -- seven or eight hours. We all know it's *possible* to create error-trapping front-end software for such things that would allow a rover driver to tell the front-end interface: "We want to drive 20 meters in this specific direction, take the following image series, and then prepare for an overnight Odyssey pass." It's very possible to set it up so that creating and radiating the appropriate command series would take the rover driver 10 or 15 minutes, and the front-end would ensure that all commands sent to the spacecraft would be safe and properly executable. Why isn't it done like that? Probably because it would have cost too much in time and money to develop such a front-end system in the first place...
-the other Doug