Sol 22 anomaly, File system problem
elakdawalla
post Jun 18 2008, 09:39 PM
Post #1





Today's press release from the Phoenix mission contained the following nugget of information:
QUOTE
Newly planned science activities will resume no earlier than Sol 24 as engineers look into how the spacecraft is handling larger than expected amounts of data.
This sounded alarming, and immediately brought the very scary Spirit sol 18 anomaly to mind. (That anomaly, in brief, had to do with too many files being kept in flash memory, which resulted in Spirit descending into a cycle of continuous reboots that might, if not stopped, have depleted the batteries and killed the rover within a day or two. Through heroic efforts Spirit was recovered and obviously returned to perfect health.)

I requested an interview with someone from JPL and am happy to say I just got a call from Barry Goldstein. I'm copying here the entire text of what he said to me. I will be blogging this but am wondering if someone here could help expand a bit on the business about APIDs (Application Process Identifiers) and what part they play in an operating system. I started off by asking for more detail on the problem, and for him to compare and contrast with Spirit Sol 18.
QUOTE ('Barry Goldstein')
When the anomaly happened with Spirit, we lost communication. We never lost communication or control of the vehicle here. It's quite different. On Spirit we had a file management problem that ran amok.

What happened was, at one of the downlinks on sol 22, the engineering housekeeping data was being looked at by the spacecraft team. And they noticed one of the APIDs for a housekeeping data packet, which is normally generated only one to three times every time we do an uplink, was generated 45,000 times. It was a surprise, to say the least. And the reaction of the team was, the obvious which was concern about why the heck did this happen, and the other issue was we were concerned about two things. One, since the APID priority for this data type was very high, would it starve out any of the science data from being saved overnight because it's now so large? And the resolution of that it turned out, yes indeed, it was that large, and we ended up losing very low priority science data from sol 22. But the scientists are not at all concerned about that. The second concern we had yesterday was, we had a restriction on the amount of time it takes for the spacecraft to boot. I can't remember the total value but it's over 60 seconds. If it doesn't boot within a certain amount of time, it will reset and then eventually go over to the B side (it's block redundant, unlike MER). The reason we were concerned is that this data structure, now which is huge because of these 45,000 blocks, it has to pull that out of the flash as part of the boot process. And so we were concerned it would take too long and therefore it would side-swap. So we took some emergency action last night, and I'm happy to say we got the uplinks in due to the following things. Number one, we updated the priority of that APID such that it will restrict the amount of that data type to be saved in flash. Second thing we did is we lost science operations on sol 23. Third thing we did is up the priority of the downlink of that data structure that we generated so often so that we could retrieve what we have so it could help us diagnose the problem. The current state of the spacecraft is as follows.

We have the data down, we have the spacecraft under control, we have the size of the file system in control such that we're no longer worried about the size of the file system growing and keeping us from booting appropriately. The second thing is, the only restriction we put on science activity for sol 24, which the science team is planning right now, is that they can't save the data to the flash because we want to keep the flash small, we don't want this thing to eat us alive. So what the team is doing now is planning sol 24. However, there's a little paradox here. Because we were in this anomalous state, we requested and received a bunch of contingency passes from MRO and Odyssey. So what ends up happening is we told the science team you can do whatever you want, because the only thing we are worried about was flash, we just are not going to save it to flash when we turn off. And we then told them we have all these passes. So as it turns out, what the science team is planning is the most data-rich sol we've had to date, because we have all these extra passes. I was joking with Peter that he should pray for these things more often because he gets more data.

{What other kind of memory is there besides flash?} We execute out of RAM, and every time we turn the vehicle off to save power at night, charge the batteries, we save off the critical data structures which include this file system with the telemetry that has not been marked as received on the ground. And that's what really ate our lunch is the saving of this to the flash. We ran out of room in the flash and that's what caused them to lose the science data, which was low priority. And then it's the time it takes to read it out of flash and get it down on the ground.

{What's generating all these APIDs?} We have a suspect, and I'd prefer not to go into a lot of detail, but the suspect has to do with the packet counter number for each of the packets that are stored. It's been less than 24 hours so I'd like to let the team get a chance to look at this and analyze it completely. At this point it's our prime suspect but that doesn't necessarily mean it will pan out.

Even though we have had this anomaly, the vehicle is under control. We lost a sol of operations, because when this occurred we stopped the uplink for that sol. We have the vehicle under control, we understand the problem, we don't know the root cause, but we've taken preventive measures to make sure it's still functional without risking a problem.

It's much less scary {than Spirit sol 18} but I'll feel a lot better when we know exactly what's going on. All these things are scary to one degree or another. I'd rather have this problem though; not hearing from a vehicle is disconcerting.
--Emily
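
Goldstein's boot-time worry can be put as simple arithmetic: if boot has to pull every stored block back out of flash, boot time grows with the block count and can eventually cross the reset timeout. Here is a minimal Python sketch of that reasoning. The base boot time, the per-block read time, and the exact timeout are made-up placeholders (the interview only says the limit is "over 60 seconds"), and the linear model itself is an assumption, not the flight software's behavior.

```python
# Hypothetical numbers throughout; only the 45,000 count and the
# "one to three" normal count come from the interview.
BASE_BOOT_MS = 20_000      # assumed boot time with a near-empty file system
PER_BLOCK_MS = 1           # assumed flash read time per stored block
BOOT_TIMEOUT_MS = 60_000   # assumed; the source says only "over 60 seconds"


def boot_time_ms(n_blocks: int) -> int:
    """Linear model: fixed boot cost plus one flash read per stored block."""
    return BASE_BOOT_MS + n_blocks * PER_BLOCK_MS


def boots_in_time(n_blocks: int) -> bool:
    """True if the modeled boot finishes before the reset timeout."""
    return boot_time_ms(n_blocks) <= BOOT_TIMEOUT_MS


def max_safe_blocks() -> int:
    """Largest block count that still boots in time under this model."""
    return (BOOT_TIMEOUT_MS - BASE_BOOT_MS) // PER_BLOCK_MS


if __name__ == "__main__":
    for n in (3, 45_000):
        print(n, boot_time_ms(n), boots_in_time(n))
    print("max safe blocks:", max_safe_blocks())
```

With these placeholder values the normal handful of packets is harmless, while 45,000 blocks would push the modeled boot past the timeout, which is the situation that would lead to a reset and eventually a swap to the B side.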


Deimos
post Jun 19 2008, 12:04 AM
Post #2





The current problem is within an engineering APID, 40. The APID structure and use is similar to MPF, but there are some interesting nuances. There are also more APIDs for both engineering and science (SSI has 13). Flash size motivated the science team to ask for more APIDs. There is a downlink priority table (DPT) in use on any given downlink. There is also a nighttime priority table (NPT) that is used when saving to flash. Higher priority APIDs get saved first, lower ones may not make it. So APIDs map a 2-D space: How urgent is it that we get the data soon? How important is it that we never lose the data?

You can imagine a few kinds of data. An image of the dig we just did may be key for the next planning cycle, thus it has to be high on the DPT. But, if we somehow didn't get it (think Electra), we could as easily reacquire it as save it overnight. So maybe it is low on the NPT (in practice, this specific example tends to be high in both). Or a RAC image of a sample in the scoop just before we deliver that sample: you may not need that image to plan the next sol, but you can never take it again. So, high in the NPT, maybe not in the DPT. A TEGA or WCL run ends up being very high in the NPT; they may also be high in the DPT if, for example, a follow-on TEGA ramp is desired the next sol. And, many things are not urgent and can also be redone. An image of some rock several meters away: if it falls out of flash, just take the picture again. So, for every product generated, a decision has to be made on both urgency and the need to save the data--then APIDs are assigned.
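
For readers who think better in code, the two-table idea can be sketched roughly like this in Python. Everything here is illustrative: the product names, sizes, ranks, the flash budget, and the greedy "fill in NPT order, skip what does not fit" rule are assumptions for the sketch, not the actual flight logic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Product:
    name: str
    size_mb: float
    dpt_rank: int  # 1 = downlinked first (how urgent is it?)
    npt_rank: int  # 1 = saved to flash first (how irreplaceable is it?)


def downlink_order(products):
    """Order products for a downlink pass by DPT rank."""
    return sorted(products, key=lambda p: p.dpt_rank)


def save_overnight(products, flash_mb):
    """Greedily save products in NPT order until the flash budget runs out."""
    saved, dropped = [], []
    remaining = flash_mb
    for p in sorted(products, key=lambda p: p.npt_rank):
        if p.size_mb <= remaining:
            saved.append(p)
            remaining -= p.size_mb
        else:
            dropped.append(p)  # lower-priority data "may not make it"
    return saved, dropped


# Hypothetical products loosely following the examples in this post.
PRODUCTS = [
    Product("dig_image", 3, dpt_rank=1, npt_rank=3),
    Product("tega_run", 4, dpt_rank=2, npt_rank=2),
    Product("rac_scoop_image", 3, dpt_rank=3, npt_rank=1),
    Product("distant_rock_image", 5, dpt_rank=4, npt_rank=4),
]

if __name__ == "__main__":
    print([p.name for p in downlink_order(PRODUCTS)])
    saved, dropped = save_overnight(PRODUCTS, flash_mb=10)
    print([p.name for p in saved], [p.name for p in dropped])
```

With a 10 MB budget in this toy setup, the scoop image, the TEGA run, and the dig image are saved, and the distant rock image is the one that falls out of flash.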

In strategic planning, the data that is neither urgent nor critical to save to flash (especially SSI_LOW) has gotten the nickname "red-shirt" data, and is always vulnerable to loss in the event of even minor problems. Actually we've only lost it a few times though.

A further complication is "sent" data. If the data were specifically for tactical planning, you could treat it as "fire and forget". If the data is a TEGA bake, you cannot. What if the data are lost in transmission and need resending? Thus, the most important "sent" data trumps the least important unsent data (the red-shirts) when saving to flash.

And just when you thought I'd be out of further complications ... what if we could use MRO to get an extra 30-40 Mb of data? But, what if we knew there was a larger risk of losing that data compared to the (now normal) ODY passes? You want to take and send the extra data; but you cannot afford to send urgent or critical data the risky way. Send in the red-shirts. So sometimes the lower priority stuff comes down at 2 PM (Mars time) while the more urgent stuff waits until 4 PM (and the first ODY pass).
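
The "send in the red-shirts" rule might look something like the following Python sketch. The product names, sizes, the `is_red_shirt` flag, and the first-fit packing are all hypothetical and only illustrate the idea that expendable data goes on the riskier pass while everything else waits for the reliable one.

```python
def plan_passes(products, risky_capacity):
    """Split products between a risky early pass and a reliable later pass.

    products: list of (name, size, is_red_shirt) tuples, sizes in the same
    unit as risky_capacity. Only red-shirt data may ride the risky pass.
    """
    risky, reliable = [], []
    remaining = risky_capacity
    for name, size, is_red_shirt in products:
        if is_red_shirt and size <= remaining:
            risky.append(name)
            remaining -= size
        else:
            reliable.append(name)  # urgent, critical, or does not fit
    return risky, reliable


# Hypothetical sol plan; sizes are placeholders.
SOL_PRODUCTS = [
    ("tega_bake", 12, False),
    ("ssi_low_pan", 20, True),
    ("ssi_low_rock", 25, True),
    ("dig_image", 8, False),
]

if __name__ == "__main__":
    print(plan_passes(SOL_PRODUCTS, risky_capacity=30))
```

Here only one of the two red-shirt products fits in the risky pass; the other simply waits for the reliable pass along with the urgent and critical data.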

How close a resemblance does this bear to MER file management? Well, just about none. There are files ... they are managed ... that's about it. Actually, on MER files are managed, on PHX APIDs are managed. MER data have priorities that can be dynamically reassigned (as opposed to moving whole APIDs around) and do not use APIDs for prioritization. MER priorities are for both saving and downlink, and MER is managed to avoid most "auto-deletes" when there is more data than flash. Phoenix cannot be managed that way, since we usually have more downlink in a sol than flash capacity, before even worrying about sent data that needs protecting.
Cargo Cult
post Jun 19 2008, 03:18 PM
Post #3





QUOTE (Deimos @ Jun 19 2008, 02:04 AM)
How close a resemblance does this bear to MER file management? Well, just about none. There are files ... they are managed ... that's about it. Actually, on MER files are managed, on PHX APIDs are managed.

Out of (somewhat nerdy) interest, which operating system (if any) is Phoenix running? I'm sure I read an article somewhere about it being something other than VxWorks as used by the rovers, but I can't remember what exactly it was.

(For everyone else, there's an interesting article here about Spirit's problems - essentially the number of files on flash grew to require more memory than the filesystem module could allocate, forcing the system to reboot, only to try to mount that filesystem again...)

I had a weird sense of mental inversion last night, where Phoenix and friends stopped being space probes with computers inside them, to being computers with space probes built around them. All my laptop asks is - can it go to Mars too? ;-)

mcaplinger
post Jun 19 2008, 04:51 PM
Post #4





QUOTE (Cargo Cult @ Jun 19 2008, 08:18 AM)
Out of (somewhat nerdy) interest, which operating system (if any) is Phoenix running? I'm sure I read an article somewhere about it being something other than VxWorks as used by the rovers, but I can't remember what exactly it was.

Google is your friend. http://blogs.windriver.com/deliman/2008/05...ou-watch-i.html confirms that Phoenix uses VxWorks 5.2.

As for SEUs, the RAD6000 is not very subject to SEUs: http://www.baesystems.com/BAEProd/groups/p..._eis_sfrwre.pdf says 7.4e-10 errors/bit-day in 90% worst-case GEO. Of course, each system costs about a million dollars IIRC.
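
To get a feel for what a per-bit rate like that means, here is a back-of-the-envelope Python sketch. The only number taken from above is 7.4e-10 errors/bit-day; the bit count is a made-up placeholder, and the quoted rate is for worst-case GEO, so it says nothing directly about the Mars surface.

```python
SEU_RATE_PER_BIT_DAY = 7.4e-10  # figure quoted above (90% worst-case GEO)


def expected_upsets(n_bits, days, rate=SEU_RATE_PER_BIT_DAY):
    """Expected upset count, assuming independent bits and a constant rate."""
    return rate * n_bits * days


def mean_days_between_upsets(n_bits, rate=SEU_RATE_PER_BIT_DAY):
    """Average interval between upsets for n_bits exposed bits."""
    return 1.0 / (rate * n_bits)


if __name__ == "__main__":
    n_bits = 10**9  # hypothetical number of susceptible bits
    print(expected_upsets(n_bits, days=1))
    print(mean_days_between_upsets(n_bits))
```

Under that assumed bit count the estimate comes out somewhat under one upset per day; plug in the real number of susceptible bits to get a meaningful figure.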


--------------------
Disclaimer: This post is based on public information only. Any opinions are my own.
lastof7
post Jun 19 2008, 11:17 PM
Post #5





QUOTE (mcaplinger @ Jun 19 2008, 12:51 PM)
Google is your friend. http://blogs.windriver.com/deliman/2008/05...ou-watch-i.html confirms that Phoenix uses VxWorks 5.2.


As a side note, a good example of what can sometimes make spacecraft software difficult. 5.2 was released around '95, I think? The RAD6000 can go up to at least 5.3.1, but VxWorks is now up to 6.6 or so. Newer boards such as the RAD750, the LEON3, etc. reach into the 6.x range, but you're still usually a few revs (along with the corresponding features and bug fixes) behind.
mcaplinger
post Jun 20 2008, 03:25 AM
Post #6





QUOTE (lastof7 @ Jun 19 2008, 03:17 PM)
...you're still usually a few revs (along with the corresponding features and bug fixes) behind.

The core set of VxWorks functionality is so small that I don't know that we're missing that much. Sometimes I'd be happier if they didn't keep "upgrading" things.




