Unmanned Spaceflight.com _ Spirit _ NASA rover reboots twice over Easter weekend

Posted by: Deimos Apr 14 2009, 12:47 AM

http://www.theregister.co.uk/2009/04/14/nasa_reboot_over_easter_weekend/

Moderator adding: the JPL press release is http://marsrovers.jpl.nasa.gov/newsroom/pressreleases/20090413a.html.

Posted by: monty python Apr 14 2009, 06:04 AM

One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload?

And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Brian

Posted by: Tesheiner Apr 14 2009, 07:36 AM

> One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission.

Huh? I can't see how reboots would cause wear and tear on the computer; perhaps it's the opposite: wear and tear on the computer causing reboots.

> Didn't Spirit just have some computer issues before the latest software upload?

See here: http://marsrovers.jpl.nasa.gov/mission/status_spiritAll.html#sol1797

Edited:
> And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Just checked today's imaging plan for Opportunity and it has all signs of a driving sol.

Posted by: tedstryk Apr 14 2009, 11:55 AM

I don't think the early boots would have done permanent damage to the computer. Brian, shouldn't you be looking on the bright side of life? :)

Posted by: Phil Stooke Apr 14 2009, 12:34 PM

"Brian, shouldn't you be looking on the bright side of life? "

Good one, Ted

Phil

Posted by: RoverDriver Apr 14 2009, 02:06 PM

QUOTE (monty python @ Apr 13 2009, 10:04 PM)
One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload?

And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Brian


I am on vacation this week (Spring Break with my kids in the PNW) so I do not know what's going on at Gusev. I know Opportunity is driving (forwards!!).

Related to computer booting: I don't think that adds wear and tear. I know of a company that built an empire around computer rebooting. :)

Paolo

Posted by: helvick Apr 14 2009, 03:59 PM

I don't think reboots should affect much, but Flash memory does degrade with use. It takes a while, but we are running into fairly large data volumes over the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell, but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there was any significant "hotspot" on the Flash drive that was getting a lot more write activity than the average.

However, I suspect that if this were the root cause, Opportunity would be more likely to exhibit the problem, as I'm pretty sure she has delivered more data. And given the use of deep sleep mode, any wear related to the boot process should also hit Opportunity sooner than Spirit, since the former has made much more use of it, IIRC.

Here's hoping it was just some freak occurrence of cosmic ray hits.
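
For a rough feel of the numbers in the post above, here is a minimal back-of-envelope sketch in Python. Every input is an assumption for illustration: a 256 MB flash, 30 GB of lifetime writes standing in for "a few tens of GB", and a made-up hotspot that holds 0.1% of the flash but absorbs 10% of all writes. Only the 100k cycle rating comes from the post.

```python
# Back-of-envelope flash wear estimate. All inputs are illustrative assumptions.

FLASH_CAPACITY_BYTES = 256 * 1024**2   # assumed flash size (256 MB)
TOTAL_WRITTEN_BYTES = 30 * 1024**3     # assumed lifetime writes ("a few tens of GB")
RATED_CYCLES = 100_000                 # per-cell rating quoted in the post


def average_cycles(total_written: float, capacity: float) -> float:
    """Write cycles per cell if writes are spread perfectly evenly."""
    return total_written / capacity


def hotspot_cycles(total_written: float, capacity: float,
                   hot_fraction: float, hot_share: float) -> float:
    """Cycles seen by a hot region that holds `hot_fraction` of the capacity
    but absorbs `hot_share` of all writes (both hypothetical)."""
    return (total_written * hot_share) / (capacity * hot_fraction)


avg = average_cycles(TOTAL_WRITTEN_BYTES, FLASH_CAPACITY_BYTES)
hot = hotspot_cycles(TOTAL_WRITTEN_BYTES, FLASH_CAPACITY_BYTES,
                     hot_fraction=0.001, hot_share=0.10)

print(f"average cycles per cell: {avg:.0f}")
print(f"hotspot cycles per cell: {hot:.0f}")
print(f"hotspot as a fraction of the rating: {hot / RATED_CYCLES:.1%}")
```

Under these assumptions the evenly spread case is nowhere near the rating, and only a fairly extreme hotspot gets within an order of magnitude of it.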

Posted by: stevesliva Apr 14 2009, 04:41 PM

QUOTE (helvick @ Apr 14 2009, 11:59 AM)
I don't think reboots should affect much, but Flash memory does degrade with use. It takes a while, but we are running into fairly large data volumes over the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell, but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there was any significant "hotspot" on the Flash drive that was getting a lot more write activity than the average.


Even if the memory doesn't use the algorithms that balance write cycling (and flash architecture usually needs to balance only by sector or page, or whatever the minimum chunk of memory is that can be erased before reprogramming, not by individual bit), it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that. And if there is overhead in the ECC, a single bad bit isn't going to kill the word.

Posted by: ugordan Apr 14 2009, 05:28 PM

QUOTE (stevesliva @ Apr 14 2009, 06:41 PM)
And if there is overhead in the ECC, a single bad bit isn't going to kill the word.

Slightly related to this, while browsing through the recent Cassini PDS release info, I noticed they detected a bad spot in one of its SSRs causing double-bit errors (so they're not caught) and various kinds of corruption in the ISS images. They were planning on developing a SW patch to avoid the bad segment.

Memory corruption sucks, doesn't it?

Posted by: stevesliva Apr 14 2009, 07:06 PM

QUOTE (ugordan @ Apr 14 2009, 01:28 PM)
Memory corruption sucks, doesn't it?

Yeah, there's nothing worse than field returns in my business. Hard to bring 'em back from space, though. Sure it's not merely uncorrectable? I'd guess double-bit errors are detectable. (But not necessarily.) It is a good point, though, that once you have an always-bad bit, your overhead *is* shot, and your transient errors will be uncorrectable in all likelihood.

Posted by: tanjent Apr 16 2009, 04:31 PM

QUOTE (stevesliva @ Apr 15 2009, 12:41 AM)
...it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that. And if there is overhead in the ECC, a single bad bit isn't going to kill the word.


I don't know whether to interpret the 100K cycles "guarantee" as a minimum, mean, median, or even a modal value. But surely the second moment must be important in this sort of problem. If the sigma for expected failure is wide enough around, say, 100K, and if a particular programming operation "samples" from, say, 20KB (WAG) worth of cells, the chance of the program crashing must become significant long before the average cell accumulates 100K read/writes. (I take it ECC refers to some kind of error correction, which probably can catch and correct the early failures if they are rare enough.)
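
The "second moment" argument above can be sketched numerically. The following Python snippet is only a toy model: it assumes cell lifetimes are normally distributed (real flash wear-out is unlikely to be that tidy), takes 100K as the mean, picks a sigma of 20K out of thin air, and treats the 20KB guess as 20 * 1024 * 8 independent one-bit cells. It ignores ECC entirely.

```python
import math


def cell_fail_prob(cycles: float, mean: float, sigma: float) -> float:
    """P(a single cell has worn out by `cycles`) under a normal lifetime model."""
    z = (cycles - mean) / sigma
    return 0.5 * math.erfc(-z / math.sqrt(2.0))  # standard normal CDF at z


def any_fail_prob(cycles: float, mean: float, sigma: float, n_cells: int) -> float:
    """P(at least one of `n_cells` independent cells has worn out by `cycles`)."""
    p = cell_fail_prob(cycles, mean, sigma)
    if p >= 1.0:
        return 1.0
    # 1 - (1 - p)**n, written so that very small p does not round to zero
    return -math.expm1(n_cells * math.log1p(-p))


MEAN = 100_000.0          # assumed mean lifetime in write cycles
SIGMA = 20_000.0          # assumed spread (hypothetical)
N_CELLS = 20 * 1024 * 8   # the 20KB guess, counted as one bit per cell

for cycles in (5_000, 10_000, 20_000, 50_000):
    p_any = any_fail_prob(cycles, MEAN, SIGMA, N_CELLS)
    print(f"{cycles:>6} cycles: P(at least one worn-out cell) = {p_any:.3g}")
```

With these made-up parameters, the chance that at least one cell in the region has failed is already substantial at a tenth of the mean lifetime, which is the point being made: the tail matters far more than the average.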

Posted by: stevesliva Apr 16 2009, 06:57 PM

If the program is merely reading from those cells, it's not an issue. Just writing. So you could use flash as instruction memory that you might update a few times in a mission, and you can use it as a storage repository for photos. Even if you filled the flash every sol, we're not at 2000 yet. What you cannot use it as is RAM: a scratchpad for doing calculation.

Yes, the ECC is there (if it's there) to correct errors in the memory word. For 128 bits, you might write 16 extra syndrome bits that algorithmically would allow you to correct a single bit in the 128 that is wrong. To my knowledge, the ECC isn't there to correct the hard errors that come with exceeding write cycling, it's there to correct for errors that just happen on occasion, in fantastically mind-boggling, flash-specific ways. But it would help cover up hard errors.
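
To make the single-bit correction idea concrete, here is a minimal sketch of a textbook Hamming code in Python. It is not a claim about what the rovers' flash actually uses; the thread does not say. The function names are made up for illustration. One thing the sketch shows is that a plain Hamming code needs only 8 check bits to correct one bad bit in 128 (9 if you also want to detect double-bit errors), so the 16 in the post leaves room for a stronger code.

```python
def check_bits_needed(data_bits: int) -> int:
    """Smallest r with 2**r >= data_bits + r + 1 (single-error correction)."""
    r = 0
    while (1 << r) < data_bits + r + 1:
        r += 1
    return r


def encode(data: list[int]) -> list[int]:
    """Place data bits at non-power-of-two positions (1-indexed) and fill the
    power-of-two positions with parity bits."""
    r = check_bits_needed(len(data))
    n = len(data) + r
    code = [0] * (n + 1)                  # index 0 is unused
    bits = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data position
            code[pos] = next(bits)
    for i in range(r):
        p = 1 << i
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:
                parity ^= code[pos]
        code[p] = parity
    return code[1:]


def syndrome(code: list[int]) -> int:
    """XOR of the positions of all set bits; 0 for a valid codeword,
    otherwise the 1-indexed position of a single flipped bit."""
    s = 0
    for pos, bit in enumerate(code, start=1):
        if bit:
            s ^= pos
    return s


def correct(code: list[int]) -> tuple[list[int], int]:
    """Return (repaired copy, syndrome). Only single-bit errors are fixed."""
    s = syndrome(code)
    fixed = list(code)
    if 0 < s <= len(fixed):
        fixed[s - 1] ^= 1
    return fixed, s


word = [(i * i + i // 3) % 2 for i in range(128)]   # arbitrary 128-bit word
stored = encode(word)
damaged = list(stored)
damaged[40] ^= 1                                     # flip the bit at position 41
repaired, where = correct(damaged)
print(len(stored) - len(word), "check bits; error found at position", where)
```

A second flipped bit in the same word would produce a misleading syndrome, which is why an always-bad bit uses up the correction margin for transient errors.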

To guarantee 100K cycles, you have to bear in mind that, yes, you might be making this guarantee for over 16 billion cells on a 16Gb chip. So if your guarantee for your typical statistical cell meets that to even 10 sigma--or whatever one in a billion cells not meeting the spec would mean--you're still going to get fails on that chip. What they do to spec 100K would be a combination of test (throw out entire bad chips), redundant cells and repair (find the bad bits and fix them... how you find suspect bits without destroying a chip-- top secret), and the aforementioned ECC if your process engineers can't totally solve this particular problem. And yeah, you might still get a cell in an iPod somewhere that goes bad before its time, but the stats guys are trying really hard to ensure that that is extremely rare by eliminating the tail of the distribution. My point was that the actual center of the distribution is still going to be somewhere far far above 100K to make this guarantee.

Just delivering a memory chip that works from Day 0 is a similar game of stats... even if your process engineers deliver a process where only one in a million cells is failing a spec, every single 1Gb chip would have on average 1000 bad cells! So after manufacturing, there is a lot of test to be done to fix things and eliminate those fliers. At the same time, there are 900 million cells that greatly exceed the spec.
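
The arithmetic in the last two paragraphs can be checked with a short Python sketch. It assumes one bit per cell, independent cell failures, round decimal chip sizes (1 Gb = 10**9 cells), and a normal distribution when converting a tail probability to a sigma level; none of that is stated in the post, so treat the output as a rough guide only.

```python
import math
from statistics import NormalDist


def expected_bad_cells(n_cells: float, p_bad: float) -> float:
    """Average number of out-of-spec cells on a chip."""
    return n_cells * p_bad


def prob_no_bad_cell(n_cells: float, p_bad: float) -> float:
    """P(every cell on the chip meets spec), assuming independent cells."""
    return math.exp(n_cells * math.log1p(-p_bad))


def one_sided_sigma(p_tail: float) -> float:
    """How many sigma below the mean a one-sided normal tail of `p_tail` sits."""
    return -NormalDist().inv_cdf(p_tail)


# One-in-a-million bad cells on a 1 Gb chip (the Day 0 example).
print(expected_bad_cells(1e9, 1e-6))

# One-in-a-billion bad cells on a 16 Gb chip (the 100K cycle example).
print(expected_bad_cells(16e9, 1e-9))
print(prob_no_bad_cell(16e9, 1e-9))

# Sigma level that corresponds to a one-in-a-billion tail.
print(one_sided_sigma(1e-9))
```

Under the normal assumption, one in a billion works out to roughly a 6 sigma one-sided tail, and even at that rate a 16 Gb chip is very unlikely to have zero bad cells, which is why test, repair and ECC are all needed.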

Posted by: fredk Apr 17 2009, 03:34 AM

A few details on the Spirit anomalies in http://marsrovers.jpl.nasa.gov/mission/status_spiritAll.html#sol1872

Posted by: Nomadd22 Apr 18 2009, 08:10 PM

Guarantee is not a term you'd use for a Martian rover. It's a business term. The chips would probably have something like a mean time between (or before) failure rating. Age can be a factor, since almost all mechanical failures are from thermal cycles. And of course you have random failures that are as likely on day one as on day five thousand.
I'd guess they could map out physical bit failures in memory, but don't really know if that was included.

Posted by: ilbasso Apr 19 2009, 01:03 AM

QUOTE (fredk @ Apr 16 2009, 10:34 PM)
A few details on the Spirit anomalies in http://marsrovers.jpl.nasa.gov/mission/status_spiritAll.html#sol1872


Can someone please explain in clearer English this extract from the above-referenced update: "no sol number for Spirit corresponded to April 2, 2009, using the criterion of the date in Los Angeles at local solar noon on Mars"?

Thanks,
Jonathan

Posted by: tfisher Apr 19 2009, 01:35 AM

QUOTE (ilbasso @ Apr 18 2009, 08:03 PM)
Can someone please explain in clearer English this extract from the above-referenced update: "no sol number for Spirit corresponded to April 2, 2009, using the criterion of the date in Los Angeles at local solar noon on Mars"?


A solar day on Mars is a bit longer than a day on Earth: about 24 hours 39 minutes (per Wikipedia, http://en.wikipedia.org/wiki/Timekeeping_on_Mars). When they convert Martian sols to Earth dates, they are saying they take the date as the date in Los Angeles at the time of solar noon for the rover. With the slightly longer Martian day, it must have worked out that noon on sol 1865 landed just before midnight on April 1st, while noon on sol 1866 landed in the wee hours of the morning on April 3, twenty-four and a half hours later (or so).
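
The skipped date can be reproduced with a small Python sketch. The sol length (24 h 39 min 35.244 s) is the standard figure; the 23:40 Los Angeles time for noon on sol 1865 is purely hypothetical, chosen only to be "just before midnight", and is not the real value for Spirit.

```python
from datetime import datetime, timedelta

SOL = timedelta(hours=24, minutes=39, seconds=35.244)   # one Martian solar day


def noon_times(first_noon: datetime, count: int) -> list[datetime]:
    """Los Angeles clock time of `count` consecutive local solar noons on Mars."""
    return [first_noon + i * SOL for i in range(count)]


def skipped_dates(noons: list[datetime]) -> list:
    """Calendar dates between the first and last noon that contain no noon."""
    seen = {n.date() for n in noons}
    skipped = []
    day = noons[0].date()
    while day < noons[-1].date():
        if day not in seen:
            skipped.append(day)
        day += timedelta(days=1)
    return skipped


# Hypothetical: noon on sol 1865 falls at 23:40 on April 1 in Los Angeles.
HYPOTHETICAL_NOON_SOL_1865 = datetime(2009, 4, 1, 23, 40)

noons = noon_times(HYPOTHETICAL_NOON_SOL_1865, 2)
for sol, noon in enumerate(noons, start=1865):
    print(f"sol {sol}: noon at {noon:%Y-%m-%d %H:%M} Los Angeles time")
print("dates with no sol:", [d.isoformat() for d in skipped_dates(noons)])
```

With that starting time, the next noon lands shortly after midnight on April 3, so no sol gets the April 2 label.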

--Travi

Posted by: nprev Apr 19 2009, 01:56 AM

Travi, thanks. Had to read that twice before I got it, but makes sense.

Posted by: centsworth_II Apr 19 2009, 08:23 PM

QUOTE (tfisher @ Apr 18 2009, 08:35 PM)
A solar day on Mars is a bit longer than a day on Earth: about 24 hours 39 minutes...

Since an Earth day has 1440 minutes and a Mars day is 39 minutes longer than an Earth day, there are about 38 Earth days for every 37 Mars sols.* So it makes mathematical sense that when matching Earth days to Martian sols, a day needs to be "lost" every 37 sols.

*1440/39 = 36.92 (That's the extent of my math prowess, no orbital calculations for me! :D )
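
The same division can be redone in Python with the sol length taken to the second (88775.244 s, the standard figure, not something quoted in this thread). The variable names are just for illustration.

```python
SOL_SECONDS = 88775.244      # one Martian solar day (24 h 39 min 35.244 s)
DAY_SECONDS = 86400.0        # one Earth day

# How much later, on an Earth clock, local noon falls with each passing sol.
drift_per_sol = SOL_SECONDS - DAY_SECONDS

# Sols needed for the drift to add up to one whole Earth day,
# i.e. the average spacing between "lost" calendar dates.
sols_per_lost_day = DAY_SECONDS / drift_per_sol

# The rounded version from the post: 1440 minutes / 39 minutes.
rough_estimate = 1440 / 39

print(f"drift per sol: {drift_per_sol / 60:.2f} minutes")
print(f"sols per lost day: {sols_per_lost_day:.2f}")
print(f"rough estimate: {rough_estimate:.2f}")
```

Including the extra 35 seconds per sol pulls the interval down to a little over 36 sols, so the 37-sol rule of thumb is close but slightly long.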

Posted by: djellison Apr 20 2009, 06:31 PM

Speaking of reboots - the raw image pipeline just flushed :)

Posted by: Sunspot Apr 20 2009, 06:34 PM

...still alive then?

Posted by: djellison Apr 20 2009, 07:29 PM

No commanded remote sensing, but downlinking of older data is occurring on the PCDB

Posted by: Tesheiner Apr 20 2009, 07:38 PM

QUOTE (djellison @ Apr 20 2009, 08:31 PM)
Speaking of reboots - the raw image pipeline just flushed :)

Ahh! Now I can have my Mars fix! :)

Posted by: mhoward Apr 20 2009, 08:07 PM

Well, to look on the bright side: looks like these kinds of delays mean the rovers have time to send back plenty of old navigation images.

Posted by: lyford Apr 21 2009, 02:32 AM

Speaking of delays....
http://www.theonion.com/content/news/nasa_embarks_on_epic_delay?utm_source=a-section


Posted by: mhoward Apr 21 2009, 02:45 AM

The news is... http://www.jpl.nasa.gov/news/news.cfm?release=2009-071 (Via http://twitter.com/jetlab)

Posted by: nprev Apr 21 2009, 02:56 AM

Hmm indeed. Hope we're not heading into the land of complex/unusual failure modes that systems advanced in years too often enter. (The last years of both the F-4 & the C-141 were often quite bizarre in this regard...)

Posted by: Tesheiner Apr 21 2009, 09:11 PM

Emily posted updated information in the Planetary Society blog: http://www.planetary.org/blog/article/00001916/

Posted by: BrianL Apr 21 2009, 10:39 PM

Speaking of failing brains, I read that 3 times before I parsed anything but "Emily Post updated the information..." :blink:

