NASA rover reboots twice over Easter weekend
Deimos
post Apr 14 2009, 12:47 AM
Post #1


Martian Photographer
***

Group: Members
Posts: 352
Joined: 3-March 05
Member No.: 183



NASA rover resurrected twice over Easter weekend

Moderator adding: the JPL press release is here.
monty python
post Apr 14 2009, 06:04 AM
Post #2


Member
***

Group: Members
Posts: 184
Joined: 2-March 06
Member No.: 692



One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload?

And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Brian
Tesheiner
post Apr 14 2009, 07:36 AM
Post #3


Senior Member
****

Group: Moderator
Posts: 4279
Joined: 19-April 05
From: .br at .es
Member No.: 253



> One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission.

Huh? I can't see how reboots would cause wear and tear on the computer. Perhaps it's the opposite: wear and tear on the computer causing reboots.

> Didn't Spirit just have some computer issues before the latest software upload?

See here: Unexpected Behavior

Edited:
> And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Just checked today's imaging plan for Opportunity and it has all signs of a driving sol.
tedstryk
post Apr 14 2009, 11:55 AM
Post #4


Interplanetary Dumpster Diver
****

Group: Admin
Posts: 4404
Joined: 17-February 04
From: Powell, TN
Member No.: 33



I don't think the early boots would have done permanent damage to the computer. Brian, shouldn't you be looking on the bright side of life? :)


Phil Stooke
post Apr 14 2009, 12:34 PM
Post #5


Solar System Cartographer
****

Group: Members
Posts: 10127
Joined: 5-April 05
From: Canada
Member No.: 227



"Brian, shouldn't you be looking on the bright side of life? "

Good one, Ted

Phil


--------------------
... because the Solar System ain't gonna map itself.

Also to be found posting similar content on https://mastodon.social/@PhilStooke
NOTE: everything created by me which I post on UMSF is considered to be in the public domain (NOT CC, public domain)
RoverDriver
post Apr 14 2009, 02:06 PM
Post #6


Member
***

Group: Admin
Posts: 976
Joined: 29-September 06
From: Pasadena, CA - USA
Member No.: 1200



QUOTE (monty python @ Apr 13 2009, 10:04 PM) *
One hopes that Spirit isn't seeing some wear and tear from the computer doing very many reboots at the start of its mission. Didn't Spirit just have some computer issues before the latest software upload?

And how does this affect Oppy? Does it stand down to see if a common software bug could affect it?

Brian


I am on vacation this week (Spring Break with my kids in the PNW) so I do not know what's going on at Gusev. I know Opportunity is driving (forwards!!).

Related to computer booting: I don't think that adds wear and tear. I know of a company that built an empire around computer rebooting. :)

Paolo


--------------------
Disclaimer: all opinions, ideas and information included here are my own, and are not intended to represent the opinion or policy of my employer.
helvick
post Apr 14 2009, 03:59 PM
Post #7


Dublin Correspondent
****

Group: Admin
Posts: 1799
Joined: 28-March 05
From: Celbridge, Ireland
Member No.: 220



I don't think reboots should affect much, but Flash memory does degrade with use. It takes a while, but we are running into fairly large data volumes over the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell, but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there was any significant "hotspot" on the Flash drive that was getting a lot more write activity than the average. However, I suspect that if this were the root cause, Opportunity would be more likely to exhibit the problem, as I'm pretty sure she has delivered more data. And given the use of deep sleep mode, any wear related to the boot process should also hit Opportunity sooner than Spirit, since the former has made much more use of it, IIRC.
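
To put rough numbers on the hotspot worry, here is a minimal back-of-envelope sketch in Python. Every figure in it is an assumption chosen for illustration (a 256 MB flash, 50 GB written in total, and a hypothetical 128 KB region that absorbs 1% of all writes); none of it is mission data.

```python
# Back-of-envelope flash wear estimate.
# All figures below are illustrative assumptions, not mission data.

FLASH_BYTES = 256 * 2**20     # assumed flash capacity: 256 MB
TOTAL_WRITTEN = 50 * 2**30    # assumed lifetime writes: 50 GB ("a few tens of GB")
RATED_CYCLES = 100_000        # endurance figure quoted in the post


def average_cycles(total_written: float, capacity: float) -> float:
    """Write cycles per cell if writes are spread evenly over the device."""
    return total_written / capacity


def hotspot_cycles(total_written: float, hot_fraction: float, hot_bytes: float) -> float:
    """Write cycles seen by a region that absorbs hot_fraction of all writes."""
    return total_written * hot_fraction / hot_bytes


avg = average_cycles(TOTAL_WRITTEN, FLASH_BYTES)
# Hypothetical hotspot: 1% of all writes land in a single 128 KB region.
hot = hotspot_cycles(TOTAL_WRITTEN, 0.01, 128 * 2**10)

print(f"average cycles per cell: {avg:.0f}")
print(f"hotspot cycles:          {hot:.0f}")
print(f"hotspot share of rated endurance: {hot / RATED_CYCLES:.1%}")
```

Under these assumptions the evenly spread case is a few hundred cycles, far below the rating, while the hotspot sees roughly twenty times more. The result is only as good as the guessed hotspot size and write share.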

Here's hoping it was just some freak occurrence of cosmic ray hits.
stevesliva
post Apr 14 2009, 04:41 PM
Post #8


Senior Member
****

Group: Members
Posts: 1578
Joined: 14-October 05
From: Vermont
Member No.: 530



QUOTE (helvick @ Apr 14 2009, 11:59 AM) *
I don't think reboots should affect much, but Flash memory does degrade with use. It takes a while, but we are running into fairly large data volumes over the lifetime of the rovers. I'm pretty sure that the type of Flash memory used in the MERs is good for around 100k write cycles per cell, but five years with a few tens of GB of data throughput in the relatively harsh environment of the Martian surface might be enough to start seeing more frequent transient errors if there was any significant "hotspot" on the Flash drive that was getting a lot more write activity than the average.


Even if the memory doesn't use algorithms that balance write cycling (and a flash architecture usually needs to balance only by sector or page, or whatever the minimum chunk is that can be erased before reprogramming, not by individual bit), it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that. And if there is overhead in the ECC, a single bad bit isn't going to kill the word.
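
As an illustration of balancing by erase block rather than by bit, here is a toy sketch in Python. The `WearLeveler` class and its "pick the least-erased block" policy are hypothetical and invented for this example. Real flash controllers are considerably more involved, and nothing here describes what the rovers actually run.

```python
class WearLeveler:
    """Toy wear leveler: each write goes to the least-erased block."""

    def __init__(self, num_blocks: int) -> None:
        # One erase counter per erase block (the smallest erasable unit).
        self.erase_counts = [0] * num_blocks

    def write_block(self) -> int:
        """Choose a block for the next write and charge it one erase cycle."""
        block = min(range(len(self.erase_counts)),
                    key=self.erase_counts.__getitem__)
        self.erase_counts[block] += 1  # erase-before-program costs one cycle
        return block


leveler = WearLeveler(num_blocks=8)
for _ in range(1000):
    leveler.write_block()

# The counters never differ by more than one cycle.
print(leveler.erase_counts)
```

The point of the sketch is only that the bookkeeping is per block: a thousand writes spread over eight blocks cost each block about an eighth of the cycles.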
ugordan
post Apr 14 2009, 05:28 PM
Post #9


Senior Member
****

Group: Members
Posts: 3648
Joined: 1-October 05
From: Croatia
Member No.: 523



QUOTE (stevesliva @ Apr 14 2009, 06:41 PM) *
And if there is overhead in the ECC, a single bad bit isn't going to kill the word.

Slightly related to this, while browsing through the recent Cassini PDS release info, I noticed they detected a bad spot in one of its SSRs causing double-bit errors (so they're not caught) and various kinds of corruption in the ISS images. They were planning on developing a SW patch to avoid the bad segment.

Memory corruption sucks, doesn't it?


stevesliva
post Apr 14 2009, 07:06 PM
Post #10


Senior Member
****

Group: Members
Posts: 1578
Joined: 14-October 05
From: Vermont
Member No.: 530



QUOTE (ugordan @ Apr 14 2009, 01:28 PM) *
Memory corruption sucks, doesn't it?

Yeah, there's nothing worse than field returns in my business. Hard to bring 'em back from space, though. Sure it's not merely uncorrectable? I'd guess double-bit errors are detectable. (But not necessarily.) It is a good point, though, that once you have an always-bad bit, your overhead *is* shot, and your transient errors will be uncorrectable in all likelihood.
tanjent
post Apr 16 2009, 04:31 PM
Post #11


Member
***

Group: Members
Posts: 214
Joined: 30-December 05
Member No.: 628



QUOTE (stevesliva @ Apr 15 2009, 12:41 AM) *
...it's worth bearing in mind that, like a rover with a 90-day guarantee, each individual flash cell has a 100k (or more) guarantee, but the average flash cell will achieve far more than that. And if there is overhead in the ECC, a single bad bit isn't going to kill the word.


I don't know whether to interpret the 100K cycles "guarantee" as a minimum, mean, median, or even a modal value. But surely the second moment must be important in this sort of problem. If the sigma for expected failure is wide enough around, say, 100K, and if a particular programming operation "samples" from, say, 20KB (WAG) worth of cells, the chance of the program crashing must become significant long before the average cell accumulates 100K read/writes. (I take it ECC refers to some kind of error correction, which probably can catch and correct the early failures if they are rare enough.)
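
The "second moment" argument can be sketched numerically. The Python snippet below assumes, purely for illustration, that cell lifetimes are normally distributed with a mean of 100K cycles and a sigma of 10K, that cells fail independently, and that 20 KB means 163,840 one-bit cells. None of these are measured figures.

```python
import math
from statistics import NormalDist


def p_any_cell_failed(cycles: float, mean: float, sigma: float, n_cells: int) -> float:
    """Probability that at least one of n_cells has worn out by `cycles`.

    Assumes independent cells with normally distributed lifetimes.
    """
    p_cell = NormalDist(mean, sigma).cdf(cycles)
    if p_cell >= 1.0:
        return 1.0
    # 1 - (1 - p)^n, written with log1p/expm1 so tiny p values keep precision.
    return -math.expm1(n_cells * math.log1p(-p_cell))


N_CELLS = 20 * 1024 * 8  # assumed: 20 KB at one bit per cell

for cycles in (40_000, 50_000, 60_000, 100_000):
    p = p_any_cell_failed(cycles, mean=100_000, sigma=10_000, n_cells=N_CELLS)
    print(f"{cycles:>7} cycles: P(at least one dead cell) = {p:.4f}")
```

With these made-up parameters, the chance of at least one dead cell in the sample is about 5% at 50K cycles and over 99% at 60K, well before the average cell reaches 100K. A narrower sigma pushes that point closer to the mean.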
stevesliva
post Apr 16 2009, 06:57 PM
Post #12


Senior Member
****

Group: Members
Posts: 1578
Joined: 14-October 05
From: Vermont
Member No.: 530



If the program is merely reading from those cells, it's not an issue; only writing wears them. So you could use flash as instruction memory that you might update a few times in a mission, and you can use it as a storage repository for photos. Even if you filled the flash every sol, we're not at 2000 yet. What you cannot use it as is RAM, a scratchpad for doing calculations.

Yes, the ECC is there (if it's there) to correct errors in the memory word. For 128 bits, you might write 16 extra syndrome bits that algorithmically allow you to correct a single wrong bit in the 128. To my knowledge, the ECC isn't there to correct the hard errors that come with exceeding write cycling; it's there to correct errors that just happen on occasion, in fantastically mind-boggling, flash-specific ways. But it would help cover up hard errors.
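
For a concrete feel of "correct one bit, detect two", here is a minimal sketch of an extended Hamming (SECDED) code in Python. It protects only 4 data bits with 4 check bits, which is far smaller than the 128+16 arrangement mentioned above, and it is not a description of the ECC actually used on any spacecraft. The function names are made up for this example.

```python
from __future__ import annotations


def encode(nibble: int) -> list[int]:
    """Encode 4 data bits as an 8-bit extended Hamming (SECDED) codeword."""
    d = [(nibble >> i) & 1 for i in range(4)]
    code = [0] * 8  # index 0: overall parity; indices 1..7: Hamming(7,4)
    code[3], code[5], code[6], code[7] = d
    code[1] = code[3] ^ code[5] ^ code[7]  # covers positions 1, 3, 5, 7
    code[2] = code[3] ^ code[6] ^ code[7]  # covers positions 2, 3, 6, 7
    code[4] = code[5] ^ code[6] ^ code[7]  # covers positions 4, 5, 6, 7
    code[0] = sum(code[1:]) % 2            # makes the total parity even
    return code


def decode(code: list[int]) -> tuple[str, int | None]:
    """Return (status, nibble); status is 'ok', 'corrected' or 'uncorrectable'."""
    code = list(code)
    syndrome = 0
    for pos in range(1, 8):
        if code[pos]:
            syndrome ^= pos  # XOR of the positions holding a 1
    parity_ok = sum(code) % 2 == 0
    if syndrome == 0 and parity_ok:
        status = "ok"
    elif not parity_ok:
        # Odd parity: treat it as a single flip at index `syndrome`
        # (index 0 is the overall parity bit itself) and repair it.
        code[syndrome] ^= 1
        status = "corrected"
    else:
        # Nonzero syndrome with even parity: two bits flipped.
        return "uncorrectable", None
    nibble = code[3] | code[5] << 1 | code[6] << 2 | code[7] << 3
    return status, nibble


word = encode(0b1011)
one_flip = list(word)
one_flip[5] ^= 1
two_flips = list(one_flip)
two_flips[2] ^= 1

print(decode(word))       # clean word
print(decode(one_flip))   # single-bit error is repaired
print(decode(two_flips))  # double-bit error is flagged, not repaired
```

This also shows why an always-bad bit uses up the overhead: once one position is permanently wrong, any additional transient flip in the same word becomes a double error, which this kind of code can flag but not fix. Three or more flips can be miscorrected silently.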

To guarantee 100K cycles, you have to bear in mind that, yes, you might be making this guarantee for over 16 billion cells on a 16Gb chip. So if your guarantee for your typical statistical cell meets that to even 10 sigma--or whatever one in a billion cells not meeting the spec would mean--you're still going to get fails on that chip. What they do to spec 100K would be a combination of test (throw out entire bad chips), redundant cells and repair (find the bad bits and fix them... how you find suspect bits without destroying a chip-- top secret), and the aforementioned ECC if your process engineers can't totally solve this particular problem. And yeah, you might still get a cell in an iPod somewhere that goes bad before its time, but the stats guys are trying really hard to ensure that that is extremely rare by eliminating the tail of the distribution. My point was that the actual center of the distribution is still going to be somewhere far far above 100K to make this guarantee.
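
To see how many sigma "one in a billion" is, and why even that leaves failures on a 16Gb part, here is a short Python sketch. It assumes a one-sided normal tail and one cell per bit, both of which are simplifications introduced for this example.

```python
import math
from statistics import NormalDist


def expected_early_failures(n_cells: float, z_margin: float) -> float:
    """Expected number of cells below spec when the spec limit sits
    z_margin standard deviations below the mean (one-sided normal tail)."""
    tail = 0.5 * math.erfc(z_margin / math.sqrt(2))
    return n_cells * tail


def margin_for_target(n_cells: float, target_failures: float) -> float:
    """Sigma margin needed so that target_failures cells are expected to fail."""
    return -NormalDist().inv_cdf(target_failures / n_cells)


N_CELLS = 16 * 2**30  # assumed: a 16Gb chip storing one bit per cell

print(f"sigma margin for one-in-a-billion: {margin_for_target(1e9, 1):.2f}")
for z in (5.0, 6.0, 7.0, 10.0):
    print(f"{z:>4} sigma margin: {expected_early_failures(N_CELLS, z):.3g} weak cells expected")
```

On these assumptions, one in a billion is about a 6 sigma margin, and a chip with roughly 17 billion cells would still be expected to contain a handful of weak cells at that margin. That is where test, repair and ECC come in.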

Just delivering a memory chip that works from Day 0 is a similar game of stats... even if your process engineers deliver a process where only one in a million cells is failing a spec, every single 1Gb chip would have on average 1000 bad cells! So after manufacturing, there is a lot of test to be done to fix things and eliminate those fliers. At the same time, there are 900 million cells that greatly exceed the spec.
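
The day-0 arithmetic can be checked in a few lines of Python. The sketch assumes independent cell failures and a 1Gb chip of 2**30 cells. The one-in-a-million failure rate is the figure from the paragraph above.

```python
import math

N_CELLS = 2**30   # assumed: a 1Gb chip at one bit per cell
P_BAD = 1e-6      # one in a million cells fails the spec

expected_bad = N_CELLS * P_BAD
# log10 of the probability that every cell on a chip is good, (1 - p)^n.
# Computed in log space because the value itself underflows a float.
log10_p_perfect = N_CELLS * math.log1p(-P_BAD) / math.log(10)

print(f"expected bad cells per chip: {expected_bad:.0f}")
print(f"log10 P(chip with no bad cells): {log10_p_perfect:.0f}")
```

The expected count lands near the 1000 quoted above, and the chance of an untouched perfect chip is so small that essentially every part needs repair or screening.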
fredk
post Apr 17 2009, 03:34 AM
Post #13


Senior Member
****

Group: Members
Posts: 4246
Joined: 17-January 05
Member No.: 152



A few details on the Spirit anomalies in the new update.
Nomadd22
post Apr 18 2009, 08:10 PM
Post #14


Newbie
*

Group: Members
Posts: 13
Joined: 6-April 09
Member No.: 4720



Guarantee is not a term you'd use for a Martian rover. It's a business term. The chips would probably be rated with something like a mean time between failures, or a mean time before failure. Age can be a factor, since almost all mechanical failures are from thermal cycles. And of course you have random failures that are as likely on day one as on day five thousand.
I'd guess they could map out physical bit failures in memory, but don't really know if that was included.
ilbasso
post Apr 19 2009, 01:03 AM
Post #15


Member
***

Group: Members
Posts: 753
Joined: 23-October 04
From: Greensboro, NC USA
Member No.: 103



QUOTE (fredk @ Apr 16 2009, 10:34 PM) *
A few details on the Spirit anomalies in the new update.


Can someone please explain in clearer English this extract from the above-referenced update: "no sol number for Spirit corresponded to April 2, 2009, using the criterion of the date in Los Angeles at local solar noon on Mars"?

Thanks,
Jonathan
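
One way to read the quoted sentence: a Mars sol lasts about 24 h 39 min 35 s, so local solar noon on Mars falls about 40 minutes later each Earth day, and every so often it jumps over an entire Earth calendar date. The Python sketch below illustrates the effect with a made-up starting time and a fixed-length Earth day. It ignores time zones and daylight saving, so it does not reproduce the actual April 2, 2009 case.

```python
SOL_SECONDS = 88775.244  # mean length of a Mars solar day
DAY_SECONDS = 86400      # length of an Earth day


def skipped_earth_dates(first_noon_s: float, n_sols: int) -> list[int]:
    """Earth day indices on which no Mars local noon falls.

    first_noon_s is the (hypothetical) time of the first local noon,
    in seconds after midnight of Earth day 0.
    """
    noon_days = {
        int((first_noon_s + k * SOL_SECONDS) // DAY_SECONDS)
        for k in range(n_sols)
    }
    return [d for d in range(min(noon_days), max(noon_days) + 1)
            if d not in noon_days]


skipped = skipped_earth_dates(first_noon_s=12 * 3600, n_sols=1000)
print(f"{len(skipped)} Earth dates had no local noon; first few: {skipped[:3]}")
```

In this toy setup an Earth date is skipped roughly every 36 to 37 sols. By the local-noon criterion, no sol number gets attached to such a date.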


--------------------
Jonathan Ward
Manning the LCC at http://www.apollolaunchcontrol.com
