Long-Term System Autonomy
by Robert Freeland II
Introduction
In the thirty years since the original Daedalus papers were published, mankind has made huge strides in most areas of technology relevant to the project. We have more computing power in our cell phones now than even the biggest super-computers could muster back then; we have discovered new materials with exotic properties; we seem to be on the brink of controlled, exoenergetic fusion; and we’re discovering new extra-solar gas giants almost monthly. It’s easy to suppose that anything which could be envisioned by the Daedalus team in 1978 is now easily accomplished.
There is one aspect of the Daedalus design, though, that still seems confounding even by today’s standards: the requirement that the ship survive without human intervention for upwards of fifty years. The original Daedalus team calculated that the ship would experience a failure of some sort on average every hour. It was easy to see that such a failure rate could not be addressed via redundancy because the mass of the ship would become enormous. The Daedalus team instead proposed an advanced automated repair system which was capable of identifying problems, devising solutions, and implementing those solutions. Critical to this design was a pair of fully autonomous repair droids (“wardens”) which were able to access all parts of the ship, interior and exterior.
My colleague Philipp Reiss has recently published a paper on this blog discussing the motivation for automated repair and some of the challenges associated with its implementation. I wanted to take a step back and examine the requirements for such a system to succeed.
Mathematics of Failure
The original Daedalus team estimated that something on Daedalus would fail approximately every hour for the fifty-year duration of the journey. That’s 50*365*24 = 438,000 individual failures that have to be “handled” over the lifetime of the journey. Assuming that just half of those failures are “critical”, and that the wardens’ response to any single failure is 99.9% accurate, the probability of overall mission success is then:
.999 ^ (438,000/2) = 7E-94%
That’s essentially zero. Assuming instead that the wardens have a six-sigma success rate (99.99966%) with all repairs, the probability of mission success is still just 47.5%. I can’t even begin to imagine an intelligence (human or otherwise) that can handle hundreds of thousands of random, disparate failures with a six-sigma success rate. That really is extraordinary. (In fact, the percentage above is “six sigma” in the corporate parlance, which is actually only 4.5 sigma.) The inescapable conclusion is that we need a much higher mean time between failures (MTBF) system-wide for the mission to be possible.
To take it a step further, we can express a generic formula for the relationship among the MTBF (in days), a criticality factor (F), the wardens’ success rate in handling failures (WSR), the mission duration in years (D) and the overall mission success rate (MSR):
MSR = WSR ^ (F*D*365 / MTBF)
We can also invert this calculation. Given a target rate (MSR) for overall mission success, and given a certain mission duration (D) and a certain criticality rate (F), we can pick some target MTBF values and WSR values with which we’re comfortable:
WSR = e^ [ ln(MSR) / (F*D*365 / MTBF) ]
Given MSR = 99%, D = 50 years, and F = 50%, it seems to me that the only values that make sense involve an MTBF that’s really high, on the order of one year between failures, with only one critical failure every other year:
WSR = e^ [ ln(99%) / (50%*365*50 / 365) ]
WSR = 99.96% success rate
That is to say, for the overall mission to be 99% certain of success, the ship has to experience no more than 25 critical failures over the entire course of the mission, and the warden / repair system has to handle each failure with a 99.96% success rate. Reducing the desired success rate for the overall mission to 80% relaxes the required accuracy of the wardens considerably: the allowable failure rate per repair rises by roughly a factor of twenty, from about 0.04% to about 0.89%:
WSR = e^ [ ln(80%) / (50%*365*50 / 365) ]
WSR = 99.11% success rate
Nevertheless, this is still a rather daunting set of requirements. Firstly, we must construct a ship capable of surviving for over a year between failures. Then we must design a fully autonomous system that is capable of identifying problems, devising solutions, and implementing them with a 99% success rate. Even a human team with unlimited resources would have difficulty accomplishing 25 repairs over 50 years with only a 1% chance of failure on each one. Repairs aboard the International Space Station have demonstrated this.
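For readers who want to try other values, the two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this section; the function and parameter names are my own, and the model keeps the same simplifying assumption that every critical failure is an independent event with the same chance of being handled correctly.

```python
import math


def critical_failure_count(mtbf_days, duration_years, criticality):
    """Expected number of critical failures: F * D * 365 / MTBF (MTBF in days)."""
    return criticality * duration_years * 365 / mtbf_days


def mission_success_rate(wsr, mtbf_days, duration_years, criticality):
    """MSR = WSR ^ (F * D * 365 / MTBF)."""
    return wsr ** critical_failure_count(mtbf_days, duration_years, criticality)


def required_warden_success_rate(msr, mtbf_days, duration_years, criticality):
    """WSR = e ^ [ ln(MSR) / (F * D * 365 / MTBF) ]."""
    n = critical_failure_count(mtbf_days, duration_years, criticality)
    return math.exp(math.log(msr) / n)


if __name__ == "__main__":
    one_hour = 1 / 24  # an MTBF of one hour, expressed in days

    # Daedalus-style failure rate with "six sigma" wardens
    print(mission_success_rate(0.9999966, one_hour, 50, 0.5))

    # One failure per year, targeting 99% and 80% mission success
    print(required_warden_success_rate(0.99, 365, 50, 0.5))
    print(required_warden_success_rate(0.80, 365, 50, 0.5))
```

The three printed values should come out near 0.475, 0.9996 and 0.9911, matching the figures quoted above.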
Analogies
It is difficult to name existing man-made systems (mechanical, electronic, or computerized) that have run for decades without human intervention. The latter criterion is really the kicker: most machines that we build are either maintained on an ongoing basis or fall rapidly into disrepair. Most really aren’t designed to maintain themselves. Analogous systems are most likely found in environments that discourage human intervention.
Space:
– Previous Probes: As Philipp Reiss states in his recent blog article, “The maximum lifetime of the Pioneer probes is 31 years (Pioneer 10) and 22 years (Pioneer 11). For the Voyager probes 1 and 2 it is currently 33 years and expected to become up to 48 years until the end of their functionality in the 2020’s.” I don’t really know how much human interaction (in the form of uploaded code, new instructions, etc.) was involved with these probes post-launch, but several have clearly been “out there” for decades without any physical maintenance. That’s promising.
– Satellites: Many satellites have also been in orbit for decades, but again I don’t know how much human interaction they receive from the ground. Few receive physical repairs.
– Space Shuttle & International Space Station: These require constant human interaction, and they still seem to suffer a lot of problems that require new parts to be sent up from Earth. So they’re not really good examples at all.
Nautical:
– Lighthouses & Marks: Most remaining lighthouses are unattended, as are all of the lighted marks along the U.S. coasts. These devices generally consist of a light, a timer, a battery, a solar panel, and a support structure. Although these are unattended, the U.S. Coast Guard does maintain them, but I have no idea what the mean time between failures is. The newer lights use LEDs that last for decades without problems, but I reckon the batteries, at least, have to be replaced every few years.
– Underwater Sonar Arrays: The U.S. Government has set sonar sensors all over the Atlantic and Pacific ocean floors to track ship movement along America’s coasts. I have no idea what kind of maintenance this system requires, but it stands to reason that such maintenance would be very difficult, and thus is presumably rather infrequent.
Underground:
– Oil Wells: I remember driving through Texas when I was a teenager and seeing miles and miles of oil wells pumping away in an otherwise desolate landscape. The things certainly gave the impression that they were running unattended, but I imagine that someone must have been checking on them periodically. How often? Today, I’ll bet they’re all monitored 24×7 via remote sensors. Still, it’s an interesting analogy.
Archaeological:
– Roman Aqueducts: Some of these have been carrying water for close to 2000 years. I’m sure that the farmers who benefit from these aqueducts have maintained them over those years, but it’s certainly reasonable to assume that some aqueducts have gone for decades without any maintenance. It’s difficult to come up with other archaeological examples because most of what we find are just structures rather than machines. The aqueducts, in fact, just barely qualify as simple machines (essentially just ramps).
Medical:
– Pacemakers routinely run for a decade or two without any intervention.
– There’s at least one case of a vascular assist device that lasted for 25 years before the patient died of unrelated causes.
Random Electronics:
– Casio Watch: I have a Casio digital diving watch that I bought back in 1985. Aside from replacing the band every few years, the only maintenance I’ve provided is a new battery every 7 years. So not counting the band, that’s three “incidents” in 25 years, which is not bad.
– Casio Calculator: I have a solar-assisted Casio calculator that I bought back in 1983. I have replaced the battery in it exactly once in 27 years, and that was maybe 7 years ago, so it went a good 20 years with no maintenance whatsoever.
– Adam Crowl noted on our forums that a lot of electronic components used by the U.S. Department of Defense have an MTBF of ~10 million hours or so. “The real trick is combining them and getting a MTBF for the whole to be a significant fraction of the individual.”
Those two Casios are the only self-contained electronic devices that I’ve had for decades and that still work. Almost everything else plugs into a wall jack, meaning that it has been “down for maintenance” every time I’ve moved. Few of those items (toasters, microwaves, coffee makers, TVs, etc.) have even remained working for the intervening decades.
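Adam Crowl’s point about combining components can be made concrete with a rough sketch. It assumes the simplest textbook case, which is my assumption rather than anything from the Daedalus papers: every component is identical, fails independently at a constant rate, and any single component failure counts as a failure of the whole system (a “series” arrangement with no redundancy). Under those assumptions the failure rates simply add, so the system MTBF is the component MTBF divided by the number of components. The component counts are purely illustrative.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours


def series_mtbf_hours(component_mtbf_hours, component_count):
    """System MTBF for identical, independent components in series.

    Failure rates add, so MTBF_system = MTBF_component / N.
    """
    return component_mtbf_hours / component_count


def max_components_for_target(component_mtbf_hours, target_mtbf_hours):
    """Largest component count that still meets the target system MTBF."""
    return int(component_mtbf_hours // target_mtbf_hours)


if __name__ == "__main__":
    component_mtbf = 10_000_000  # ~10 million hours, as quoted above

    # Ten million such components give roughly one failure per hour
    print(series_mtbf_hours(component_mtbf, 10_000_000))

    # Component budget for about one failure per year
    print(max_components_for_target(component_mtbf, HOURS_PER_YEAR))
```

In this simplified picture, a ship built from ten million such components would fail about once an hour, while a one-failure-per-year target allows only about 1,100 of them without redundancy.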
Conclusion
For the Icarus probe to succeed, it must be designed and constructed in such a way that failures happen on the order of years apart. Moreover, the probe must include a completely autonomous system that is capable of addressing these failures (whether via redundancy or repair) with a better than 99% success rate. These are extraordinary requirements that stretch the limits of what mankind has thus far achieved, and may in fact represent the single biggest hurdle to the actual launch of an Icarus-type mission.

Thank you for this interesting post. But how is the mean time between failures of 1 hour evaluated? Would all these failures need immediate maintenance? If not, it could be possible to imagine ground support from Earth even with a time delay of several years. For instance, if during an in-flight test, a flaw is detected in the GNC software dedicated to the final orbit insertion, it would be possible to correct the code and upload it before the arrival. I know, it sounds a bit far-fetched but, considering the constraints on the maintenance system, some help from Earth would be welcome.
I second destop’s first question. What was the REASON for the every-one-hour failure rate? A space shuttle doesn’t usually get repaired during a 12-day mission. Would Icarus really have 17,280 TIMES more parts than an STS system? Unlikely. Rather, if there is any logical basis to the every-one-hour failure rate, it probably has to do with the fact that the ship is traveling so fast that it runs into a micrometeorite once every hour, or that the calculated radiation flux would cause a small component to fail every hour. If this is the case, then a better shielding solution might bring the failure rate down.
Second, there’s still the issue of how long the acceleration phase will be. How long will the engines be running? If you are discharging fusion pellets then I could imagine a lot of constantly running mechanical equipment that would need to function properly for extended periods of time. But between when the engine stops and when you enter the target star system, what does the craft need to be doing? Course corrections, perhaps, but not continuously. The only things which need to run throughout the duration of the flight would be a clock and maybe heating elements.
As for computer circuits, memory, and programs, I could imagine triple-redundant FPGAs which would periodically reprogram modular components of programs to those parts of circuit boards which have not been destroyed by cosmic rays, radiation flux, or secondary particles.
Hi John,
Thanks for your message. I’ll let Rob answer your question on failures. Regarding your question about the duration of the acceleration phase, I think it’s important to remember that Project Icarus is only in month 10 of a 5-year project, so details such as this are yet to be determined.
Richard
Hi destop, John,
When the Daedalus team completed its study 32 years ago, technology was very different. The Daedalus ship was far more complex than anything mankind had yet created, so the Daedalus team made an educated guess at the failure rates. It’s clear now, though, that failure rates this high are unsustainable, even with a “near-future” automated repair system. That’s the crux of my post.
Regarding ground-based human intervention: this will be possible in the early period of the mission (perhaps the first year or so), but once communication times are measured in months, I would argue that any failures addressable with this kind of lag are likely “non-critical”. We can envision scenarios where “critical” failures might be handled by ground control before they torpedo the mission, but these cases will tend to be outliers. I applied a rather arbitrary criticality factor of 50% to the analysis above, so errors in my guess at this factor will likely swamp any consideration of outlier cases.
As Richard noted, we don’t know yet how long the acceleration phase will last, but we’ll likely have distinct acceleration, coasting, and deceleration phases. The ship will be under less stress during the coasting phase, but it won’t be inert. We anticipate ongoing communication with Earth, and we hope to be able to carry out some extensive astronomic measurements. As the ship nears its target, early observations of planetary orbits will become critical. The shielding has to keep working, the fuel tanks have to remain pressurized, the power supplies have to keep working, optics have to remain clear, and the electronics have to function. Even during coasting, there are lots of opportunities for critical failures.
This analysis would be improved by breaking the mission into distinct phases (as noted above) and calculating failure rates for each phase, but that will have to come later once we have a better idea of what the phases are and how long each will last.
- Robert
Would you guys mind looking into the issue of what the main cause of such a high failure rate was? Without knowing this, any analysis might be completely irrelevant.
Daedalus’ one-every-hour failure rate estimate may have its origins in the unreliable electronics of the time. In the 1960s, some pessimists thought that _interplanetary_ probes would be impossible, because no complex electronic device could operate for 6 months or more without some component failing. This was before integrated circuits – they were thinking in terms of big boards covered in discrete resistors, transistors, etc. It’s easy to forget how far Moore’s Law has brought us since then, in reliability as well as miniaturization.
Re human intervention: this is not limited to the early part of the mission. A probe travelling at 0.1c has enough time to send a message to Earth, and get a reply before reaching its target, for 80% of its cruise phase. True autonomy is required only in the final stages. This is comparable to some outer-planet or asteroid missions: an encounter with a previously unknown body may unfold over a few hours, a few light-hours from Earth; so the spacecraft must conduct it without human help.
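The 80% figure can be checked with a back-of-the-envelope sketch. The assumptions here are mine: the probe cruises at a constant fraction of light speed for the whole trip (no acceleration or deceleration phases), Earth replies instantly, and a reply still counts if it reaches the probe just as it arrives at the target. A message sent when the probe has covered a fraction x of the distance then gets a reply in time as long as x is no more than (1 - v/c) / (1 + v/c). The function name is hypothetical.

```python
def round_trip_fraction(beta):
    """Fraction of the cruise during which a query to Earth can still be answered.

    beta is the cruise speed as a fraction of the speed of light (0 <= beta < 1).
    Sending at fraction x of the distance, the signal needs x (in units of d/c)
    to reach Earth and the reply needs up to 1 to reach the target, while the
    probe needs (1 - x) / beta to get there. Solving x + 1 <= (1 - x) / beta
    gives x <= (1 - beta) / (1 + beta).
    """
    if not 0 <= beta < 1:
        raise ValueError("beta must be in the range [0, 1)")
    return (1 - beta) / (1 + beta)


if __name__ == "__main__":
    print(round_trip_fraction(0.1))  # cruise at 0.1c
```

At 0.1c this gives roughly 0.82, in line with the 80% quoted above; a faster probe would have a shorter window.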
I reviewed the original Daedalus documents, and there’s a table in the “Engineering Assessment” document, Section #7 “Maintenance Liabilities” that addresses this question. The Daedalus team anticipated that “Data Management” would generate over half the total defects for the mission. Second is “Experiments” with 22%, and third (ironically) is “On-Board Repair” with 10%. So yes, I think that much of this is rooted in “the unreliable electronics of the time”, as Geoff puts it.
While you speak of failure rates of 1/yr. for the ship, what is the impact of failure rates of the ‘wardens’ themselves, since there are far fewer of them?