Long-Term System Autonomy
by Robert Freeland II
In the thirty years since the original Daedalus papers were published, mankind has made huge strides in most areas of technology relevant to the project. We have more computing power in our cell phones now than even the biggest supercomputers could muster back then; we have discovered new materials with exotic properties; we seem to be on the brink of controlled, exoenergetic fusion; and we’re discovering new extra-solar gas giants almost monthly. It’s easy to suppose that anything which could be envisioned by the Daedalus team in 1978 is now easily accomplished.
There is one aspect of the Daedalus design, though, that still seems confounding even by today’s standards: the requirement that the ship survive without human intervention for upwards of fifty years. The original Daedalus team calculated that the ship would experience a failure of some sort on average every hour. It was easy to see that such a failure rate could not be addressed via redundancy because the mass of the ship would become enormous. The Daedalus team instead proposed an advanced automated repair system which was capable of identifying problems, devising solutions, and implementing those solutions. Critical to this design was a pair of fully autonomous repair droids (“wardens”) which were able to access all parts of the ship, interior and exterior.
My colleague Philipp Reiss has recently published a paper on this blog discussing the motivation for automated repair and some of the challenges associated with its implementation. I wanted to take a step back and examine the requirements for such a system to succeed.
Mathematics of Failure
The original Daedalus team estimated that something on Daedalus would fail approximately every hour for the fifty-year duration of the journey. That’s 50*365*24 = 438,000 individual failures that have to be “handled” over the lifetime of the journey. Assuming that just half of those failures are “critical,” and that the wardens’ response to any single critical failure is 99.9% accurate, the probability of overall mission success is then:
.999 ^ (438,000/2) = 7E-94%
That’s essentially zero. Assuming instead that the wardens have a six-sigma success rate (99.99966%) with all repairs, the probability of mission success is still just 47.5%. I can’t even begin to imagine an intelligence (human or otherwise) that can handle hundreds of thousands of random, disparate failures with a six-sigma success rate. That really is extraordinary. (In fact, the percentage above is “six sigma” in the corporate parlance, which is actually only 4.5 sigma.)
The inescapable conclusion is that we need a much higher mean time between failures (MTBF) system-wide, that is, far fewer failures, for the mission to be possible. To take it a step further, we can express a generic formula for the relationship among the MTBF (in days), a criticality factor (F, the fraction of failures that are critical), the wardens’ success rate in handling failures (WSR), the mission duration in years (D), and the overall mission success rate (MSR):
MSR = WSR ^ (F*D*365 / MTBF)
We can also invert this calculation. Given a target rate (MSR) for overall mission success, a certain mission duration (D), and a certain criticality factor (F), we can pick target MTBF and WSR values with which we’re comfortable:
WSR = e^ [ ln(MSR) / (F*D*365 / MTBF) ]
Given MSR = 99%, D = 50 years, and F = 50%, it seems to me that the only values that make sense involve an MTBF that’s really high, on the order of one year between failures (MTBF = 365 days), with only one critical failure every other year.
WSR = e^ [ ln(99%) / (50%*365*50 / 365) ]
WSR = 99.96% success rate
That is to say, for the overall mission to be 99% certain of success, the ship has to experience no more than 25 critical failures over the entire course of the mission, and the warden / repair system has to handle each failure with a 99.96% success rate. Reducing the desired success rate for the overall mission to 80% relaxes the wardens’ allowable failure rate per repair by a factor of roughly twenty, from about 0.04% to about 0.89%:
WSR = e^ [ ln(80%) / (50%*365*50 / 365) ]
WSR = 99.11% success rate
Nevertheless, this is still a rather daunting set of requirements. Firstly, we must construct a ship capable of surviving for over a year between failures. Then we must design a fully autonomous system that is capable of identifying problems, devising solutions, and implementing them with a 99% success rate. Even a human team with unlimited resources would have difficulty accomplishing 25 repairs over 50 years with only a 1% chance of failure on each one. Repairs aboard the International Space Station have demonstrated this.
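For readers who want to try other values, the two formulas above can be sketched in a few lines of Python. This is only an illustration of the arithmetic in this article: the function names are my own, the year is taken as 365 days as in the text, and each critical repair is assumed to succeed or fail independently of the others.

```python
import math

DAYS_PER_YEAR = 365  # same simplification the article's arithmetic uses


def critical_failures(f, d_years, mtbf_days):
    """Expected number of critical failures: F * D * 365 / MTBF."""
    return f * d_years * DAYS_PER_YEAR / mtbf_days


def mission_success_rate(wsr, f, d_years, mtbf_days):
    """MSR = WSR ^ (F*D*365 / MTBF), assuming independent repairs."""
    return wsr ** critical_failures(f, d_years, mtbf_days)


def required_warden_success_rate(msr, f, d_years, mtbf_days):
    """WSR = e ^ [ ln(MSR) / (F*D*365 / MTBF) ], the inverse of the above."""
    return math.exp(math.log(msr) / critical_failures(f, d_years, mtbf_days))


if __name__ == "__main__":
    one_hour = 1 / 24  # an MTBF of one hour, expressed in days

    # Daedalus-style failure rate with "six sigma" wardens: about 47.5%
    print(f"{mission_success_rate(0.9999966, 0.5, 50, one_hour):.1%}")

    # One failure per year, 99% mission target: about 99.96% per repair
    print(f"{required_warden_success_rate(0.99, 0.5, 50, 365):.2%}")

    # Same ship, 80% mission target: about 99.11% per repair
    print(f"{required_warden_success_rate(0.80, 0.5, 50, 365):.2%}")
```

Changing the MTBF argument is the quickest way to see how strongly the required warden accuracy depends on how often things break in the first place.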
It is difficult to name existing man-made systems (mechanical, electronic, or computerized) that have run for decades without human intervention. The latter criterion is really the kicker: most machines that we build are either maintained continuously or fall rapidly into disrepair. Most really aren’t designed to maintain themselves. Analogous systems are most likely found in environments that discourage human intervention.
Space:
– Previous Probes: As Philipp Reiss states in his recent blog article, “The maximum lifetime of the Pioneer probes is 31 years (Pioneer 10) and 22 years (Pioneer 11). For the Voyager probes 1 and 2 it is currently 33 years and expected to become up to 48 years until the end of their functionality in the 2020’s.” I don’t really know how much human interaction (in the form of uploaded code, new instructions, etc.) was involved with these probes post-launch, but several have clearly been “out there” for decades without any physical maintenance. That’s promising.
– Satellites: Many satellites have also been in orbit for decades, but again I don’t know how much human interaction they receive from the ground. Few receive physical repairs.
– Space Shuttle & International Space Station: These require constant human interaction, and they still seem to suffer a lot of problems that require new parts to be sent up from Earth. So they’re not really good examples at all.
Nautical:
– Lighthouses & Marks: Most remaining lighthouses are unattended, as are all of the lighted marks along the U.S. coasts. These devices generally consist of a light, a timer, a battery, a solar panel, and a support structure. Although these are unattended, the U.S. Coast Guard does maintain them, but I have no idea what the mean time between failures is. The newer lights use LEDs that last for decades without problems, but I reckon the batteries, at least, have to be replaced every few years.
– Underwater Sonar Arrays: The U.S. Government has set sonar sensors all over the Atlantic and Pacific ocean floors to track ship movement along America’s coasts. I have no idea what kind of maintenance this system requires, but it stands to reason that such maintenance would be very difficult, and thus is presumably rather infrequent.
Underground:
– Oil Wells: I remember driving through Texas when I was a teenager and seeing miles and miles of oil wells pumping away in an otherwise desolate landscape. The things certainly gave the impression that they were running unattended, but I imagine that someone must have been checking on them periodically. How often? Today, I’ll bet they’re all monitored 24×7 via remote sensors. Still, it’s an interesting analogy.
Archaeological:
– Roman Aqueducts: Some of these have been carrying water for close to 2000 years. I’m sure that the farmers who benefit from these aqueducts have maintained them over those years, but it’s certainly reasonable to assume that some aqueducts have gone for decades without any maintenance. It’s difficult to come up with other archaeological examples because most of what we find are just structures rather than machines. The aqueducts, in fact, just barely qualify as simple machines (essentially just ramps).
Medical:
– Pacemakers routinely run for a decade or two without any intervention.
– There’s at least one case of a vascular assist device that lasted for 25 years before the patient died of unrelated causes.
Random Electronics:
– Casio Watch: I have a Casio digital diving watch that I bought back in 1985. Aside from replacing the band every few years, the only maintenance I’ve provided is a new battery every 7 years. So not counting the band, that’s three “incidents” in 25 years, which is not bad.
– Casio Calculator: I have a solar-assisted Casio calculator that I bought back in 1983. I have replaced the battery in it exactly once in 27 years, and that was maybe 7 years ago, so it went a good 20 years with no maintenance whatsoever.
– Adam Crowl noted on our forums that a lot of electronic components used by the U.S. Department of Defense have a MTBF of ~10 million hours or so. “The real trick is combining them and getting a MTBF for the whole to be a significant fraction of the individual.”
Those two Casios are the only self-contained electronic devices that I’ve had for decades and that still work. Most everything else plugs into a wall jack, meaning that it has been “down for maintenance” every time I’ve moved. Few of those items (toasters, microwaves, coffee makers, TVs, etc.) have even remained working for the intervening decades.
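Adam Crowl’s point about combining components can be made concrete with a rough sketch. The assumptions here are mine, not his: every component is identical, failures are independent and occur at a constant rate, and there is no redundancy, so any single component failure counts as a system failure. Under those textbook assumptions the failure rates simply add, and the system MTBF is the component MTBF divided by the number of components. The function names are hypothetical.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, using the article's 365-day year


def series_mtbf(component_mtbf_hours, n_components):
    """System MTBF for n identical, independent components with no redundancy.

    Assumes constant failure rates, so rates add and the MTBF divides by n.
    """
    return component_mtbf_hours / n_components


def max_components(component_mtbf_hours, target_system_mtbf_hours):
    """Largest component count that keeps the system MTBF at or above target."""
    return component_mtbf_hours // target_system_mtbf_hours


if __name__ == "__main__":
    component_mtbf = 10_000_000  # the ~10 million hours quoted above

    # How many such components can a ship have and still average
    # a full year between failures?
    print(max_components(component_mtbf, HOURS_PER_YEAR))

    # System MTBF, in years, for a ship built from a million such parts
    print(series_mtbf(component_mtbf, 1_000_000) / HOURS_PER_YEAR)
```

Under these simplified assumptions, only about a thousand such parts fit within a one-year system MTBF, which suggests why the whole is so much harder than the individual pieces.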
For the Icarus probe to succeed, it must be designed and constructed in such a way that failures happen on the order of years apart. Moreover, the probe must include a completely autonomous system that is capable of addressing these failures (whether via redundancy or repair) with a better than 99% success rate. These are extraordinary requirements that stretch the limits of what mankind has thus far achieved, and may in fact represent the single biggest hurdle to the actual launch of an Icarus-type mission.