Ultra-large-scale systems: fundamentally different?
An emerging trend in software research is a focus on complex software systems. These systems are typically:
- decentralized: a lack of hierarchical control.
- prone to emergent behaviour: unexpected behaviour arising from unexpected interaction effects.
- subject to network effects: a rich mixture of human and computer agents.
- large-scale: millions of lines of code, thousands of agents.
Examples of such systems include the stock market and the US military’s future soldier program. Two major initiatives focus on these systems: the Software Engineering Institute’s ULSS program and the Large-Scale Complex IT Systems (LSCITS) program in the UK. According to these programs, there is a “Software Problem”, and the tools and techniques we use today must be radically improved if they are to manage these new challenges.
But as Peter Norvig illustrates in this article, we have managed to scale from the half-million instruction programs of the 1970s to the hundred-million instruction programs of today. So why do we expect to be unable to handle the billions of instructions in a ULS system? I’m not suggesting we build software perfectly today; many improvements are possible. But I’m not sure I agree there is some fundamental challenge we cannot currently address.
One thing I think is important is the ability to fail gracefully. One lesson that seems to have been learned over the years (and the idea surfaces from the start in SE) is that iteration is fundamental. That means building systems such that failure is not fatal. Naturally, in some places that isn’t possible (medical devices perhaps). But even in these seemingly obvious examples, it should be possible to fail usefully: to alert someone to the problem, to support remote software updates, or to incorporate more adaptation capability.
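A minimal Python sketch of what "failing usefully" could look like in code. Everything here is an assumption made for illustration: `fail_usefully`, `read_sensor` and the `alerts` list are hypothetical names, and a real system would route the alert to a person or a monitoring service rather than to a list.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def fail_usefully(operation: Callable[[], T], fallback: T,
                  alert: Callable[[str], None]) -> T:
    """Run operation; if it fails, alert someone and return a safe fallback."""
    try:
        return operation()
    except Exception as exc:  # deliberately broad: failure must not be fatal
        name = getattr(operation, "__name__", "operation")
        alert(f"{name} failed: {exc!r}")
        return fallback


def read_sensor() -> float:
    # Hypothetical component that is currently broken.
    raise RuntimeError("sensor offline")


# Stand-in for a real alerting channel (email, pager, dashboard).
alerts = []

# The caller keeps running with a fallback value instead of crashing.
reading = fail_usefully(read_sensor, fallback=0.0, alert=alerts.append)
```

The design choice is that the failure is reported and contained at the boundary of the failing component, so the rest of the system can carry on (or adapt) while someone is told what went wrong.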
The point of this post is to ask whether the statement “Our ability to develop, maintain and manage such systems is falling behind the growth in their complexity” has any evidence behind it. Standish reports need not apply.
I think the issue is not so much that these systems are “ultra” large (whatever “ultra” means), but that they are increasingly complex, and a system’s behaviour and understandability change with its complexity. So, the argument goes, it’s not that we’re dealing with instructions that are orders of magnitude more numerous, but that they are supposed to serve a function in sociotechnical systems that are orders of magnitude more complex.
Jorge Aranda
2011 September 30 at 16:36
I don’t agree that this increasing complexity is anything *fundamentally* new. Arguably, someone from the 80s would be incredibly surprised at the complexity of software today – e.g., a fly-by-wire system in an airplane, even a portable phone that can direct me to the nearest Starbucks. I’m a software optimist: I think by and large software has dealt with some enormously challenging systems very capably.
Neil
2011 September 30 at 17:52
The challenge is not in size. I guess the fundamental difference is in real agent-orientation: every agent is a locus of control, there is no (hierarchical) centralized control, and what actually matters is the interaction among these “agents”. Therefore, traditional algorithms and design methodologies turn out to be conceptually inadequate.
Fabiano
2011 October 11 at 04:53
Someone should put a measurable theory in place then, and we can test it to see whether what you postulate is true. So far ULSS seems like a bunch of hand-waving and unchallenged assumptions. This is all too prevalent in academic research.
Neil
2011 October 11 at 08:59
The key challenge for LSCITS and ULS style initiatives is to develop a science and engineering of software systems that enables the prediction, identification, management and troubleshooting of emergent behaviour.
The fundamental problem is that current software engineering techniques do not tend to cope well with emergence. In other words, we are not as good as industry would like us to be at predicting and coping with non-linear interactions.
Example problem 1: When ‘Company A’ scales up a distributed system from 10,000 to 100,000 nodes and it behaves in ways that their state-of-the-art models and simulations did not predict.
Example problem 2: When ‘Company B’ deployed a system at their Manchester office, the users benefited from its functionality and it worked as expected. However, when ‘Company B’ deployed the same system at several other sites, it was resisted, and workers claimed that it did not fit the way their sites worked, despite those sites having the same business processes as Manchester.
David
2011 November 15 at 11:37
Isn’t unpredictability somewhat axiomatic in defining ULS? It seems like Agile approaches, although hardly tested at scale, have it right when they simply accept change as inevitable.
Neil
2011 November 15 at 12:11