Expert Reaction
These comments have been collated by the Science Media Centre to provide a variety of expert perspectives on this issue. Feel free to use these quotes in your stories. Views expressed are the personal opinions of the experts named. They do not represent the views of the SMC or any other organisation unless specifically stated.
Tom Worthington is an Honorary Senior Lecturer in the School of Computing, Australian National University.
It is no surprise that a software upgrade caused the Optus outage. This is similar to the problem which took out the Australian Population Census in 2016.
(https://www.abc.net.au/news/2016-08-09/abs-website-inaccessible-on-census-night/7711652)
One question which needs to be asked is whether Optus is implementing the two-man rule. That is, can one person make a change to the system on their own? One person should input the change and another should check it.
I have been asked if Optus should have redundancy. It would be possible to replicate all the hardware, but that would double the cost of services to customers and would not stop a systematic failure of this sort.
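As an illustration only, the two-man rule described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not a description of Optus's change process: the `ChangeRequest` class, its fields, and the names "alice" and "bob" are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ChangeRequest:
    """Hypothetical record of a proposed change to a production system."""
    description: str
    author: str
    approver: Optional[str] = None

    def approve(self, reviewer: str) -> None:
        """Record an approval, rejecting any attempt at self-approval."""
        if reviewer == self.author:
            raise PermissionError("author cannot approve their own change")
        self.approver = reviewer

    def can_apply(self) -> bool:
        """A change may be applied only after a second person has approved it."""
        return self.approver is not None


change = ChangeRequest("upgrade routing software", author="alice")

# The author tries to approve their own change: this must be refused.
try:
    change.approve("alice")
    self_approved = True
except PermissionError:
    self_approved = False

blocked_before_review = not change.can_apply()

# A second person checks the change, after which it may go ahead.
change.approve("bob")
allowed_after_review = change.can_apply()
```

The point of the design is that the check lives in the tooling rather than in policy alone, so a single person cannot push a change through unreviewed.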
One surprising outcome is that, in this case, mobile phones proved more reliable than landlines for emergency calls. The mobile phone standards have provisions for using any company's network to make an emergency call. So phones automatically switched from Optus to Telstra or Vodafone.
The Australian Government is already working on mobile roaming between carriers during natural disasters.
(https://minister.infrastructure.gov.au/rowland/media-release/government-scope-emergency-mobile-roaming-capability-during-natural-disasters)
This could be extended to cover other network outages. On a trip to India, I used one telco in Goa, and when they were not available in Bangalore, the phone automatically switched to another network. This is a commercial arrangement between carriers. It would require some difficult commercial and regulatory negotiations to implement in Australia.
For government, business, and domestic users of internet and phone services, there are some clear lessons from the Optus outage. Don't have all your phones and internet provided by the one company. If you are providing safety-critical services, have connections to multiple networks.
Mr Graeme Hughes is Director Executive Education and Director of the Co-Design Lab in the Business School at Griffith University.
It is encouraging to see Optus share findings revealed during their internal investigation. In an era where society depends heavily on interconnected technology, establishing trust in service providers is crucial from a consumer standpoint. Had the outage occurred a week earlier, at the peak of raging bushfires, the impact would have been catastrophic.
Although this is a positive step, it will be interesting to see how Optus responds: how it addresses and prevents similar issues in future, and what additional measures it takes to compensate consumers for lost trade and income.
Network instabilities resulting from changes to routing information are a well-known and predictable problem, commonly associated with software updates.
A major telco should have a disaster recovery plan which is more sophisticated than that of your average corporate network. At a minimum, they should have had a plan to revert the changes, or remotely reboot their systems. The statement from Optus in no way clarifies how this event was exceptional, or what preventative measures they had in place to mitigate the impact.
Dr Mark A Gregory is an Associate Professor in the School of Engineering at RMIT University.
The cause identified by Optus for the national outage last Wednesday morning was human error. The Optus statement is poorly worded, but it appears that a routine software upgrade to one or more key routers was the cause of the outage. A cascading failure occurred when routing information from an international peering network was received and exceeded preset safety levels on key routers. The routers then disconnected from the Optus IP Core network, bringing down the entire network.
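As a rough illustration of the mechanism described above, the Python sketch below models routers that drop off the core once the routing information they receive exceeds a preset limit. It is a toy model under stated assumptions, not Optus's actual configuration: the `Router` class, the limit of 1000 routes, the router names, and the prefix counts are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Router:
    """Toy model of a router with a preset limit on accepted routes."""
    name: str
    max_routes: int                      # hypothetical safety limit
    routes: set = field(default_factory=set)
    connected: bool = True

    def receive(self, prefixes):
        """Accept route announcements; disconnect if the limit is exceeded."""
        if not self.connected:
            return
        self.routes.update(prefixes)
        if len(self.routes) > self.max_routes:
            # Self-protection: leave the core rather than risk overload.
            self.connected = False
            self.routes.clear()


def propagate(routers, prefixes):
    """Send the same announcements to every router; return those still up."""
    for router in routers:
        router.receive(prefixes)
    return [router.name for router in routers if router.connected]


# Hypothetical numbers, chosen only to show the effect.
core = [Router(f"core-{i}", max_routes=1000) for i in range(3)]
normal_update = {f"10.0.{i}.0/24" for i in range(200)}
oversized_update = {f"10.{i // 256}.{i % 256}.0/24" for i in range(1500)}

up_after_normal = propagate(core, normal_update)    # all three stay up
up_after_flood = propagate(core, oversized_update)  # none stay up
```

Because every router in the sketch applies the same limit to the same update, they all disconnect together, which is why a safety mechanism that protects each router individually can still take down the whole network.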
Optus has not explained what went wrong with the testing that should have taken place before the routing software upgrade.
There is also no explanation of why the key routers appear to have lacked redundancy. With redundancy in place, a problem on the key routers would trigger a switch to redundant routers, which you would expect to be running the previous version of the software.
There remain a number of open questions that Optus has failed to answer, but we now know that the Optus outage was not caused by hardware failure and was not related to a cyber attack.