The Biggest Single Point of Failure in Human History - a Cloud
Updated: Oct 16, 2020
Opinion: There could be headwinds coming from the clouds. Could the current technology path cause more headaches in the future?
"Back in my day we didn't have virtualization or the cloud. In fact, we had a physical server for every application, and token ring told us when we could speak!" This is the kind of story you might hear from some of the IT veterans out there.
Lately I've been thinking a lot about past and current trends in networking and enterprise architecture overall, and how things might turn out for better or worse. Part of it is the constant bombardment of marketing and reports: you'll hear a lot about how "80% of enterprises will be moving into the cloud over the next 5 years" and that X new company is now offering YaaS, etc. (I really hope there isn't actually a Y as a service now). Don't get me wrong though, the cloud is providing a lot of opportunities for exponential growth, both for the industry and for individual companies that probably couldn't have attained it otherwise.
What I'm getting at, however, is that although things have been relatively good and calm in the cloud and networking space, I think there are some headwinds on the horizon, and we as the IT community need to fully understand the decisions we are making as we take humanity down this road. I say "we" because information technology is enabling society to evolve and 'ascend' to its next level in the "4th industrial revolution".
Now, I really don't want to get into some of the other byproducts like privacy or jobs, just how the cloud and networks could have a negative impact on us in the future based on the current direction, and what we should consider. Most of my posts haven't really gotten philosophical, but I'd like to get some of my thoughts out there. I enjoy this type of thinking and want to do a few more pieces like this.
For those who don't know, a "cloud" is an abstraction used to refer to complex systems, most often when describing data centers, networks, or services.
A Single Point
When designing technology systems, whether from a physical or logical standpoint, you generally build in redundancy, so that when a component fails the system continues to function via the backup. A single point of failure (SPOF) is a single component that multiple other components rely on, meaning that if that one piece failed, the entire system would stop functioning. Generally you try to avoid single points of failure in design.
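To make the definition concrete, here is a minimal Python sketch that looks for single points of failure in a small dependency graph. The topology and every name in it (`isp_a`, `load_balancer`, `server_1` and so on) are made up for illustration, and the approach is the simplest one I could think of: remove one component at a time and check whether a user can still reach the application.

```python
from collections import deque


def reachable(graph, start, goal, removed=None):
    """Return True if `goal` can be reached from `start` while treating
    the component `removed` as failed (breadth-first search)."""
    if removed in (start, goal):
        return False
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if node == goal:
            return True
        for nxt in graph.get(node, ()):
            if nxt != removed and nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return False


def single_points_of_failure(graph, start, goal):
    """List every intermediate component whose failure alone cuts
    `start` off from `goal`."""
    nodes = set(graph) | {n for targets in graph.values() for n in targets}
    candidates = nodes - {start, goal}
    return sorted(
        n for n in candidates if not reachable(graph, start, goal, removed=n)
    )


# Hypothetical topology: two ISPs and two servers are redundant,
# but everything funnels through one load balancer.
topology = {
    "user": ["isp_a", "isp_b"],
    "isp_a": ["load_balancer"],
    "isp_b": ["load_balancer"],
    "load_balancer": ["server_1", "server_2"],
    "server_1": ["app"],
    "server_2": ["app"],
}

spofs = single_points_of_failure(topology, "user", "app")
print(spofs)  # ['load_balancer']
```

In this toy example losing either ISP or either server is survivable, while losing the load balancer takes the application down for everyone.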
Very few systems have an overarching way of directly affecting the global human population quickly. Outside of something like the fiat currency system, which could bring the world to its knees if everyone magically lost faith or the backers failed, the only thing that could have a similar effect is the internet. The internet is a giant interconnected network of routers, switches, cabling and mobility towers that carries information to keep the world connected. As a concept it is a "cloud", and it supports the server clouds. The internet as we know it has evolved and grown for the last 30 years and has been a real driver of societal development due to its ability to connect countries and continents.
The big-time server clouds like Amazon Web Services (AWS) are highly redundant, geographically diverse systems of interconnected servers, networks and data centers with an easy-to-use software interface. They have multiple fail-safes, with as few SPOFs as possible from a detailed design perspective. There are a few other major players, and some minor ones as well, spread across hundreds of locations around the world.
However, from a global architectural perspective, the cloud as a core concept is leading to the centralization of services, applications, and data for thousands of organizations along with major governments. This means that the cloud and the internet are now a dependency for society to function, from doing business and communicating, to entertainment and leisure, even defense and security.
Part of the reason for their popularity is that some organizations lack the time, talent, leadership, or capital to deploy applications/hardware that meet the necessary solution requirements, so they would rather depend on someone else to configure and maintain their infrastructure or technology (using that wording for lack of a better definition).
The point is that from an overall application standpoint, looking at the growth and at trends such as companies preferring an opex over a capex model, everything will be in the "cloud" (aka someone else's computer) in the near future, perhaps in a 10-20 year time frame. By having critical applications, data, and other IT integrations held in essentially a single logical place, are we setting ourselves up for a single point of failure?
Outages are growing larger in scope
There are a few types of clouds I am referring to. There are the data clouds that host applications and store data, like AWS and Azure; there are the CDN-type clouds, which are more like private, content (cache) oriented networks, like Akamai and Cloudflare; then there are cloud services hosted by companies like Microsoft or Google. Finally, there is the biggest cloud on everyone's network diagram, and that is the internet.
In the last year it seems like outages have really increased, based on the news and my own experience. For example: Cloudflare had a brief outage in July which caused degraded access to the websites it proxies, T-Mobile had a network outage lasting nearly a day that affected voice calls/SMS, and Microsoft had an outage with Azure Active Directory services last month. Plus, distributed denial of service attacks are on the rise this year. There were more, which you can search for as well.
As you might know, Azure Active Directory provides authentication services for Microsoft and also for 3rd party applications with integration support. There is an on-prem version and a cloud integrated version. With this service failing, users were unable to authenticate/log in to use applications like e-mail, and some administrators were left unable to manage their domain. The article I linked illustrates some of the redundancy measures Microsoft has in this service and the lengths it goes to in validating upgrades and changes. However, the issue was a failure in the software which bypassed the typical deployment process. Although it was quickly caught, it still took a number of hours to remediate. The failure here is probably the hardest kind to catch, but when a service like AD authentication fails it can be a disaster for organizations. Microsoft O365 is probably the biggest cloud based service I can think of, and it is used by nearly every major organization on the planet. Software is becoming the cornerstone of the technology revolution, and I want to touch on that later in this article.
Cloudflare is a large 'cloud' provider that describes itself on its website this way: "Cloudflare secures and ensures the reliability of your external-facing resources such as websites, APIs, and applications. It protects your internal resources such as behind-the-firewall applications, teams, and devices. And it is your platform for developing globally scalable applications."
They have over 50k customers who essentially route their website traffic through Cloudflare for protection from things like distributed denial of service attacks. The recent outage was caused by human error, which then cascaded into hardware being overloaded. Consequently many customers' websites were unable to load. Human error is a common problem in IT, and although it can be avoided, errors can still happen even with good change management and verification. Add that error to an automation process pushing a change to thousands of nodes and a disaster is bound to happen. Luckily here it was apparently only 1 node. Cloudflare is still a relatively new company, but I wanted to list this outage because of the popularity they've attained by being 'cloud', with many websites flocking to them for protection.
Lately people generally associate the cloud with stability and simplicity, but some of these companies have grown so fast that I question whether they can mature fast enough to maintain that stability and growth. With so many websites depending on one service, the warning lights start flashing and (as my old boss would say) I raise my hand to bring up the question.
You probably heard about the T-Mobile outage over the summer. I remember being unable to call someone and also saw it trending on Twitter that day. If I recall correctly, T-Mobile's official statement said it was a configuration or hardware failure in which redundant systems did not take over, or the network did not behave as intended during this scenario, which is somewhat similar to the Cloudflare incident. The takeaway here is that phone and text services were not functioning for one of the 3 largest mobile providers in the United States, which potentially affected millions of people; this included 911 emergency services being unavailable. The FCC is currently investigating and says the outage was unacceptable.
No, not our social media dependency for validation, I mean the internet. It's the largest network, the one that makes everything possible. Yet it is a dangerous place: there are malicious actors constantly on the prowl to exploit and gain from a victim. Although there have been data leaks from cloud services, most of these are caused by user configuration. But since these bad actors exist and it's so easy for anyone to spin up a virtual instance, the possibility of these data leak incidents remains. Helping point users to secure configurations should continue to be a goal of providers.
These bad actors also attempt to take systems offline with large attacks that try to overload the bandwidth of the target. With both the volume of bandwidth and the total number of incidents growing, I question how long it will be until the mother of all DDoS attacks happens, taking offline social media, critical systems and cloud networks. To me it's not out of the realm of possibility for a large sponsored threat to spend years compromising or placing computers around the world and then execute a huge volumetric attack against a data center or cloud provider, etc. I touch on this subject in another post on the Art of Cyber War and Cyber Battle. Hopefully, with all the 5G talk and investment in backbone fiber, the providers can stay strategically ahead of the capacity race.
The reason I'm bringing all of this up is to show that, in my opinion, outages that once were smaller in scope due to the distributed nature of networks and systems are getting larger and larger as organizations integrate into these cloud services and as society becomes more dependent on technology, and thereby on the internet. The problem here is that if the trend continues, we could have a large global network or service outage that affects the public in such a way that the pitchforks come out, along with people in suits who can issue subpoenas and pass laws. We see a lot of talk now from business leaders about moving to the cloud, even talking about how to manage multi-cloud! Yet I think a lot of this is revenue and bandwagon motivated, so if or when some of these larger outages manifest and money is lost, I think some of those opinions could shift.
The strength and orderliness of the column architecture of a building cannot be disputed. Used for thousands of years by the Romans and other cultures, some column structures still stand. Contrast that with the abstraction of the cloud: has something just as strong been developed for the world to use into the distant future? Or will we be left wondering how we could have thought we humans could keep up the pace?
I'm not saying the cloud is inherently bad or that there is a better way. Realistically, like I previously mentioned, getting scale/growth at the cost of using someone else's computer is probably easier and cheaper, and companies like cheaper! Still, we should stop for a minute and think about the past and how we want the future to look.
When looking back at the growth of the internet and how it came to be, the internet included more entities and more distribution in its earlier years than the present-day hyper-cloud has. This was similar (to me) to how electrical/water utilities are distributed and affect smaller areas of control. Recently, though, there have been large consolidations like the Level 3/CenturyLink merger, T-Mobile/Sprint, or the breakup and subsequent mergers of the Bells, AT&T, SBC, etc. Plus, direct peering interconnections between companies have grown.
Previously, this meant companies deployed infrastructure ad hoc in a distributed manner. The combination of the mergers and the rise of automation to manage these enormous domains has brought about a larger failure domain, meaning single entities control larger portions of servers and networks, which increases the potential for bigger outages (e.g. the recent T-Mobile disruption).
Either way, the internet is still a BGP mess that, in all of its glory, has local and wide-reaching outages and issues whose fixes are still in their infancy. Part of this came from the lack of centralized policy for part of the growth cycle and the need to connect things as quickly as possible to keep the business momentum up.
With the cloud, on the other hand, we have a few major players in Amazon Web Services, Google, and Microsoft Azure, which have provided a centralized path lacking diversity. That mitigates some of the problems of the internet's growth, but then creates the single point of failure problem. One thing these services have over the internet is the maturity of the technology and of the administration methodology they started with. That is, the cloud as we see it today started later in time, whereas legacy web hosting and networks have evolved over a longer period (with more brownfield and legacy solutions still deployed), although the internet is still a primary dependency for hosting these cloud services. The platforms might be easier to scale but, like the AD outage showed, they could be 1 software glitch away from pandemonium.
This has made it extremely easy for companies to create services on top of services. I like to think of it as a 3 layer model, with the internet at the bottom, hosting in the middle, and application services at the top tier. This leads to a scenario where an organization might think it is getting redundancy by using different applications from different vendors, when both applications could actually be hosted on the same Azure platform, for example. That's why I think a hybrid approach is probably best. Most organizations should invest in some infrastructure and talent on-premises for critical applications or disaster recovery purposes. This will help mitigate possible problems with global outages in the future, so that if the internet is having a problem, some of the business processes can still remain operational. We don't see this mentioned a lot, and one thing pointing the other way is all of the SaaS and IaaS options, along with companies moving to a "software" model. I will say, though, that I'm noticing more hybrid solutions entering the market, but I think part of this is to help orgs transition out of on-prem applications to the cloud!
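The hidden overlap described above can be checked with a small Python sketch of the 3 layer model. The application names and the mappings are entirely hypothetical (a real inventory would have to come from asking your vendors where they actually host); the idea is just to walk each application down to its hosting platform and network, then intersect the results.

```python
# Hypothetical inventory: which platform each application is hosted on.
hosted_on = {
    "email_saas": "azure",
    "chat_saas": "azure",
    "backup_saas": "aws",
    "dr_file_share": "on_prem",
}

# Hypothetical inventory: which network each hosting platform is reached over.
carried_by = {
    "azure": "internet",
    "aws": "internet",
    "on_prem": "local_lan",
}


def dependency_chain(app):
    """Return the lower layers an application depends on:
    its hosting platform, then the network that carries it."""
    platform = hosted_on[app]
    return [platform, carried_by[platform]]


def shared_dependencies(app_a, app_b):
    """Return the lower-layer components both applications rely on."""
    return set(dependency_chain(app_a)) & set(dependency_chain(app_b))


# Two "different vendors" that fail together if Azure or the internet fails.
print(shared_dependencies("email_saas", "chat_saas"))
# Different hosting platforms, but both still need the internet.
print(shared_dependencies("email_saas", "backup_saas"))
# The on-premises share has nothing in common with the SaaS application.
print(shared_dependencies("email_saas", "dr_file_share"))
```

Only the on-premises option shares nothing with the cloud-hosted applications in this made-up inventory, which is the argument for keeping some critical or disaster recovery capability in-house.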
We should exercise caution as we proceed to integrate every application and service into a centralized entity. Humans can make mistakes, systems can fail, software can be hacked or cause disruptions, and these disruptions can get bigger and bigger as we proceed on the current track - it will find a way. (Channeling my Jeff Goldblum here). One automation push with a flawed configuration or software error can potentially cause a cascading effect where critical services become unavailable to thousands of entities, sometimes for extended periods of time.
Humans have overcome many challenges over the last 10,000 years and we will continue to make strides in many fields. Nevertheless, when looking at the technology of the future, whether it's organized chaos or orderly architecture, we should consider how things are being built, by whom and with what tools. We should constantly question from top to bottom whether we are going in the right direction, and look to establish leadership in this area with industry oriented organizations, because if we do not, other entities will create the oversight and path forward whether it's welcomed or not.
From a tactical standpoint, I'd like to see more cross training between the network, systems and software disciplines. This will help teams understand each other and make it easier to bring them together to create and maintain better products.
On the networking side of things, there is a trend of increasing software integration into networks through automation, mostly with custom solutions. However, the divide between people with network skills but no software skills and people with software skills but little network skill is increasing. We need more vendor investment and time with operators and organizations to bridge this gap with standardized solutions. By having more standardized solutions we can better develop reliable platforms for the networks of the future. With networks growing larger through bolt-on pieces, the need for automation is growing, and the dependency just builds day by day.
In the distant future, it is my opinion that we will have artificial intelligence writing the computing languages of the future and creating its own AI programs to perform actions. Until then, though, we will have humans writing the software, so I hope we can continue to develop secure practices to minimize coding issues and security flaws. After all, the cloud is very software centric under the fancy GUI. If a critical flaw were to expose one of the major cloud platforms, we would have a very big problem from a service availability or data integrity perspective. Therefore, software quality must be maintained and increased.
You probably realize by now that we have a lot of old timers retiring, so we need to ensure we are getting that experience and knowledge transferred in order to understand the mistakes of the past so they aren't made again. I believe we need to continue to pull real engineering practices into the industry from other fields to keep raising the bar on the expected quality of workmanship for the next tech generation. We need to stop bolting this onto that and look at the big picture. By doing this I think we can continue to create top talent to help support these monotonous systems of networks and servers! Moreover, it's clear not all of the talented engineers will end up at the cloud providers, which is why investing in good FTE talent can help companies execute on the hybrid model and not rely entirely on 3rd party vendors.
Until about 20 years ago we largely had multiple separate means of communication, like the telephone, internet and radio. With the decommissioning of old technology like the plain old telephone system, and with new technologies rising and merging with others, we are having to rely more and more on the internet and cloud platforms. Take the myriad VoIP providers hosting all of their servers and PBXs on cloud platforms, combine that with mobile wireless relying on IP networks, and we now have the convergence of multiple technologies that were predominantly separate before. In 10 years a global outage could mean something like 25+% of the world's population is unable to communicate with each other at one time. Or, in the case of something like a Microsoft service outage, we could see e-mail not working for tens of millions of users.
In closing, I realize a few of my statements are probably a bit overblown. After all, I did mention there are lots of fail-safes and redundancy built into these platforms. BUT, and that's a big but, there is a growing dependency as businesses and governments deploy data and services to the cloud and over the internet. This trend and the speed of cloud adoption could create a climate where outages become wider in scope and include more critical services, where public health or defense are compromised.
It's hard to argue against the model or to advocate for a distributed one, because after all the cloud is inherently distributed. Still, the software integration and network interconnections bring it back to the single failure domain discussion. In my opinion the race to capture cloud revenue is what is really driving the current push, but the fastest way to point Z isn't always the best one.
There are a lot of people pushing to help make the internet a better place, but more investment in time and capital is needed to create more redundancy, capacity and security. By improving those 3 items online we can help support the upper layers of cloud infrastructure and services. We need to ensure we are developing the best possible solutions for the long term future. Right now revenue might be the driving factor, but let's not make lost revenue and interruptions our motivation for moving away from the cloud (e.g. outages are happening and now we are looking for solutions to increase reliability). Don't forget about software development too. I'd like to see more talk from leaders about some of the aforementioned points, even if it could be unpopular at the moment.
Back to the original quote about what old IT pros are saying to us young folk: what will we be saying 20 years from now? Maybe "Wow, we really should have done it a different way." Or will it be "We couldn't have done it differently. Anyway - Marvis, re-route 100Gb of traffic from backbone 2 to sector 5." Only time will tell, but we should be cognizant of our global architectural decisions. Thank you
A few pictures were taken from r/networkingmemes.