The Real Problem with the Software at Southwest Airlines
From a former Senior Software Engineer at Southwest Airlines who was responsible for fixing, maintaining, and running the software that schedules flights, pilots, and crew.
Why Southwest Airlines is behind in Technology
Senior and middle management is not adopting innovation and is misreporting the actual problems with the architecture and software to C-suite leadership. Entirely nontechnical people are assessing technical issues and reporting on the potential customer impact.
Southwest Airlines thinks of itself as an airline first and not a technology company. In today's world every company needs to be a technology company first or meet the same fate as Blockbuster.
Southwest has many highly technical developers, cloud architects, Site Reliability Engineers, and DevOps Engineers, so talent at the engineering level is not a problem. It is the nontechnical senior and middle management (particularly those with tenure) in the Technology Services and Operations department who destroy any chance to implement best practices, innovation, and new processes that would improve Southwest's software and technical posture.
Answer to why this happened
The application that manages the scheduling of crew members and pilots, called Crew, along with other APIs and services, went offline because of outdated software packages and over-utilized server resources: CPU, memory, and disk space.
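Over-utilization of this kind is what threshold-based auto scaling is meant to catch. The following is only a minimal sketch of such a scaling decision, not Southwest's system: the `Usage` type, the 60% target, the replica cap, and the disk limit are all hypothetical values chosen for illustration. It uses the common proportional rule (scale replicas by current load over target load, rounded up).

```python
from dataclasses import dataclass


@dataclass
class Usage:
    """Average resource usage across replicas, in whole percent (0-100)."""
    cpu: int
    memory: int
    disk: int


def desired_replicas(current: int, usage: Usage,
                     target: int = 60, max_replicas: int = 20) -> int:
    """Return how many replicas would bring the busiest resource near `target`.

    `target` and `max_replicas` are hypothetical defaults, not real settings.
    """
    pressure = max(usage.cpu, usage.memory)
    # Integer ceiling of current * pressure / target.
    wanted = -(-current * pressure // target)
    return max(1, min(max_replicas, wanted))


def disk_alert(usage: Usage, limit: int = 85) -> bool:
    """Adding replicas does not free disk, so disk pressure raises an alert."""
    return usage.disk >= limit
```

For example, four replicas running at 90% CPU against a 60% target would be scaled to six, while a disk at 90% would trigger an alert instead of a scale-out.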
With modern software practices like automated self-healing and auto scaling, the applications should have been able to handle a winter storm of any magnitude. The people at fault, and why they are at fault, are listed below.
Directly at fault
Senior Director Technology Services and Operations at Southwest Airlines
Senior Manager Technology Services and Operations at Southwest Airlines
Co-conspirator and Enabler
Management that supports their circle of trust in order to keep things secret and bury issues.
Why they are at fault
They own the Crew application and the other APIs and services that caused this chaos, and they did nothing to prevent this catastrophe, ignoring the recommendations that I and other Software Engineers made to shore up the application's reliability and performance.
They refused to change antiquated software development processes and practices. They also covered up major software problems and bugs, and ignored performance issues that ultimately led to this disaster.
Advice to Southwest Airlines
- Digitally transform so that you can build a Site Reliability team.
- Fire the Director and Manager listed in this post.
- Don't put responsibility for systems that are critical to your business in the hands of Directors and Managers who do not want to improve those systems. These managers find excuses to keep things as they are because they are incompetent.
- Talk directly to the engineers.
- Create a blameless culture that allows engineers to share new ideas without the fear of management retaliating.
- Identify the managers who try to punish anyone who tries to share an idea and fire those managers. Basically, fire the bullies.
- Hire technical leadership who understand cloud, incident management, postmortems, MTTR, MTTD, MTTF, and the culture that goes with them.
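For readers unfamiliar with the metrics in the last item, here is a rough sketch of how MTTD, MTTR, and MTTF might be computed from incident records. The `Incident` type and its fields are hypothetical, and definitions vary between teams: this sketch assumes MTTD runs from failure start to detection, MTTR from failure start to resolution, and MTTF is the average uptime between one resolution and the next failure.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative only."""
    started: datetime   # when the failure began
    detected: datetime  # when someone (or something) noticed
    resolved: datetime  # when service was restored


def _mean(deltas: list[timedelta]) -> timedelta:
    return sum(deltas, timedelta()) / len(deltas)


def mttd(incidents: list[Incident]) -> timedelta:
    """Mean time to detect: failure start to detection."""
    return _mean([i.detected - i.started for i in incidents])


def mttr(incidents: list[Incident]) -> timedelta:
    """Mean time to recover: failure start to resolution (one common definition)."""
    return _mean([i.resolved - i.started for i in incidents])


def mttf(incidents: list[Incident]) -> timedelta:
    """Mean uptime between a resolution and the next failure (needs 2+ incidents)."""
    ordered = sorted(incidents, key=lambda i: i.started)
    gaps = [nxt.started - prev.resolved for prev, nxt in zip(ordered, ordered[1:])]
    return _mean(gaps)
```

Tracking these numbers over time is one way leadership could see whether detection and recovery are actually improving, rather than relying on secondhand reports.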
Honestly, I really want Southwest Airlines to succeed. There are really only a few bad apples here that caused this. I love their culture and most of the people I worked with. All Southwest Airlines employees want to do the best for the company. It is time for those who don't understand technology to get out of the way of those who do and move on.