Learning from the New York Stock Exchange’s Technology Failure

The New York Stock Exchange (NYSE) experienced a serious technology failure this week of approximately 3.5 hours, after experiencing reduced functionality for the first 2.5 hours of the trading day.  The NYSE is of course a very high profile, internationally critical component of our financial systems.  System wide failures are extremely rare, and when they do occur they are publicized.  This allows us to consider what happened, and what we can learn from it that may help you.

What was the Technology Failure?:

The NYSE has numerous software applications that are integrated to provide a cohesive system for access and control.  There is the core record keeping system, systems to manage the process, customer systems to control accounts and execute trades, systems that monitor activity for fraud, etc. These systems exchange data with each other at various levels, and are dependent on being compatible and reliable.

On Tuesday evening, July 7, 2015, NYSE administrators applied an update to one of these systems to support a change in how the industry timestamps transactions. On Wednesday morning, July 8, 2015 the NYSE started noticing issues with communications between systems and applied an update to the customer system, this in turn created more issues.

The problem was not resolved, and at 11:30am the NYSE shut down trading and continued to work on the issue. At just after 3:00pm, non-updated backup systems were brought up in place of the production systems and operations resumed.

A quick synopsis can be seen here: http://www.cio.com/article/2946354/software-update-caused-nyse-suspension.html

What do we learn that is applicable?

What does you SMB sized organization take away from this?

We may be able to continue operations. The NYSE must have a level playing field to allow everybody to execute trades at the same time, or else fraud or inequality of opportunity become an issue.  Your business may be able to continue operations without a complete shutdown if one function is limited or creating any data issues.  For example, if your customer service system is down and orders via the web cannot be taken, it may be possible to place a message holder informing customers they can call customer service to place an order.  You may need to temporarily reallocate staff to handle more call volume, but customers can still be serviced and a more intimate conversation take place during the transaction.

Systems are complex, especially multiple systems that communicate with each other.  Software, especially software designed for a specific organization and use, can be complex. The luxury of waiting for others to test it in the real world is not present.  So testing is essential and it must reflect the real world: real data, real transactions, real systems that mirror the production system with the changes tested applied.  The testing must be broad, rigorous and deliberate, and results must be tracked.  Automated test tools can make the process more efficient, but they are just pieces of software and must be setup and used correctly. When multiple systems are involved and dependent upon each other, they all must be exercised.

Disaster Recovery works, but is a choice to execute.  In this case, the NYSE decided to cut over to the backup systems to continue operations.  This is not the same thing as pulling a server out of the closet and installing everything and going back to operations.  This is a “hot system”, one that has all of the live data but was not updated with the errant code.  It is not a small decision to cut over, as there is normally a cut back process when issues are resolved, but one they could make because they had designed the systems for it.  This allowed them to resume operations while still dealing with the issue. Most small organizations do not have this capability, but they can, and can have it for a very economical price.  My firm, Keystone Technology Consultants, offers this for even very small clients of 20 users. It is not just a peace of mind issue; it literally allows an organization to continue operations and keep the flow of work and money, and maintain their reputation and client relationships. It is essential.

I would love to hear your view of this, feel free to comment below.