Knight Capital's Quality Control and CM Problems
On August 1, 2012, Knight Capital experienced what all IT shops hope to avoid: a software release put into production triggered errant trades that cost the company more than $440 million. In a Los Angeles Times article, Chairman of the Board Tom Joyce was interviewed about the debacle, and his words are very telling about what went wrong.
I feel strongly that configuration management and quality control could have prevented this disaster. Knight Capital has not released exactly what happened and why the software failed, but I believe we can deduce some things that probably occurred or didn’t occur before this release was put into production. A second article, from The Wall Street Journal, also discusses what happened in greater detail.
The first question that comes to mind is: Was this tested thoroughly? I have to answer “no” to this question. Had the release been fully regression tested, this type of error would most likely have been caught before production. I would say the same about system and unit testing.
According to the LA Times article, Joyce “blamed the faulty trading on software that was somehow activated when the new trading program went live.” This vague statement is very telling. He states that activated software somehow caused the glitch. Either this was done during the installation or the software became active after the installation. Either way, the bug was not found in testing, and the company paid a heavy price for this error.
We can attribute this to a poor release and deployment process. If software was activated during the release process, was the release process ever tested, documented correctly, and vetted before releasing to production?
From a quality control standpoint, we know, or at least can feel fairly certain, that something is broken in Knight Capital’s testing procedures. We can deduce that something went wrong in the CM arena. I think that Knight Capital does not have a strong or robust Change Control Board (CCB). A good CCB makes sure that all issues are considered and that a rollback procedure is in place and tested. Considering the nature of Knight Capital’s business, the necessity of rollbacks and ensuring that errors will not occur should be paramount.
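To make the CCB's role concrete, here is a minimal sketch of the kind of gate a board could apply before approving a release. Every name in it (`ReleaseRequest`, `blocking_issues`, and the four checks) is hypothetical and chosen for illustration; nothing here describes Knight Capital's actual process.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ReleaseRequest:
    """Evidence a hypothetical CCB would want before approving a release."""
    regression_suite_passed: bool
    deployment_rehearsed: bool   # install steps were run in a staging environment
    rollback_documented: bool
    rollback_tested: bool


def blocking_issues(request: ReleaseRequest) -> List[str]:
    """Return the reasons a release should be held; an empty list means approve."""
    checks = [
        (request.regression_suite_passed, "regression suite has not passed"),
        (request.deployment_rehearsed, "deployment procedure has not been rehearsed"),
        (request.rollback_documented, "rollback procedure is not documented"),
        (request.rollback_tested, "rollback procedure has not been tested"),
    ]
    return [reason for ok, reason in checks if not ok]


# Example: everything is in place except a tested rollback.
request = ReleaseRequest(
    regression_suite_passed=True,
    deployment_rehearsed=True,
    rollback_documented=True,
    rollback_tested=False,
)
for reason in blocking_issues(request):
    print("HOLD:", reason)
```

The point of the sketch is that a rollback that is documented but never tested still blocks the release.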
The LA Times article states that “new regulations could require a kill switch that could abort runaway computer trading algorithms.” Joyce goes on to say that the kill switch should not shut down an entire firm if an error occurs in one area. He says, “There are things you need to make sure you take into consideration before you just hit the big red button.” I say there are 440 million reasons why he is wrong in his assessment.
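A kill switch also does not have to be a single big red button. As a rough sketch, assuming made-up strategy names and an arbitrary loss limit, a switch could halt only the area whose losses run away while the rest of the firm keeps trading. The class and method names below are hypothetical and do not correspond to any real trading system or regulation.

```python
from dataclasses import dataclass, field
from typing import Dict, Set


@dataclass
class ScopedKillSwitch:
    """Halts one strategy at a time once its cumulative loss reaches a limit.

    Hypothetical illustration only; the limit and names are assumptions.
    """
    loss_limit: float                                   # max tolerated net loss per strategy
    pnl: Dict[str, float] = field(default_factory=dict)
    halted: Set[str] = field(default_factory=set)

    def record_fill(self, strategy: str, pnl: float) -> None:
        """Add one fill's profit or loss and halt the strategy if it breaches the limit."""
        total = self.pnl.get(strategy, 0.0) + pnl
        self.pnl[strategy] = total
        if total <= -self.loss_limit:
            self.halted.add(strategy)

    def may_trade(self, strategy: str) -> bool:
        """A halted strategy stays halted until a person intervenes."""
        return strategy not in self.halted


switch = ScopedKillSwitch(loss_limit=1_000_000)
switch.record_fill("new_router", -600_000)
switch.record_fill("new_router", -500_000)      # cumulative loss now exceeds the limit
switch.record_fill("market_making", 20_000)

print(switch.may_trade("new_router"))       # False
print(switch.may_trade("market_making"))    # True
```

Only the runaway strategy is stopped here, which addresses the worry about shutting down a whole firm without giving up the switch itself.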
Mistakes happen, and Knight Capital is not the first company to have a breakdown in CM and quality control processes that causes issues. There are several examples where poor quality control and poor configuration management caused problems. What we as practitioners of CM and QC have to do is get in front of the issues that can cause bigger problems for the companies we work for.
The Knight Capital issue should not have occurred, and the glitches should have been caught long before the software was put into production. The $440 million represents two years of profits for Knight Capital. They are lucky to still be in operation.