Assure Availability for a 5,000 Node Grid
The Company
This large financial institution buys mortgage “pools” – groups of mortgages packaged for resale – valued at $1 million to $5 million.
The Scenario
Valuing mortgage pools for an overnight position is such a massive calculation that it can only be performed efficiently on a server grid. This particular calculation requires a grid of 5,000 nodes to quickly calculate the price of each pool during intraday processes, as well as for end-of-day valuation. All of the systems in the grid, and their installed software, must be identical in order to work together. Keeping 5,000 nodes up and running at optimum speed is difficult enough. Assuring that all 5,000 machines have identical systems and installed software is almost impossible.
The Challenge: an old solution that doesn’t work
The traditional way to “maintain” the grid was manual, reactive, and frenzied. Staff spent a long time poring over audit logs to find what had changed on the machines that could be causing the instability. Some of the solutions that have been applied hook the kernel (making core operating system level changes), which raises concerns for real-time processing: performance, stability, and compatibility during future operating system upgrades. Anything non-critical placed in the processing path is invasive, and it has the potential to crash.
Before the company found SignaCert, they were bidding on a book that was so complex, the grid had to run overnight. But something became unstable in the grid and it crashed. For days, they would run the valuation, and it would crash. They couldn’t figure out what was happening. After days of delays in the valuation, and hundreds of hours of people’s time, they found an authorized change that had the unanticipated effect of crashing the grid. Even repeated queries throughout the organization had failed to turn up this authorized change as a source of the problem.
In considering solutions that would fix these problems, they immediately rejected any that hooked the kernel. They also rejected any solutions based on one-to-one monitoring, which would be untenable across 5,000 nodes.
The SignaCert Solution: A New Way
SignaCert gives grid managers a way to maintain the grid proactively and automatically, never interfering with the grid. The Enterprise Trust Server™ (ETS) executes efficient one-to-many verification, using a decoupled, device-independent reference.
How SignaCert maintains the grid
The company uses ETS to monitor the nodes in the grid against a master reference and normalize the machines, noting deviations so that they can bring deviating nodes back into spec. ETS is able to ignore inconsequential hardware differences, and focus on the software. In this company’s grid, there are known and acceptable differences between machines in the grid that do influence the software on the machine; for instance, a different network card has a different driver.
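The one-to-many comparison described above can be illustrated with a minimal Python sketch. This is not the ETS implementation or its API; the function names, the use of SHA-256 digests, the file paths, and the idea of an explicit list of accepted paths are all assumptions made for illustration. Each node reports a manifest of file digests, every manifest is checked against a single master reference, and known, acceptable differences (such as a network card driver) are skipped.

```python
import hashlib


def file_digest(data: bytes) -> str:
    """Return a SHA-256 hex digest standing in for a file's measured signature."""
    return hashlib.sha256(data).hexdigest()


def find_deviations(reference, node_manifest, allowed_paths=frozenset()):
    """Compare one node's manifest with the master reference.

    Returns {path: (expected_digest, actual_digest)} for every path that is
    missing, unexpected, or different, skipping paths listed as acceptable.
    """
    deviations = {}
    for path in reference.keys() | node_manifest.keys():
        if path in allowed_paths:
            continue  # known, acceptable difference (hypothetical allowlist)
        expected = reference.get(path)
        actual = node_manifest.get(path)
        if expected != actual:
            deviations[path] = (expected, actual)
    return deviations


def verify_grid(reference, manifests, allowed_paths=frozenset()):
    """Check every node against the same reference; keep only nodes that deviate."""
    report = {}
    for node, manifest in manifests.items():
        deviations = find_deviations(reference, manifest, allowed_paths)
        if deviations:
            report[node] = deviations
    return report


# Hypothetical data: three nodes, one with a different NIC driver (accepted)
# and one with an unexpected change to the pricing engine.
reference = {
    "/opt/pricing/engine": file_digest(b"engine-1.0"),
    "/etc/grid.conf": file_digest(b"conf-1"),
    "/lib/modules/nic.ko": file_digest(b"driver-a"),
}
manifests = {
    "node-0001": dict(reference),
    "node-0002": {**reference, "/lib/modules/nic.ko": file_digest(b"driver-b")},
    "node-0003": {**reference, "/opt/pricing/engine": file_digest(b"engine-1.1")},
}
allowed = {"/lib/modules/nic.ko"}

report = verify_grid(reference, manifests, allowed)
for node, deviations in report.items():
    print(node, sorted(deviations))
```

In this sketch only node-0003 is reported, because the driver difference on node-0002 is on the accepted list. The report names the node and the file, which mirrors the idea of telling staff exactly where to look.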
Because the company wants its grid to run as fast as possible, it does not want to hook the kernel with anything invasive. ETS does not hook the kernel; the ETS console sits on its own appliance, and a non-invasive client monitors each device. Verifications are easy to perform when the grid is idle; in fact, this company runs ETS continuously for as long as the grid stays idle. They run a deep scan every night to assure there is no drift. Whenever the system is idle, they run a series of short, high-speed, shallower verifications to assure that critical elements of each node still match the reference.
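The two-tier schedule described above (a nightly deep scan plus shallow checks during idle time) might be sketched as follows. The function names, the "deep" and "shallow" labels, the notion of a nightly window, and the rule that the nightly scan takes priority are assumptions for illustration, not documented ETS behavior.

```python
from typing import Dict, Iterable, Optional


def plan_scan(grid_idle: bool, nightly_window: bool) -> Optional[str]:
    """Decide which verification, if any, to run right now.

    Assumption: the nightly deep scan takes priority; otherwise a shallow
    scan runs only while the grid is idle, so verification never competes
    with valuation work.
    """
    if nightly_window:
        return "deep"
    if grid_idle:
        return "shallow"
    return None


def scan_scope(reference: Dict[str, str], scan: str,
               critical_paths: Iterable[str]) -> Dict[str, str]:
    """Return the part of the reference a given scan should verify.

    A deep scan covers the whole reference; a shallow scan covers only the
    entries marked as critical (a hypothetical list).
    """
    if scan == "deep":
        return dict(reference)
    critical = set(critical_paths)
    return {path: digest for path, digest in reference.items() if path in critical}


# Hypothetical reference entries and critical list.
reference = {
    "/opt/pricing/engine": "digest-1",
    "/etc/grid.conf": "digest-2",
    "/usr/share/doc/readme": "digest-3",
}
critical_paths = ["/opt/pricing/engine", "/etc/grid.conf"]

scan = plan_scan(grid_idle=True, nightly_window=False)
if scan is not None:
    print(scan, sorted(scan_scope(reference, scan, critical_paths)))
```

Keeping the shallow scope small is what makes it practical to repeat the check many times during short idle periods, while the nightly pass catches drift anywhere else.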
Results
The company is now bidding on more multi-million dollar books, and is getting valuation results as quickly as possible, with no delays. The IT staff managing the grid spends much less time maintaining it, because they no longer have to search for deviations – ETS verifications tell them exactly where to look and what files need to be fixed. Both management and the IT staff are more confident in the grid, and have dramatically less stress about its performance.
The financial benefits are huge: the customer now knows they aren’t missing out on any multi-million dollar bids. They no longer worry how many millions might be left on the table because the grid was unstable. The cost of SignaCert ETS alone was quickly recovered in the reduced time needed for IT maintenance. But the insurance value is far greater: preventing the loss of just one of these deals far eclipses any IT costs.