University Develops Technique to Keep Computers Running in Overheated Data Center
To stay safe in hot conditions, experts recommend that workers move more slowly. The same advice apparently fits data centers. To prevent outages and keep work moving, Purdue University has successfully tested a technique that keeps its computing clusters operating in overheating conditions by slowing down their nodes.
That’s proving a boon to the computing operations that researchers rely on at the Rosen Center for Advanced Computing at the West Lafayette, IN, institution. The center provides computing infrastructure services to researchers on campus and around the country. Frequently, those research projects require months of continuous computing time on thousands of processors. If something shuts down the computers while a massive multi-month calculation is being performed, the job usually has to start again from the beginning. In other words, an outage is “guaranteed” to affect many groups on campus, according to Patrick Finnegan, Unix systems administrator in Rosen’s IT Systems and Operations group.
Power outages are actually infrequent at the data center, Finnegan said. But he added that this summer, “due to some planned cooling system maintenance, coupled with the unusually hot summer, we have had some brief cooling outages.” When the temperature in the data center exceeds a certain point, the racks of computers have to be shut down. If that’s not done intentionally, they’ll shut themselves down. And that, said Mike Shuey, high-performance computing systems manager, “has ripple effects on the research efforts of the university for weeks afterward.”
Twice so far this summer shutdowns have been called for, according to Finnegan. “In both instances, the cause was a temporary capacity reduction in the campus chilled water supply.” That 50-degree water supply cools the entire facility, which includes about 15,000 processors, along with other computing systems in use in the space. When the cooling system is turned off, temperatures in the room can reach the high 80s and 90s.
To address the planned outages, Finnegan developed a technique that allows the center to continue operating the computers, though at a reduced performance level. “Basically, I use the power saving features present in almost all modern systems to slow down the system, while at the same time reducing power and cooling usage,” he explained. “This is similar to how your laptop saves power to extend battery life. Then when things are back to normal, we just turn the systems back up to full speed, and everything takes off like normal.”
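The article does not describe how Purdue's software is implemented. As a rough illustration of the general idea only, the sketch below assumes a Linux node that exposes the kernel's standard cpufreq sysfs files (`cpuinfo_min_freq`, `cpuinfo_max_freq` and `scaling_max_freq`) and caps each CPU at its lowest supported frequency, then lifts the cap again later. The function name `throttle_cpus` and its parameters are hypothetical, and this is not Finnegan's code.

```python
from pathlib import Path

# Standard location of per-CPU cpufreq controls on Linux (assumed present).
DEFAULT_SYSFS_ROOT = "/sys/devices/system/cpu"


def throttle_cpus(sysfs_root=DEFAULT_SYSFS_ROOT, restore=False):
    """Cap every CPU at its minimum frequency, or restore full speed.

    Hypothetical helper for illustration. Writing to the real sysfs
    tree requires root privileges.
    Returns a dict mapping CPU name to the frequency cap written (kHz).
    """
    applied = {}
    for cpufreq_dir in sorted(Path(sysfs_root).glob("cpu[0-9]*/cpufreq")):
        # The hardware limits are read-only files; the cap is writable.
        limit_file = "cpuinfo_max_freq" if restore else "cpuinfo_min_freq"
        target_khz = (cpufreq_dir / limit_file).read_text().strip()
        (cpufreq_dir / "scaling_max_freq").write_text(target_khz)
        applied[cpufreq_dir.parent.name] = int(target_khz)
    return applied
```

An operator might call `throttle_cpus()` on each node when the cooling supply drops and `throttle_cpus(restore=True)` once temperatures recover. Running jobs keep going throughout, only more slowly, which matches the behavior Finnegan describes.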
After Finnegan’s system was implemented, he said, temperature sensors showed the IT crew that temperatures were falling again. In one instance, on an AMD-based cluster, power usage dropped from an average of about 290 kilowatts to about 205 kW, with a performance decrease of between 50 percent and 70 percent.
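To put those figures in proportion, the short calculation below works out the relative power saving from the two averages quoted above. The helper name `percent_reduction` is illustrative, and the inputs are the article's approximate numbers rather than measured data.

```python
def percent_reduction(before, after):
    """Return the drop from `before` to `after` as a percentage of `before`."""
    return (before - after) / before * 100.0


# Approximate averages reported for the AMD-based cluster, in kilowatts.
power_before_kw = 290.0
power_after_kw = 205.0

saving = percent_reduction(power_before_kw, power_after_kw)
print(f"Power reduction: {saving:.1f}%")  # prints "Power reduction: 29.3%"
```

In other words, the cluster shed roughly 29 percent of its power draw while giving up 50 to 70 percent of its performance, a trade the center accepted because the alternative was losing all work in progress.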
“The program worked, and the datacenter didn’t overheat, so the process was a success. We actually were a bit surprised it worked so seamlessly,” said Shuey. “It’s much better to have jobs run slowly for an hour than to throw away everyone’s work in progress and mobilize staff to try to fix things.”
With the successful use of his scheme, Finnegan became a data center hero. “I was a bit overwhelmed by all of the positive responses I got,” he said.
The Purdue crew has written up its procedures and is making them available through a storefront on foliodirect, a Web site that sells licensable university technology. “High Performance Computing Power Saving Device” is priced at $250.
According to an abstract for the report, the software runs on most Linux distributions on 64-bit x86 processors from AMD and Intel.
Originally published by Campus Technology.