
InfoWest

Newsletters, August 2005
 

July 25 Outage Information

On the morning of July 25th, we had a service outage at our main data center caused by problems with our dual air conditioner systems. To protect our equipment from heat buildup, much of our core network equipment shut off automatically, and we manually shut down other equipment. The outage lasted from about 5 am to about 9 am. We have made the necessary repairs and all services are operating normally. We don't expect any further problems of this sort, but we have added many safety measures to make sure we are alerted early if one occurs. As always, we appreciate your patience as we work hard to keep your service up and operational under any circumstances. Read on for more details of the problem and what we're doing to prevent it from recurring.

Description of problem

Our data center has two 5-ton air conditioner units set up in a redundant, load-sharing configuration. Each acts as the "master" for 24 hours at a time, with the second coming online as needed to supplement the master unit's cooling capacity. One of the units has tripped its circuit breaker (taking it completely offline) several times since installation. We consulted an AC repair company, who believed that the mercury thermostat was causing the unit to cycle too rapidly. When the unit doesn't have time to equalize pressure before being started again, it apparently draws a much greater current than normal. So last week we installed a new top-of-the-line electronic thermostat with a 5-minute delay timer to prevent rapid cycling.
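For readers curious how a delay timer of this kind prevents rapid cycling, here is a minimal sketch in Python. It is an illustration only, not the thermostat's actual logic; the class name `CompressorGuard` and its methods are hypothetical, and times are plain seconds passed in by the caller.

```python
class CompressorGuard:
    """Refuses a compressor restart until a minimum off-time has passed."""

    def __init__(self, min_off_seconds=300):
        # 300 seconds matches the 5-minute delay described above.
        self.min_off_seconds = min_off_seconds
        self.last_stop = None  # time of the most recent stop; None if never stopped

    def record_stop(self, now):
        """Remember when the compressor was last switched off."""
        self.last_stop = now

    def may_start(self, now):
        """Return True only if the compressor has rested long enough."""
        if self.last_stop is None:
            return True
        return now - self.last_stop >= self.min_off_seconds


guard = CompressorGuard()
guard.record_stop(now=0)
print(guard.may_start(now=60))   # False: only 1 minute has passed
print(guard.may_start(now=300))  # True: the full 5 minutes have passed
```

The enforced rest period is what gives the unit time to equalize pressure before it is started again.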

Early Monday morning, several InfoWest employees received an alert that the temperature in the data center was rising. One of them sent a follow-up acknowledgement, which went to the same group of employees. He assumed that the rise in temperature (not yet severe at the time) was due to the north AC unit's circuit breaker tripping again. His acknowledgement simply said that he would check the breaker later Monday morning during office hours, since in the past the south AC unit had been able to cool the facility by itself.

Later in the morning we received additional temperature alerts, so a tech was dispatched to check the situation. He found that the primary unit was still partially operational, but its compressor was iced up and it was blowing only warm air. The secondary unit was not correctly sensing the problems with the primary unit and was not operating its compressor; it was just using its fan to recirculate air. Since neither unit was operating a compressor, no cooling took place, leading to the high temperatures that automatically shut down some of our equipment.
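As a rough illustration of this kind of failure (the actual controller programming is not detailed here, so everything below is an assumption), the Python sketch compares two hypothetical ways a secondary unit could decide whether to run its compressor. The function names, the setpoint, and the 3-degree margin are all made up for the example.

```python
def naive_secondary_should_cool(primary_reports_running):
    # Hypothetical rule: trust the primary's own status. An iced-up unit
    # that still reports "running" would keep the secondary idle.
    return not primary_reports_running


def secondary_should_cool(room_temp_f, setpoint_f, margin_f=3.0):
    # Hypothetical rule: decide from the measured room temperature, so the
    # secondary cools whenever the room is too warm, whatever the primary says.
    return room_temp_f >= setpoint_f + margin_f


# Primary is powered and reports running, but the room is at 85 F.
print(naive_secondary_should_cool(primary_reports_running=True))  # False
print(secondary_should_cool(room_temp_f=85.0, setpoint_f=72.0))   # True
```

A decision based on the measured result (room temperature) does not depend on the failed unit correctly reporting its own condition.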

Immediate corrective action taken

Upon arriving at the data center, our technicians opened the doors to help with air circulation and manually shut down a number of our other servers and routers to prevent heat damage until we could get the air conditioners working correctly. Others arrived with electric fans to further help with circulation. We contacted the air conditioner repair company that had installed the thermostat the week before.

We further investigated the AC problems and eventually found that switching off the main power to the primary AC caused the secondary AC to begin functioning properly, and it immediately began to cool the data center.

When the AC repair company arrived, they started de-icing the primary AC and worked to find out what had gone wrong with the AC programming that prevented the secondary unit from functioning properly. They identified the problem and fixed it very quickly.

Immediate actions to prevent the problem from recurring

We have dual (two-stage) alarms for high-temperature situations that should have alerted us earlier to these problems. The first, which we received earlier in the morning, is an email-based alarm/paging system. The second works via interior building temperature sensors wired through our 24-hour monitored fire/burglary alarm company, and alerts a different set of people. We did not receive any calls from the alarm company when the temperatures surpassed our alarm point. We called them immediately, and they came onsite to repair the misconfiguration that had kept the alarm company from being alerted.

Later email alerts were not possible because our main routers had automatically shut off and were offline.
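The two alarm stages described above can be sketched in Python as follows. This is only an illustration: the function name `triggered_stages` and both threshold values are hypothetical, and we are assuming for the example that the second stage trips at a higher temperature than the first.

```python
def triggered_stages(temp_f, email_alarm_f=80.0, monitored_alarm_f=85.0):
    """Return the alarm stages that should fire at a given temperature.

    Both thresholds are made-up example values, not our real alarm points.
    """
    stages = []
    if temp_f >= email_alarm_f:
        stages.append("email/pager")    # stage 1: relies on our own network
    if temp_f >= monitored_alarm_f:
        stages.append("alarm company")  # stage 2: independent phone-based path
    return stages


print(triggered_stages(78.0))  # []
print(triggered_stages(82.0))  # ['email/pager']
print(triggered_stages(88.0))  # ['email/pager', 'alarm company']
```

The value of the second stage is that it does not depend on the routers that carry the email alerts, which is why it needs to be tested on its own.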

The AC repair company also made changes to the programming on the AC units. At the time, the ACs were switching to an "economizer mode" that blows cooler air in from outside. Since the temperatures in this area are too high for the "economizer mode" to work correctly most of the year, they disabled that mode. They also set up both ACs to run their fans continually, regardless of whether the compressor is working or not (this will help de-ice the compressor in case it freezes). Lastly, they raised the thermostat temperature in the building a couple of degrees to avoid overworking the AC units when cooling is not needed, which should help prevent further compressor freezing altogether.

As temperatures went down, we worked to bring individual components of the network back online, with the main routers being the top priority. All of our systems were back online within the hour.

Long term plans

While working with us on future plans, the AC company found that alarm circuits are available on our AC units for such things as high temperature, power problems, and compressor problems (high or low pressure). We are working to find ways to monitor those additional alarm circuits. We also manually tripped the secondary temperature alarm equipment the next morning, which resulted in immediate calls from the alarm company and verified that they had correctly configured the alarm system.

We will add additional checks for such problems by paging more people at more frequent intervals as temperatures rise or other alarms go off. We are also going to develop a way to alert critical customers to such problems more immediately, without delaying the efforts of those working on repairs. These alerts will probably be sent through an alternate channel such as wireless or telephone modems in addition to our standard Internet connections. With the actions we have taken, we hope this will never be a problem for us again. Our goal is to make our services available 100% of the time for our customers.
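One possible shape for that kind of escalating paging is sketched below in Python. It is a sketch under stated assumptions, not our finished design: the tier temperatures, the roles paged, and the paging intervals are all hypothetical example values.

```python
# Each tier: (minimum temperature in F, roles to page, minutes between pages).
# All values are hypothetical and listed from most to least severe.
ESCALATION_TIERS = [
    (90.0, ["on-call tech", "network admin", "manager"], 5),
    (85.0, ["on-call tech", "network admin"], 10),
    (80.0, ["on-call tech"], 30),
]


def escalation_for(temp_f):
    """Return (roles to page, minutes between pages) for a temperature."""
    for min_temp_f, roles, interval_minutes in ESCALATION_TIERS:
        if temp_f >= min_temp_f:
            return roles, interval_minutes
    return [], None  # below every tier: nobody is paged


print(escalation_for(82.0))  # (['on-call tech'], 30)
print(escalation_for(87.0))  # (['on-call tech', 'network admin'], 10)
```

As the temperature climbs into a higher tier, more people are paged and the pages repeat more often, so a single acknowledgement cannot quietly end the alert.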

 
