July 25 Outage Information
Description of problem
Our data center has two 5-ton air conditioner units set up in a redundant,
load-sharing configuration. Each acts as the "master" for 24 hours at a
time, with the second coming online as needed to supplement the master unit's
cooling capacity. One of the units had tripped its circuit breaker (taking
it completely offline) several times since installation. We consulted
an AC repair company, which believed that the mercury thermostat was
causing the unit to cycle too rapidly. When the unit doesn't have time to
equalize pressure before being started again, it apparently draws a much
greater current than normal. So, last week we installed a new
top-of-the-line electronic thermostat with a 5-minute delay timer to prevent
rapid cycling.
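To illustrate what a delay timer of this kind does, here is a minimal
sketch in Python. It is not the thermostat's actual firmware; the class
name, method names, and the use of plain seconds for time are assumptions
made for illustration. The only detail taken from the description above is
the 5-minute (300-second) minimum off-time.

```python
class ShortCycleGuard:
    """Blocks a compressor restart until a minimum off-time has elapsed.

    Hypothetical illustration of a thermostat delay timer; times are plain
    seconds supplied by the caller.
    """

    def __init__(self, min_off_seconds=300.0):
        # 300 seconds matches the 5-minute delay described in the text.
        self.min_off_seconds = min_off_seconds
        self.last_stop = None  # time of the most recent compressor stop

    def record_stop(self, now):
        """Remember when the compressor was last switched off."""
        self.last_stop = now

    def may_start(self, now):
        """Return True only if the compressor has rested long enough."""
        if self.last_stop is None:
            return True  # never stopped, so nothing to wait for
        return (now - self.last_stop) >= self.min_off_seconds
```

A controller would call record_stop() whenever the compressor shuts off and
check may_start() before energizing it again, which gives the refrigerant
pressures time to equalize.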
Early Monday morning, several InfoWest employees received an alert that the
temperature was rising in the data center. One of them sent a follow-up
acknowledgement, which the same employees also received. He assumed
that the rise in temperature (not yet severe at the time) was due to
the continuing problem of the north AC unit's circuit breaker tripping. His
acknowledgement simply said that he would check the breaker
later Monday morning during office hours, since in the past the south AC unit
had been able to cool the facility by itself.
Later in the morning we received additional temperature alerts, so a tech was
dispatched to check the situation. He found that the primary unit was
still partially operational, but its compressor was iced up and the unit was
blowing only warm air. The secondary unit was not correctly sensing the
problems with the primary unit and was not operating its compressor; it was
just using its fan to recirculate air. Since neither unit had a working
compressor running, no cooling took place, leading to the high
temperatures that automatically shut down some of our equipment.
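The failure described above, where the secondary unit stayed idle because
the primary still appeared to be running, can be illustrated with a small
decision function. This is a sketch only and does not represent the actual
programming of our AC units: the function name, the idea of comparing room
temperature to the primary unit's supply-air temperature, and all threshold
values are assumptions chosen for illustration.

```python
def secondary_should_cool(room_f, setpoint_f, primary_supply_f,
                          min_drop_f=10.0, supplement_margin_f=3.0):
    """Decide whether the secondary unit should run its compressor.

    All parameters are hypothetical, in degrees Fahrenheit:
      room_f              current room temperature
      setpoint_f          thermostat setpoint
      primary_supply_f    temperature of the air leaving the primary unit
      min_drop_f          minimum cooling the primary must deliver to be
                          considered working
      supplement_margin_f how far above setpoint the room may drift before
                          the secondary helps a working primary
    """
    if room_f <= setpoint_f:
        return False  # no cooling demand at all

    # Judge the primary by what it delivers, not by whether it is powered.
    primary_is_cooling = (room_f - primary_supply_f) >= min_drop_f
    if not primary_is_cooling:
        return True  # primary is blowing warm air: take over

    # Primary works; only supplement it if the room keeps getting warmer.
    return room_f >= setpoint_f + supplement_margin_f
```

For example, with a 72 degree setpoint, a room at 80 degrees, and the
primary blowing 79 degree air, the function returns True, so the secondary
would take over instead of only running its fan.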
Immediate corrective action taken
Upon arriving at the data center, our technicians opened the doors to
help with air circulation and manually shut down a number of our other
servers and routers to prevent heat damage until we could get the air
conditioners working correctly. Others arrived with electric fans to further
help with circulation. We contacted the air conditioner repair company
that had installed the thermostat the week before.
We investigated the AC problems further and eventually found
that when we switched off the main power to the primary AC, the secondary AC
began to function properly and immediately started to cool the
data center.
When the AC repair company arrived, they started de-icing the
primary AC and set out to find what had gone wrong with the AC programming
that prevented the secondary unit from functioning properly. They identified
the problem and had it fixed very quickly.
Immediate actions to prevent the problem from recurring
We have dual (two-stage) alarms for high-temperature situations that should
have alerted us earlier to these problems. The first, which we received
earlier in the morning, is an email-based alarm/paging system. The
second works via interior building temperature sensors wired through our
24-hour monitored fire/burglary alarm company, and alerts a different set of
people. We did not receive any calls from the alarm company when the
temperatures surpassed our alarm point. We immediately called them, and
they came onsite to repair the misconfiguration that had kept the alarm
company from being alerted.
Later email alerts were not possible because our main routers
had been automatically shut off and were offline.
The AC repair company also made changes to the programming on the AC
units. The ACs were at the time switching to an "economizer mode" that
blows cooler air in from outside. Since the temperatures in this area
are too high for the "economizer mode" to work correctly most of the year,
they disabled that mode. They also set up both ACs to run their fans
continually, regardless of whether the compressor is working or not (this will
help de-ice the compressor in case it freezes). Lastly, they raised the
thermostat temperature in the building a couple of degrees to avoid
overworking the AC units when cooling is not necessary, which should help
prevent further compressor freezing altogether.
As temperatures went down, we worked to bring individual components of the
network back online, with the main routers being the top priority. All of our
systems were back online within the hour.
Long-term plans
In working with the AC company to make future plans, they found that alarm
circuits are available on our AC units for such things as high temperature,
power problems, and compressor problems (high or low
pressure). We are working to find ways to monitor those additional
alarm circuits. We also manually tripped the secondary temperature
alarm equipment the next morning, which resulted in immediate calls from
the alarm company and verified that the alarm company had
correctly configured the alarm system.
We will add additional checks for such problems in the future by paging more
people at more frequent intervals as temperatures rise or other alarms go
off. We are also going to develop a way to alert critical customers to
such problems more quickly, in a way that doesn't delay the efforts of
those working to repair the problems. These alerts will probably be
sent through an alternate channel such as wireless or telephone modems in
addition to our standard Internet connections. We hope that, with the actions
we have taken, this will not be a problem for us again. Our goal is
to make our services available 100% of the time for our customers.
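The plan to page more people at more frequent intervals as temperatures
climb could be expressed as a simple escalation table. The sketch below is
only an illustration of the idea, not our actual paging configuration: the
tier thresholds, group names, and intervals are all hypothetical.

```python
# Hypothetical escalation tiers, ordered from least to most severe:
# (degrees F above the alarm point, groups to page, minutes between pages)
ESCALATION_TIERS = [
    (0.0, ["on_call"], 30),
    (5.0, ["on_call", "all_techs"], 15),
    (10.0, ["on_call", "all_techs", "management"], 5),
]


def escalation_for(temp_f, alarm_point_f):
    """Return (groups, interval_minutes) for a reading, or None if normal."""
    excess = temp_f - alarm_point_f
    if excess < 0:
        return None  # below the alarm point: nobody is paged

    chosen = None
    for threshold, groups, interval in ESCALATION_TIERS:
        if excess >= threshold:
            chosen = (groups, interval)  # keep the most severe matching tier
    return chosen
```

With an alarm point of 80 degrees, a reading of 86 degrees would page both
the on-call tech and all techs every 15 minutes, while a reading of 95
degrees would add management and repeat every 5 minutes.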