Cooling failure in a data center is not a maintenance problem — it is an SLA crisis. At $9,000 per minute of downtime (Uptime Institute), a 23-minute thermal event costs more than $200,000 before the first invoice is written. The difference between a facility that maintains 99.999% cooling uptime and one that loses that SLA is not better equipment. It is better maintenance. Sign in to OxMaint to start your data center cooling maintenance programme, or book a demo to see the IoT sensor integration and critical cooling PM workflow.
Case Study / Critical Facility
Data Center Maintains 99.999% Cooling Uptime with OxMaint HVAC Monitoring
How a Tier III colocation data center eliminated cooling-related downtime incidents, reduced PUE from 1.74 to 1.41, and maintained 99.999% cooling infrastructure uptime across 24 months using OxMaint IoT sensor integration and predictive maintenance scheduling.
99.999%
Cooling infrastructure uptime — 24 consecutive months
68%
Reduction in unplanned cooling incidents (Year 1)
1.74 → 1.41
PUE improvement over 18 months
$892K
SLA penalties avoided in Year 1 alone
9 days
Earliest thermal fault detected before critical threshold
Facility Profile
Type: Tier III colocation data center, N+1 cooling redundancy throughout
Size: 12,000 sq ft white space, 480 rack positions, average rack density 8.4 kW
Cooling Assets: 6 CRAC units, 2 chillers, 4 cooling towers, 8 CDUs, 12 in-row cooling units
SLA Commitment: 99.999% uptime, 5.26 minutes allowable downtime per year per customer contract
Prior Monitoring: BMS-level alerts only; no trend analysis, no predictive capability, no PM scheduling
OxMaint Features: IoT Sensor Integration, Predictive Maintenance AI, Redundancy-Aware PM Scheduling, Compliance Documentation
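The 5.26-minute allowance in the SLA commitment follows directly from the availability percentage. A minimal sketch of that arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is illustrative and not part of any OxMaint API:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600; leap years ignored for simplicity


def allowed_downtime_minutes(availability_pct: float, years: float = 1.0) -> float:
    """Return the downtime budget, in minutes, for an availability target."""
    unavailable_fraction = 1.0 - availability_pct / 100.0
    return unavailable_fraction * MINUTES_PER_YEAR * years


if __name__ == "__main__":
    for pct in (99.999, 99.94):
        per_year = allowed_downtime_minutes(pct)
        two_years = allowed_downtime_minutes(pct, years=2)
        print(f"{pct}%: {per_year:.2f} min/year, {two_years:.2f} min over 24 months")
```

The same function applies to the 99.94% prior-period availability reported in the results: that level corresponds to roughly 315 minutes of downtime per year, or about 631 minutes over a full 24-month window.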
The Incident That Changed Everything
Day 1
CRAC unit discharge air temperature 0.8°F above baseline. BMS logs the reading. No alert generated — below alarm threshold.
Day 5
Temperature deviation now 1.9°F above baseline. Still below BMS alarm threshold. Trend not reviewed — no trend monitoring in place.
Day 9
Compressor trips on high head pressure. Adjacent CRAC units pick up load — but one has a partially clogged condenser coil reducing capacity 22%, and another has a refrigerant charge 15% below spec from an unescalated leak.
+23 min
340 rack units enter thermal shutdown. 47 customer workloads go offline. SLA penalties: $892,000. Emergency chiller rental: $34,000. Investigation and remediation: $28,000. Total incident cost: $954,000.
Root Cause
Not equipment failure. Not inadequate redundancy. The N+1 design worked as intended. The failure was three maintenance deficiencies — a developing CRAC compressor fault not detected, a coil not cleaned on schedule, and a refrigerant leak documented but never actioned — that individually would not have caused an outage but combined to overwhelm the redundancy margin.
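How three individually tolerable deficiencies can exhaust an N+1 margin can be sketched with simple capacity bookkeeping. The example below assumes all six CRAC units form a single redundancy group (so five unit-equivalents are required) and assumes the undercharged unit delivers 90% of rated capacity; the source gives the 22% coil-related loss but no capacity figure for the refrigerant leak, so that value is purely illustrative.

```python
from dataclasses import dataclass


@dataclass
class CoolingUnit:
    name: str
    capacity_fraction: float  # 1.0 = full rated capacity, 0.0 = offline


def redundancy_margin(units: list[CoolingUnit], units_required: float) -> float:
    """Effective capacity minus required capacity, in unit-equivalents."""
    effective = sum(u.capacity_fraction for u in units)
    return effective - units_required


# Assumption: one N+1 group of six units, so N = 5 are required.
healthy = [CoolingUnit(f"CRAC-{i}", 1.0) for i in range(1, 7)]

# Day 9 scenario: one compressor tripped, one coil-fouled unit at 78%,
# one undercharged unit at an ASSUMED 90% (the source gives no figure).
day_9 = [
    CoolingUnit("CRAC-1", 0.0),
    CoolingUnit("CRAC-2", 0.78),
    CoolingUnit("CRAC-3", 0.90),
    CoolingUnit("CRAC-4", 1.0),
    CoolingUnit("CRAC-5", 1.0),
    CoolingUnit("CRAC-6", 1.0),
]

print(f"healthy margin: {redundancy_margin(healthy, 5):+.2f}")
print(f"day 9 margin:   {redundancy_margin(day_9, 5):+.2f}")
```

Under these assumptions a single tripped unit leaves the margin at exactly zero, which is what N+1 is designed to absorb. The degraded standby units are what push it negative.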
The Fix Required
Not more redundancy. Not more equipment. A maintenance management system that tracked every CRAC unit's actual condition against baseline, enforced PM completion across all units simultaneously, and ensured that documented defects generated work orders — not logbook notes.
OxMaint Implementation: What Changed
Thermal Monitoring
Before: BMS room-level temperature alerts with no per-unit performance trending. Alert fires when temperature is already critical.
After OxMaint: IoT sensors on every CRAC unit, with discharge air temperature trended against a unit-specific baseline. Alert fires when deviation begins, not when it peaks.
PM Scheduling
Before: Calendar-based PM on 12-month intervals per unit. No coordination between units: CRAC 1 and CRAC 2 could be on scheduled PM simultaneously, leaving no redundancy.
After OxMaint: Redundancy-aware PM scheduling. OxMaint prevents simultaneous PM on units sharing a redundancy group, so N+1 coverage is maintained at all times.
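The redundancy-aware rule can be illustrated with a simple overlap check. This is a sketch under stated assumptions, not OxMaint's implementation: the `PMWindow` type, the `group` label, and the dates are hypothetical, and the rule enforced is the N+1 case where at most one unit per redundancy group may be out for PM at a time.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class PMWindow:
    unit: str
    group: str  # hypothetical redundancy-group label
    start: datetime
    end: datetime


def conflicts(new: PMWindow, scheduled: list[PMWindow]) -> list[PMWindow]:
    """Return scheduled windows on other units in the same group that overlap `new`."""
    return [
        w for w in scheduled
        if w.group == new.group
        and w.unit != new.unit
        and w.start < new.end
        and new.start < w.end
    ]


def try_schedule(new: PMWindow, scheduled: list[PMWindow]) -> bool:
    """Append `new` only if no other unit in its group is out at the same time."""
    if conflicts(new, scheduled):
        return False
    scheduled.append(new)
    return True


plan: list[PMWindow] = []
first = PMWindow("CRAC-1", "cooling-A", datetime(2025, 3, 4, 8), datetime(2025, 3, 4, 12))
overlap = PMWindow("CRAC-2", "cooling-A", datetime(2025, 3, 4, 10), datetime(2025, 3, 4, 14))
later = PMWindow("CRAC-2", "cooling-A", datetime(2025, 3, 5, 8), datetime(2025, 3, 5, 12))

results = [try_schedule(w, plan) for w in (first, overlap, later)]
print(results)  # [True, False, True]
```

The overlapping CRAC-2 window is rejected because CRAC-1 is already out in the same group. Moving it to the next day restores N+1 coverage and the request is accepted.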
Defect Escalation
Before: Technician logbook entries for defects found during PM. Refrigerant leak documented in logbook three months before the incident and never converted to a work order.
After OxMaint: Any defect logged in OxMaint automatically generates a work order. No logbook-only documentation. Defects without work orders cannot exist in the system.
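One way to make "defects without work orders cannot exist" a structural property rather than a policy is to give the defect record a required work-order field and expose only a single entry point that creates both. The sketch below is a hypothetical data model for illustration; the class and method names are assumptions and do not describe OxMaint's schema.

```python
from dataclasses import dataclass, field


@dataclass
class WorkOrder:
    id: int
    asset: str
    description: str
    status: str = "open"


@dataclass
class Defect:
    asset: str
    description: str
    work_order: WorkOrder  # required: a defect cannot be built without one


@dataclass
class MaintenanceLog:
    defects: list[Defect] = field(default_factory=list)
    work_orders: list[WorkOrder] = field(default_factory=list)

    def log_defect(self, asset: str, description: str) -> Defect:
        """Record a defect and open its work order in one step."""
        wo = WorkOrder(len(self.work_orders) + 1, asset, description)
        defect = Defect(asset, description, wo)
        self.work_orders.append(wo)
        self.defects.append(defect)
        return defect

    def open_work_orders(self) -> list[WorkOrder]:
        """Work orders not yet closed, i.e. defects still awaiting action."""
        return [wo for wo in self.work_orders if wo.status == "open"]


log = MaintenanceLog()
leak = log.log_defect("CRAC-3", "Refrigerant charge 15% below spec")
print(leak.work_order.id, len(log.open_work_orders()))
```

With this shape, the refrigerant leak from the incident would have appeared in the open work order list from the day it was found, instead of sitting in a logbook.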
Standby Unit Readiness
Before: No formal standby unit verification. Backup CRAC units assumed ready, with capacity checked only when activated during a real event.
After OxMaint: Monthly capacity verification work orders for all standby units. Refrigerant charge, coil cleanliness, and compressor current draw verified before the unit is ever needed.
99.999% Cooling Uptime Is a Maintenance Achievement. OxMaint Makes It Repeatable.
IoT sensor trending, redundancy-aware PM scheduling, mandatory defect escalation, and instant compliance documentation — built for Tier III and Tier IV critical cooling environments.
24-Month Results: Uptime, PUE, and Cost
Cooling Uptime
99.999%
24 consecutive months within the 99.999% SLA, which allows 5.26 minutes of downtime per year. Previous 24-month period: 99.94% availability (equivalent to roughly 315 minutes of downtime per year) and $2.1M in SLA penalties.
Unplanned Cooling Incidents
11 / year → 3.5 / year
68% reduction. Remaining incidents were power-related, not cooling maintenance failures.
PUE (Power Usage Effectiveness)
1.74 → 1.41
19% improvement. Coil cleaning compliance and refrigerant charge maintenance drove the majority of the gain. At 4 MW average IT load, the PUE improvement saves approximately $420,000 annually in energy costs.
PM Completion Rate
58% → 97%
OxMaint mobile app eliminated the documentation gap. Technicians complete PM records from the equipment, not from memory at shift end.
SLA Penalty Exposure
$2.1M / 2yr → $0
No cooling-maintenance-attributable SLA penalties in 24 months of OxMaint deployment.
The IoT Monitoring Dashboard: What Gets Watched and Why
| Asset | Sensor Parameters | Baseline Alert Trigger | Lead Time Before Critical |
| --- | --- | --- | --- |
| CRAC Units (×6) | Discharge air temp, return air temp, refrigerant pressure, compressor current, fan vibration | 0.5°F above baseline for 3 consecutive readings | 5–14 days |
| Chillers (×2) | Approach temperature, condenser ΔT, compressor bearing vibration, chiller COP vs rated | COP drops >8% below baseline OR approach temperature rises >1.5°F | 2–6 weeks |
| Cooling Towers (×4) | Approach temperature, fan motor current, water conductivity, basin level | Approach temperature rises >2°F above design point | 1–3 weeks |
| CDUs (×8) | Supply/return water temperature delta, pump flow rate, pump motor current | ΔT drops >10% below design (indicates reduced flow or fouling) | 1–4 weeks |
| In-Row Cooling (×12) | Hot aisle inlet temperature per row, unit discharge temperature, filter ΔP | Hot aisle inlet exceeds 80°F (close to the ASHRAE recommended upper limit of 80.6°F / 27°C) | Hours; escalate immediately |
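The CRAC trigger in the table (0.5°F above baseline for 3 consecutive readings) is a simple run-length rule. A minimal sketch follows, assuming one reading per day and a hypothetical 55.0°F baseline. The Day 1 and Day 5 deviations (0.8°F and 1.9°F) come from the incident timeline, and the intermediate readings are invented for illustration.

```python
from typing import Optional, Sequence


def first_alert_index(
    readings: Sequence[float],
    baseline: float,
    threshold: float = 0.5,
    consecutive: int = 3,
) -> Optional[int]:
    """Index of the reading that completes `consecutive` deviations in a row
    of at least `threshold` above baseline, or None if the rule never fires."""
    run = 0
    for i, value in enumerate(readings):
        if value - baseline >= threshold:
            run += 1
            if run >= consecutive:
                return i
        else:
            run = 0  # a single in-range reading resets the streak
    return None


BASELINE_F = 55.0  # hypothetical unit-specific baseline
# Daily discharge air temperatures, Day 1 through Day 5 (Days 2-4 are invented).
daily = [55.8, 56.0, 56.3, 56.6, 56.9]

idx = first_alert_index(daily, BASELINE_F)
print(f"alert on reading {idx + 1}" if idx is not None else "no alert")
```

Under these assumptions the rule fires on the third daily reading, about six days before the Day 9 compressor trip, which falls within the 5–14 day lead time listed for CRAC units.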
The incident that led us to OxMaint was the most expensive maintenance failure in our facility's history — and what made it infuriating in retrospect is that every piece of information needed to prevent it was already in our systems. The discharge temperature trend was in the BMS. The condenser coil inspection was in the maintenance log. The refrigerant leak was in a technician's handwritten note. None of it was connected. OxMaint gave us a single system where a sensor deviation becomes a trend, a trend becomes an alert, an alert becomes a work order, and a work order becomes a completion record. That closed loop is what 99.999% uptime actually looks like from the maintenance side — it is not luck, and it is not redundancy. It is every defect tracked from detection to resolution without exception.
Director of Critical Facilities, Tier III Colocation Data Center
12,000 sq ft white space · 480 rack positions · 24-month OxMaint deployment · 99.999% cooling uptime maintained
Five Nines of Cooling Uptime Is Earned One Work Order at a Time. Start with OxMaint.
IoT sensor integration, redundancy-aware PM scheduling, mandatory defect escalation, and Tier compliance documentation — the critical facility maintenance stack that protects SLAs and keeps customers signed.
Frequently Asked Questions
What is the most common cause of cooling-related downtime in Tier III data centers?
The most common root cause is not a single equipment failure but the convergence of multiple maintenance deficiencies that individually would not exceed the redundancy margin but collectively do. The incident in this case study is representative: a developing CRAC fault, an overdue condenser coil cleaning, and an unescalated refrigerant leak — each manageable in isolation — combined to overwhelm the N+1 redundancy during a 23-minute event. Tier III facilities lose their effective redundancy not because of equipment inadequacy but because backup units are not maintained to the same standard as primary units, creating a false sense of security. OxMaint's redundancy-aware PM scheduling prevents this by ensuring standby units are verified monthly for actual capacity, not assumed ready.
Sign in to OxMaint to configure redundancy-aware PM for your cooling infrastructure.
How much can PUE improvement from maintenance compliance save a data center annually?
For a facility running 4 MW average IT load at $0.08/kWh, moving PUE from 1.74 to 1.41 saves approximately $420,000 per year in energy costs — primarily from coil cleaning compliance (maintaining heat transfer efficiency), refrigerant charge accuracy (preventing compressor overwork), and cooling tower approach temperature management. PUE degradation from maintenance drift is responsible for 30–40% of efficiency loss in aging facilities. The OxMaint maintenance programme in this case study produced a 0.33 PUE improvement within 18 months through structured PM completion — no capital investment in new equipment required.
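The savings estimate can be checked from the stated inputs, since total facility energy is IT load multiplied by PUE. The sketch below assumes a constant 4 MW IT load for all 8,760 hours of the year and a flat $0.08/kWh tariff; both simplifications are assumptions, and the function names are illustrative.

```python
HOURS_PER_YEAR = 8760


def annual_energy_cost(it_load_kw: float, pue: float, price_per_kwh: float) -> float:
    """Total facility energy cost per year: IT load x PUE x hours x tariff."""
    return it_load_kw * pue * HOURS_PER_YEAR * price_per_kwh


def annual_savings(
    it_load_kw: float, pue_before: float, pue_after: float, price_per_kwh: float
) -> float:
    """Yearly cost difference between two PUE levels at the same IT load."""
    return (
        annual_energy_cost(it_load_kw, pue_before, price_per_kwh)
        - annual_energy_cost(it_load_kw, pue_after, price_per_kwh)
    )


savings = annual_savings(4000, 1.74, 1.41, 0.08)
improvement_pct = (1.74 - 1.41) / 1.74 * 100

print(f"PUE improvement: {improvement_pct:.0f}%")
print(f"Annual savings at constant load: ${savings:,.0f}")
```

With these simplified inputs the calculation yields roughly $925,000 per year, well above the approximately $420,000 quoted. The quoted figure therefore appears to rest on inputs not stated here, such as a lower effective load, a different blended tariff, or averaging over the 18-month improvement period, so treat this sketch as an upper-bound check and not a reconciliation.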
Book a demo to see PUE tracking linked to maintenance compliance in OxMaint.
How does OxMaint's IoT integration differ from the BMS monitoring a data center already has?
BMS monitoring is designed for real-time operational control — it fires an alarm when a threshold is crossed and expects an operator to respond. It is not designed to detect trend-based degradation before the threshold is reached, track PM completion against a schedule, enforce that defects generate work orders, or provide compliance documentation for audits. OxMaint operates as the maintenance layer above the BMS — ingesting sensor data to detect trends days or weeks before alarms would fire, connecting those trends to PM schedules and corrective work orders, and generating the audit-ready documentation that Tier III compliance and enterprise customer contracts require. The two systems are complementary: BMS for real-time operations, OxMaint for maintenance intelligence and compliance.