Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis
No matter which firms you choose, always ask for sample reports, testing procedures, and references. Your decisions will determine the system's ultimate reliability, as well as how easy the system is to maintain. Seek experienced professionals from both your own company and third parties: information systems staff, property and operations managers, space planners, and the best consultants in the industry for every engineering discipline. The bottom line is to have proven organizations working on your project. Systems that are not designed, installed, and operated optimally by your operations team will only hurt the operations of your company and cause discontent down the road.
3.5 The Mission Critical Facilities Manager and the Importance of the Boardroom
To date, the mission critical facilities manager has not achieved high levels of prestige within the corporate world. This means that if the requirement is 24/7 availability, forever, the mission critical facilities manager must work hard to have a voice in the boardroom. The board can then become a powerful voice that supports the facilities manager and establishes a standard for managing the risks associated with older equipment or maintenance cuts. For instance, relying on a UPS system that has reached the end of its useful life, but is still deployed due to cost constraints, increases the risk of failure. The facilities manager is in a unique position to advise and paint vivid scenarios for the board. Imagine incurring losses due to downtime, plus damage to the capital equipment that is keeping the Fortune 1000 company in business.
Board members understand this language: it is comparable to managing and analyzing risk in other avenues, such as whether to invest in emerging markets in unstable economies. The risk is one and the same; the loss is measured in the bottom line.
The facilities engineering department should be run and evaluated just like any other business line; it should show a profit. But instead of increased revenue, the business line shows increased uptime, which can be equated monetarily, plus far less risk. It is imperative that the facilities engineering department be given the tools and the human resources necessary to implement the correct preventive maintenance, training, and document management programs, with the support of all company business lines.
3.6 Quantifying Reliability and Availability
Data center reliability ultimately depends on the organization as a whole weighing the dangers of outages against available enhancement measures. Reliability modeling is an essential tool for designing and evaluating mission critical facilities. The conceptual phase, or programming, of the design, should include a full Probabilistic Risk Assessment (PRA) methodology. The design team must quantify performance (reliability and availability) against cost in order to push fundamental design decisions through the approval process.
Reliability predictions are only as good as the ability to model the actual system. In past reliability studies, major insight was gained into various electrical distribution configurations using IEEE Standard 493‐2007 Recommended Practice for the Design of Reliable Industrial and Commercial Power Systems, or, The IEEE Gold Book. It is also the major source of data on failure and repair rates for electrical equipment. There are, however, aspects of the electrical distribution system for a critical facility that differ from other industrial and commercial facilities. Therefore, internal data accumulated from the engineer’s practical experience is needed to complement the Gold Book information.
Reliability analysis with PRA software provides a number of significant improvements over earlier, conventional reliability methods. The software incorporates reliability models to evaluate and calculate reliability, availability, unreliability, and unavailability. The results are compared to a cost analysis to help reach a decision about facility design. The process of evaluating reliability includes:
Analyzing any existing systems and calculating the reliability of the facility as it is currently configured.
Developing solutions that will increase the reliability of the facility.
Calculating the reliability with the solutions applied to the existing systems.
Evaluating the cost of applying the solutions.
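The four-step evaluation above can be sketched in code. The following is a minimal illustration, not PRA software: it uses the series-system approximation (the facility is up only if every component is up, so component availabilities multiply), and all MTBF, MTTR, and cost figures are invented for the example, not taken from the Gold Book.

```python
def system_availability(components):
    """Series approximation: availabilities of components in series
    multiply, where each component's A = MTBF / (MTBF + MTTR)."""
    a = 1.0
    for mtbf_h, mttr_h in components:
        a *= mtbf_h / (mtbf_h + mttr_h)
    return a

# Step 1: existing configuration (illustrative MTBF/MTTR in hours),
# e.g. a utility feed and an aging UPS.
existing = [(50_000, 24), (80_000, 8)]
a_existing = system_availability(existing)

# Steps 2-3: a proposed solution, e.g. replacing the UPS with a unit
# that has a higher MTBF and a shorter repair time.
proposed = [(50_000, 24), (200_000, 4)]
a_proposed = system_availability(proposed)

# Step 4: weigh the availability gain against project cost. The
# downtime cost per hour is an assumed business-impact figure.
HOURS_PER_YEAR = 8760
downtime_saved_h = (a_proposed - a_existing) * HOURS_PER_YEAR
cost_of_downtime_per_hour = 100_000
print(f"Downtime avoided: {downtime_saved_h:.2f} h/yr, "
      f"worth ${downtime_saved_h * cost_of_downtime_per_hour:,.0f}/yr")
```

In a real study the component data would come from the Gold Book and from internally accumulated failure and repair records, and the model topology would reflect redundancy, not a simple series chain.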
3.6.1 Review of Reliability Terminology
Reliability (R) is the probability that a product or service will operate properly for a specified period of time under design operating conditions without failure.
The failure rate (λ) is defined as the probability per unit time that a failure occurs in a given interval, given that no failure has occurred prior to the beginning of the interval.
For a constant failure rate λ, reliability as a function of time is: R(t) = e^(−λt)
Mean time between failures (MTBF), as its name implies, is the mean of the probability distribution function of failure. For a statistically large sample, it is the average time the equipment performed its intended function between failures. For the example of a constant failure rate: MTBF = 1/λ
Mean time to repair (MTTR) is the average time it takes to repair the failure and get the equipment back into service.
Availability (A): Availability is the long‐term average fraction of time that a component or system is in service and satisfactorily performing its intended function. This is also called steady‐state availability. Availability is defined as the mean time between failures divided by the mean time between failures plus the mean time to repair: A = MTBF / (MTBF + MTTR)
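These definitions translate directly into a few lines of code. The sketch below assumes a constant failure rate; the MTBF and MTTR figures for the UPS module are hypothetical, chosen only to show the arithmetic.

```python
import math

def reliability(failure_rate, hours):
    """R(t) = e^(-lambda * t): probability of operating without
    failure for t hours, given a constant failure rate (per hour)."""
    return math.exp(-failure_rate * hours)

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

# Hypothetical UPS module: MTBF of 100,000 h, MTTR of 8 h.
lam = 1 / 100_000  # constant failure rate, since MTBF = 1/lambda
print(f"R over one year = {reliability(lam, 8760):.4f}")
print(f"A               = {availability(100_000, 8):.6f}")
```

Note how the two measures differ: this component's one-year reliability is only about 92%, yet its availability exceeds 99.99%, because the short repair time keeps long-term downtime small.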
High reliability means that there is a high probability of good performance in a given time interval. High availability is a function of failure frequency and repair times and is a more accurate indication of data center performance.
As more and more buildings are required to deliver service guarantees, management must decide what performance is required from the facility. Availability levels of 99.999% (5.25 minutes of downtime per year) allow virtually no facility downtime for maintenance or other planned or unplanned events. Therefore, moving toward high reliability is imperative. Since the 1980s, the overall percentage of downtime events caused by facilities has grown as computers have become more reliable. And although this percentage remains small, total availability is dramatically affected, because repair times for facility events are so high. A further analysis of downtime caused by facility failures indicates that utility outages have actually declined, primarily due to the installation of standby generators.
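The relationship between an availability figure and its annual downtime budget is simple arithmetic, shown in this short sketch:

```python
def downtime_minutes_per_year(avail):
    """Annual downtime implied by a steady-state availability figure."""
    return (1 - avail) * 365 * 24 * 60

# The familiar "nines" ladder, from two nines up to five nines.
for a in (0.99, 0.999, 0.9999, 0.99999):
    print(f"{a:.3%} available -> "
          f"{downtime_minutes_per_year(a):,.1f} min of downtime/yr")
```

At five nines the budget works out to about 5.3 minutes per year, which is why a single facility repair event, typically measured in hours, can consume years' worth of the allowance.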
The most common response to these trends is reactive: that is, spending time and resources to repair the offender. If a utility goes down, install a generator. If a ground‐fault trips critical loads, redesign the distribution system. If a lightning strike burns power supplies, install a new lightning protection system. Such measures certainly make sense, as they address real risks in the data center. However, strategic planning can identify internal risks and provide a prioritized plan for reliability improvements. Planning and careful implementation will minimize disruptions while making the business case to fund these projects.
As technological advances