Maintaining Mission Critical Systems in a 24/7 Environment. Peter M. Curtis


needs of Data Center environments. The RMA is an exercise that produces a system of detailed, documented processes, procedures, checks, and balances designed to minimize operator and service provider errors. The CAP practice ensures that only trained and qualified people are authorized to access critical sites. These programs, coupled with Probability Risk Assessment (PRA), address the hazards to data center uptime. The PRA looks at the probability of failure of each type of electrical power equipment. A PRA can be used to predict availability, the number of failures per year, and annual downtime. The PRA, RMA, and CAP are facilitating agents when assessing each step listed below.

       Engineering and design

       Project management

       Testing and commissioning

       Documentation

       Education and training

       Operation and maintenance

       Employee certification

       Risk indicators related to ignoring facility process management

       Standards and benchmarking
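The three figures a PRA is said to predict (availability, failures per year, and annual downtime) can be illustrated for a single repairable component. The sketch below is not taken from the book: it assumes the standard steady-state relations based on mean time between failures (MTBF) and mean time to repair (MTTR), and the function and parameter names are hypothetical.

```python
HOURS_PER_YEAR = 8760  # 365 days x 24 hours


def pra_metrics(mtbf_hours: float, mttr_hours: float) -> dict:
    """Steady-state figures for one repairable component.

    Assumes MTBF is the mean operating time between failures and
    MTTR is the mean time to restore service after a failure.
    """
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")

    cycle_hours = mtbf_hours + mttr_hours      # one run-then-repair cycle
    availability = mtbf_hours / cycle_hours    # fraction of time in service
    return {
        "availability": availability,
        "failures_per_year": HOURS_PER_YEAR / cycle_hours,
        "annual_downtime_hours": (1 - availability) * HOURS_PER_YEAR,
    }


if __name__ == "__main__":
    # Illustrative input values only, not measured equipment data.
    print(pra_metrics(mtbf_hours=8750, mttr_hours=10))
```

A full PRA combines figures like these across every type of electrical power equipment and its redundancy arrangement; this sketch covers only the single-component case.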

      Industry regulations and policies are more stringent than ever. They are heavily influenced by Basel II, the Sarbanes‐Oxley Act (SOX), NFPA 1600, and the U.S. Securities and Exchange Commission (SEC). Basel II recommends “three pillars” (risk appraisal and control, supervision of the assets, and monitoring of the financial market) to bring stability to the financial system and other critical industries. Basel II implementation involves identifying operational risk and then allocating adequate capital to cover potential loss. Enacted in 2002 in response to corporate scandals, SOX imposes the following requirements: financial statements published by issuers must be accurate (Sec 401); issuers must publish information in their annual reports on the scope and adequacy of their internal controls (Sec 404); issuers must disclose to the public, on an urgent basis, information on material changes in their financial condition or operations (Sec 409); and noncompliance is punishable by fines and/or imprisonment (Sec 802). The purpose of the NFPA 1600 Standard is to help the disaster management, emergency management, and business continuity communities cope with critical events. Keeping up with the rapid changes in technology has been a longstanding priority. The constant dilemma of meeting the required changes within an already constrained budget can become a limiting factor in achieving optimum reliability.

      1.2.1 Levels of Risk

Risk Impact and Effects of System Failure

High

It will cause an immediate interruption to the clients' critical operations, such as:

       Activity requiring a planned major utility service outage, or temporary elimination in system redundancy of the critical environment.

       Activity that would disrupt critical production operations.

       Activity that would likely result in an unplanned outage or disruption of operations, if unsuccessful.

Medium

There is time to recover without impacting the clients' critical operations, including any:

       Activity requiring a planned service outage that does not affect systems, but may impact non‐critical operations.

       Activity that involves a significant reduction in system redundancy.

       Activity that is not likely to result in an unplanned outage to the critical environment or disruption of operations, if unsuccessful.

Low

It will not interrupt operations and will have minimum potential of affecting the clients' critical operations, including:

       Activity involving systems directly supporting operations but the execution of which will be transparent to operations.

       Activity that cannot result in an unplanned outage of the critical environment or impact operations, if unsuccessful.

None

Activity not associated with the critical environment.
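The four risk levels can be read as a simple decision rule applied to a planned activity. The following is a minimal sketch of one possible encoding, assuming each activity is described by a handful of yes/no attributes. The class, field, and function names are hypothetical, and the rule is an interpretation of the table rather than a procedure given in the book.

```python
from dataclasses import dataclass
from enum import Enum


class RiskLevel(Enum):
    NONE = 0
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass
class Activity:
    # All fields are illustrative assumptions about how an activity is described.
    touches_critical_environment: bool = True
    needs_major_utility_outage: bool = False
    eliminates_redundancy: bool = False            # temporary loss of all redundancy
    disrupts_production: bool = False
    reduces_redundancy: bool = False               # significant, but not total, reduction
    needs_noncritical_outage: bool = False         # planned outage, non-critical impact only
    outage_if_unsuccessful: str = "cannot"         # "likely", "unlikely", or "cannot"


def classify(activity: Activity) -> RiskLevel:
    """Map an activity to a risk level, checking the most severe criteria first."""
    if not activity.touches_critical_environment:
        return RiskLevel.NONE
    if (activity.needs_major_utility_outage
            or activity.eliminates_redundancy
            or activity.disrupts_production
            or activity.outage_if_unsuccessful == "likely"):
        return RiskLevel.HIGH
    if (activity.reduces_redundancy
            or activity.needs_noncritical_outage
            or activity.outage_if_unsuccessful == "unlikely"):
        return RiskLevel.MEDIUM
    return RiskLevel.LOW


if __name__ == "__main__":
    print(classify(Activity(reduces_redundancy=True)))
```

Checking the criteria from most to least severe means an activity that matches several rows is assigned the highest applicable level.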

      Critical industries operate continuously, 24 hours a day, 365 days a year. Because conducting daily operations necessitates the use of new technology, more and more applications are packed into each server, and more servers are packed into a single cabinet. The growing number of servers operating 24/7 increases the need for power, cooling, and airflow. When a disaster causes the facility to experience lengthy downtime, a prepared organization is able to quickly resume normal business operations by using a predetermined recovery strategy. Strategy selection involves focusing on key risk areas and selecting a strategy for each one. Also, in an effort to boost reliability and security, the potential impacts and probabilities of these risks, as well as the costs to prevent or mitigate damages and the time to recover, should be established.
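One common way to relate a risk's probability and impact to the cost of mitigating it is an expected-loss comparison. The sketch below is an illustration under that assumption, not a method prescribed by the book. The function names and the figures in the usage example are hypothetical.

```python
def expected_annual_loss(probability_per_year: float, impact_cost: float) -> float:
    """Expected yearly loss from one risk: probability of occurrence times impact."""
    if not 0.0 <= probability_per_year <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability_per_year * impact_cost


def mitigation_is_worthwhile(p_before: float, p_after: float,
                             impact_cost: float,
                             annual_mitigation_cost: float) -> bool:
    """True when the expected loss avoided exceeds what the mitigation costs per year."""
    avoided = (expected_annual_loss(p_before, impact_cost)
               - expected_annual_loss(p_after, impact_cost))
    return avoided > annual_mitigation_cost


if __name__ == "__main__":
    # Hypothetical figures: mitigation halves the yearly probability of an outage.
    print(mitigation_is_worthwhile(0.5, 0.25, impact_cost=1000,
                                   annual_mitigation_cost=100))
```

This comparison ignores time to recover, which would need its own term (for example, loss per hour of downtime) in a fuller model.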

      One major area that necessitates strategy development is the banking and financial services industry. The absence of a strategy that guarantees recovery has an impact on employees, facilities, power, customer service, billing, and customer and public relations. All areas require a clear, well‐thought‐out strategy based on recovery time objectives, cost, profitability impact, and safety. The strategic decision is based on some of the following factors:

       The maximum allowable delay time prior to the initiation of the recovery process.

       The time frame required to execute the recovery process once it begins.

       The minimum computer configurations required to process critical applications.

       The minimum communication devices and backup circuits required for critical applications.

       The minimum space requirements for essential staff members and equipment.

       The total cost involved in the recovery process and the total loss as a result of downtime.
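Several of the factors above (delay before recovery starts, time to execute recovery, recovery cost, and downtime loss) can be combined into a simple comparison of candidate strategies. The following is a minimal sketch, assuming a single recovery time objective (RTO) in hours and a constant loss per hour of downtime. The class, field, and function names, and the figures in the usage example, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class Strategy:
    name: str
    start_delay_hours: float   # delay before the recovery process is initiated
    execution_hours: float     # time to execute recovery once it begins
    recovery_cost: float       # total cost of carrying out the recovery

    @property
    def downtime_hours(self) -> float:
        return self.start_delay_hours + self.execution_hours


def total_cost(strategy: Strategy, loss_per_hour: float) -> float:
    """Recovery cost plus the loss accumulated while operations are down."""
    return strategy.recovery_cost + strategy.downtime_hours * loss_per_hour


def choose_strategy(strategies: Iterable[Strategy], rto_hours: float,
                    loss_per_hour: float) -> Optional[Strategy]:
    """Return the cheapest strategy that restores service within the RTO, if any."""
    feasible = [s for s in strategies if s.downtime_hours <= rto_hours]
    if not feasible:
        return None
    return min(feasible, key=lambda s: total_cost(s, loss_per_hour))


if __name__ == "__main__":
    # Hypothetical candidates: A is fast but expensive, B is slow but cheap.
    a = Strategy("A", start_delay_hours=1, execution_hours=3, recovery_cost=50_000)
    b = Strategy("B", start_delay_hours=24, execution_hours=48, recovery_cost=5_000)
    best = choose_strategy([a, b], rto_hours=8, loss_per_hour=1_000)
    print(best.name if best else "no strategy meets the RTO")
```

The remaining factors (minimum computer configurations, communication circuits, and space) act as feasibility constraints and could be added as further filters before the cost comparison.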
