Evaluation of the Reliability of a Standby Redundancy System Under Real Conditions
Hedi A. Guesmi* and Sayed O. Madbouly
Department of Electrical Engineering, College of Engineering, Qassim University, Saudi Arabia
E-mail: h.guesmi@qu.edu.sa; so.ossman@qu.edu.sa
*Corresponding Author
Received 13 February 2024; Accepted 11 April 2024
This paper presents a novel analysis of a standby redundancy system under real conditions. The study considers the presence of a real (non-ideal) commutator and failure detector in the system. Through a comprehensive failure mode analysis, mathematical relationships between the characteristics of the different modules are established. The results of the investigation allow manufacturers to evaluate the Mean Time To Failure (MTTF) of the system during the design phase and to make informed decisions regarding the selection of failure rates for the detector and commutator. Overall, this work contributes to the effective operation of standby redundancy systems in practical applications.
Keywords: Reliability, standby redundancy, mean time to failure.
SR | : | Standby Redundancy |
PD | : | Principal device |
SD | : | Standby device |
MTTF | : | Mean Time To Failure |
$T$ | : | Mean Time To Failure of PD and SD |
$R(t)$ | : | Survivor function of PD and SD |
$f(t)$ | : | Failure density function of PD and SD |
$F(t)$ | : | Failure distribution function of PD and SD |
$\lambda$ | : | Failure rate of PD and SD |
DF | : | Detector of Failure |
$F_d(t)$ | : | Failure distribution function of detector |
$\lambda_d$ | : | Failure rate of detector |
$R_c$ | : | Reliability of commutator |
$\lambda_c$ | : | Failure rate of commutator |
$\cap$ | : | Symbol of events intersection |
$\overline{C}$ | : | Logical complement of C |
SRC | : | Synthetic Reliability of Commutator |
Reliability is a fundamental concept in engineering and refers to the ability of a system, product, or component to perform its intended function without failure over a specified period of time and under specified conditions. It is a critical factor in ensuring the dependability and trustworthiness of various systems, ranging from everyday consumer products to complex industrial systems. Reliability is essential because failures can lead to various undesirable consequences, including financial losses, damage to reputation, safety hazards, and disruptions to operations. Therefore, engineers and designers strive to develop reliable systems that meet the required performance standards and maintain their functionality over their intended lifespan [1–3].
Reliability consideration is very much a part of the design stage of a system. However, it is beneficial to first consider the main ways that the reliability of a system can be modified. There are two main ways in which reliability can be affected. The first relates to quality, and the second to redundancy [4–7].
Redundancy involves incorporating duplicate components or systems within a larger system to provide backup or alternative pathways in case of failure. Redundancy can improve system reliability by reducing the likelihood and impact of failures. Redundancy is of two types: active and standby [8–10].
Active redundancy involves the simultaneous operation of redundant components or systems, where they share the load and provide immediate backup in case of failure. Active redundancy can help maintain system functionality without interruption [10–14].
Standby redundancy, also known as cold redundancy, involves redundant components or systems that remain on standby and are activated, through detection and switching mechanisms, only when the primary component or system fails [15–18].
The detection mechanism in the standby redundancy is responsible for monitoring the operational status of the primary component and identifying when it fails. It can employ various techniques, such as sensors, monitoring circuits, or feedback signals. However, the detection mechanism may not always provide immediate or foolproof detection, and there can be instances of false positives or false negatives [19–21].
The switching mechanism in the standby redundancy is responsible for activating the redundant component upon detection of a failure in the primary component. It can be mechanical, electrical, or software-based. The switching process may introduce a certain delay or transient period during the switchover, which can impact the system’s reliability [8, 9].
In standby redundancy, detection of the failed component and activation of the redundant one are both performed by a detection and switching mechanism that is not necessarily perfect. Such a mechanism can introduce some degree of uncertainty or delay in detecting the failure and switching to the redundant component, which can impact the overall reliability of the standby redundancy system [9, 22, 23].
The reliability of the detection and switching mechanism is a critical aspect of standby redundancy systems. The probability of correctly detecting a failure and switching to the redundant component when needed is a key factor in determining the overall reliability of the system. This reliability can be influenced by factors such as the design and quality of the detection and switching mechanisms, the effectiveness of monitoring and feedback systems, and the response time of the switching process [20, 24].
Considerable work has been done on reliability and, in particular, on standby redundancy. Sharifi et al. [7] derived a formula for the reliability of a system with an active redundancy strategy using a Markov process when the component lifetimes follow a Weibull distribution. Bhatti et al. [5] analyzed active standby redundancy when both similar units are initially observed to be operative. Gao and Chen [8] studied the stability of a standby system with an unreliable server and switching failure, where both the time-to-repair of units and the time-to-repair of the server follow general distributions. Yang and Wu [9] evaluated the availability and reliability of a repairable system containing warm standby components and subject to switching failure. Hu et al. [17] studied active redundancy under the assumption that the switch is completely reliable. Wang et al. [20] analyzed the case in which the primary and standby virtual machines are both assumed to be unreliable, with warm standbys and switching failure. Hsu et al. [25] analyzed the characteristics of a redundant repairable system when switching to standby fails, with times to failure and times to repair of the operating units assumed to follow exponential distributions.
Most of the studies mentioned above, as well as [9, 10, 19, 23, 25], address the reliability of standby redundancy systems. They rest on purely theoretical considerations and assume that the switch and the failure detector are ideal [24, 26–29]. In reality, these two devices can fail either in the standby phase or during the switching phase.
On the other hand, the assumption of a constant failure rate (exponential distribution) implies that the instantaneous failure rate remains constant over time and does not depend on how long the device has already been operating. This simplifies the calculations of reliability and availability for the active redundancy system, as the exponential distribution has well-defined mathematical properties [5, 8, 15]. The present paper focuses on standby redundancy in real-world conditions, where the commutator and the failure detector are not perfect but are each characterized by a known, constant failure rate during their operation.
In general, a standby redundancy system is composed of a primary device (PD), a standby device (SD), a commutator (C), and a failure detector, as shown in Figure 1.
Figure 1 Structure of standby redundancy system.
The primary device (PD) and the standby device (SD) are both characterized by a constant failure rate $\lambda$. The commutator is characterized by a constant switching probability $R_c$ throughout the entire operating period, while the failure detector is characterized by a constant failure rate $\lambda_d$.
The analysis of failure modes in this system leads us to distinguish the following three events, as indicated in Figure 2, where:
– the symbol $\cap$ denotes the intersection of events, and
– the symbol $\overline{X}$ represents the logical complement of event X.
Figure 2 Failure modes of the considered system.
Event 1
• The principal device fails within the time interval $(t, t + dt)$,
• The failure detector fails to detect the failure of the principal device.
Therefore, the probability of this event is:
$$P(E_1) = f(t)\,F_d(t)\,dt \tag{1}$$
Event 2
• The principal device has failed within the time interval $(t, t + dt)$,
• The failure detector detects the failure.
• However, the commutator is not working.
In this case, the probability of this event is:
$$P(E_2) = f(t)\,[1 - F_d(t)]\,(1 - R_c)\,dt \tag{2}$$
Event 3
• The principal device has failed within the time interval $(\tau, \tau + d\tau)$ with $\tau < t$.
• The failure detector detects the failure.
• The commutator is working.
• However, the standby device is not operational, having failed during the standby phase.
In this case, the probability of this event is:
$$P(E_3) = R_c \left[ \int_0^t f(\tau)\,[1 - F_d(\tau)]\,f(t - \tau)\,d\tau \right] dt \tag{3}$$
Since the three previous events are mutually exclusive, their probabilities add, and the failure density function of the standby redundancy system is given by:
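$$f_{AR}(t) = f(t)\,F_d(t) + f(t)\,[1 - F_d(t)]\,(1 - R_c) + R_c \int_0^t f(\tau)\,[1 - F_d(\tau)]\,f(t - \tau)\,d\tau \tag{4}$$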
Let us now assume that the failures of the primary device and the standby device follow the exponential distribution with a constant failure rate $\lambda$, so that their survivor function is:
$$R(t) = e^{-\lambda t} \tag{5}$$
and the failure density function of PD and SD is:
$$f(t) = \lambda e^{-\lambda t} \tag{6}$$
where $\lambda$ is the constant failure rate for both components (PD and SD).
Let us also assume that the failure distribution function of the failure detector, denoted as $F_d(t)$, follows the exponential distribution with a constant failure rate $\lambda_d$. In this case, the failure distribution function for the failure detector can be expressed as:
$$F_d(t) = 1 - e^{-\lambda_d t} \tag{7}$$
where $\lambda_d$ is the failure rate of the failure detector.
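Consider also that the switch is characterized by a constant switching probability $R_c$. In this case, the failure density function of the considered system can be expressed as follows:
$$f_{AR}(t) = \lambda e^{-\lambda t}\left(1 - e^{-\lambda_d t}\right) + (1 - R_c)\,\lambda e^{-(\lambda + \lambda_d) t} + R_c\,\frac{\lambda^2}{\lambda_d}\,e^{-\lambda t}\left(1 - e^{-\lambda_d t}\right) \tag{8}$$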
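After integrating expression (8) and simplifying, the reliability function of this system can be expressed by the following Equation (9).
$$R_{AR}(t) = e^{-\lambda t}\left[1 + R_c\,\frac{\lambda}{\lambda_d}\left(1 - e^{-\lambda_d t}\right)\right] \tag{9}$$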
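The integration step can be checked numerically. The sketch below is only illustrative and rests on assumptions that are stated here rather than taken verbatim from the paper: PD and SD are exponential with rate `lam`, the detector is exponential with rate `lam_d` (strictly positive), the switch succeeds with constant probability `r_c`, and the standby device cannot fail before it is switched on (cold standby). The function names and the numerical values are hypothetical.

```python
import math


def system_density(t, lam, lam_d, r_c):
    """Failure density of the system as the sum of the three failure modes."""
    f_pd = lam * math.exp(-lam * t)            # density of PD failing at t
    det_ok = math.exp(-lam_d * t)              # detector still working at t
    undetected = f_pd * (1.0 - det_ok)         # Event 1: failure is missed
    no_switch = f_pd * det_ok * (1.0 - r_c)    # Event 2: commutator does not switch
    # Event 3: PD fails at tau < t, is detected and switched, SD fails at t.
    # The convolution integral is written here in its closed form.
    sd_fails = r_c * lam**2 * math.exp(-lam * t) * (1.0 - det_ok) / lam_d
    return undetected + no_switch + sd_fails


def system_reliability(t, lam, lam_d, r_c):
    """Closed-form reliability that follows from the assumptions above."""
    k = lam / lam_d
    return math.exp(-lam * t) * (1.0 + k * r_c * (1.0 - math.exp(-lam_d * t)))


def reliability_by_integration(t, lam, lam_d, r_c, steps=20000):
    """One minus the trapezoidal integral of the density over [0, t]."""
    h = t / steps
    total = 0.5 * (system_density(0.0, lam, lam_d, r_c)
                   + system_density(t, lam, lam_d, r_c))
    for i in range(1, steps):
        total += system_density(i * h, lam, lam_d, r_c)
    return 1.0 - total * h


if __name__ == "__main__":
    lam, lam_d, r_c = 1e-3, 1e-5, 0.95   # hypothetical rates (per hour)
    for t in (100.0, 1000.0, 3000.0):
        print(t,
              system_reliability(t, lam, lam_d, r_c),
              reliability_by_integration(t, lam, lam_d, r_c))
```

Under these assumptions the two columns printed for each `t` should agree closely, which is a quick sanity check that the closed form and the sum of the three failure modes describe the same system.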
This last relationship is extremely important from a practical point of view, especially when evaluating the level of reliability during the design phase of a strategic system where the consequences of a failure can be catastrophic. The role of the manufacturer and designer of such a system is to set an upper bound on the failure rate, particularly for the failure detector, so that the considered system meets its reliability requirements.
Generally, these requirements are expressed in two forms:
– Either in terms of the mean time to failure MTTF or,
– In terms of the probability of functioning properly until a specific time t.
In this paper, our focus is limited to the case where the reliability requirement is expressed specifically in terms of MTTF. Equation (10) provides the expression for the mean time to failure of the standby redundancy system, $T_{AR}$.
$$T_{AR} = \int_0^{\infty} R_{AR}(t)\,dt \tag{10}$$
Using Equation (9), and after integration and simplification, this can be written as:
$$T_{AR} = T\left[1 + R_c\,\frac{k}{1 + k}\right] \tag{11}$$
where $T = 1/\lambda$ represents the MTTF of the primary device, and $k = \lambda/\lambda_d$.
The normalized mean time $T_{AR}/T$ of the standby redundancy system, as a function of the ratio $k = \lambda/\lambda_d$ of the failure rate of the primary device to the failure rate of the failure detector, with the commutator reliability $R_c$ as a parameter, is presented in Figure 3.
Figure 3 Normalized average time characteristics $T_{AR}/T$ as a function of $\log k$, where $k = \lambda/\lambda_d$, with $R_c$ as a parameter.
The family of curves in Figure 3 is characterized by the existence of a certain level beyond which the reduction in the failure rate of the detector compared to the failure rate of the principal device has no significant influence on improving the mean time to failure of the standby redundancy system. In other words, for a given level of $R_c$, the detector of failure can be considered ideal when its failure rate $\lambda_d$ is approximately 100 times smaller than the failure rate $\lambda$ of the primary device.
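The saturation described above can be illustrated with a few numbers. The following is a minimal sketch, assuming the normalized MTTF takes the form $1 + R_c\,k/(1+k)$ with $k = \lambda/\lambda_d$, which is what exponential lifetimes, a constant switching probability, and a cold standby device imply; the function name and the chosen values of `k` and `r_c` are hypothetical.

```python
def normalized_mttf(k, r_c):
    """Normalized MTTF for detector ratio k = lam / lam_d and switch probability r_c."""
    return 1.0 + r_c * k / (1.0 + k)


if __name__ == "__main__":
    for r_c in (0.8, 0.9, 1.0):
        ideal = 1.0 + r_c   # limit for a perfect detector (k -> infinity)
        for k in (0.1, 1.0, 10.0, 100.0, 1000.0):
            ratio = normalized_mttf(k, r_c)
            share = 100.0 * ratio / ideal
            print(f"R_c={r_c:.1f}  k={k:7.1f}  T_AR/T={ratio:.4f}  "
                  f"({share:.1f}% of the ideal-detector value)")
```

In this sketch the gain from increasing `k` beyond about 100 is below one percent of the ideal-detector value for every `r_c` tried, which is consistent with the plateau of the curves.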
In practice, the reliability of the commutator $R_c$ is often not constant but decreases over time during system operation. According to the principle of on-demand operation of the commutator, the random variable t, which determines the lifetime of PD, also represents the waiting time for operation: at time t, PD is switched off and SD is switched on with a probability equal to:
$$R_c(t) = e^{-\lambda_c t} \tag{12}$$
where $\lambda_c$ represents the failure rate of the commutator.
This paper presents a novel indicator called Synthetic Reliability of Commutator (SRC), represented as $\overline{R}_c$. It is defined as the average probability of correct operation of the commutator on demand over all possible operating times until the failure of PD:
$$\overline{R}_c = \int_0^{\infty} R_c(t)\,f(t)\,dt \tag{13}$$
By substituting Equations (6) and (12) into expression (13) and integrating, we obtain:
$$\overline{R}_c = \frac{\lambda}{\lambda + \lambda_c} = \frac{1}{1 + \lambda_c/\lambda} \tag{14}$$
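The averaging that defines the SRC can be verified numerically. The sketch below assumes an exponential PD lifetime with rate `lam` and a commutator whose on-demand reliability decays as `exp(-lam_c * t)`; under these assumptions the integral has the closed form `lam / (lam + lam_c)`. The function names, the truncation horizon, and the step count are illustrative choices, not part of the paper.

```python
import math


def src_closed_form(lam, lam_c):
    """SRC for an exponential PD lifetime and an exponentially decaying switch."""
    return lam / (lam + lam_c)


def src_numeric(lam, lam_c, steps=100000, horizon=40.0):
    """Trapezoidal estimate of the SRC integral, truncated at horizon / lam."""
    upper = horizon / lam
    h = upper / steps

    def integrand(t):
        # probability the switch works at t, weighted by the PD failure density
        return math.exp(-lam_c * t) * lam * math.exp(-lam * t)

    total = 0.5 * (integrand(0.0) + integrand(upper))
    for i in range(1, steps):
        total += integrand(i * h)
    return total * h


if __name__ == "__main__":
    lam = 1e-3   # hypothetical PD failure rate (per hour)
    for ratio in (0.0, 0.01, 0.1, 1.0):
        lam_c = ratio * lam
        print(ratio, src_closed_form(lam, lam_c), src_numeric(lam, lam_c))
```

The result depends only on the ratio `lam_c / lam`, so rescaling both rates by the same factor leaves the SRC unchanged.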
The derived expression for $\overline{R}_c$, presented as a function of the ratio $\lambda_c/\lambda$ of the commutator failure rate to the failure rate of the standby device, has an interesting physical interpretation.
Since
$$\lim_{\lambda_c/\lambda \to 0} \overline{R}_c = 1 \tag{15}$$
So, a real commutator approaches ideal behavior when:
– The failure rate of the commutator is equal to zero, $\lambda_c = 0$ (i)
– The failure rate of the commutator is relatively small compared to the failure rate of the standby device, $\lambda_c \ll \lambda$ (ii).
The condition (i), mentioned in [1–3], is commonly considered as the criterion for an ideal commutator.
On the other hand, condition (ii) holds special significance in practice, as it indicates that the ideality of a commutator is a relative concept that depends on the mutual relationship between the commutator failure rate $\lambda_c$ and the failure rate of the standby device $\lambda$.
By transforming the relationship (14), we obtain the failure rate of the commutator:
$$\lambda_c = \lambda\,\frac{1 - \overline{R}_c}{\overline{R}_c} \tag{16}$$
So far, the exact relationship between $\lambda_c$ and $\lambda$ is not fully determined: as Equation (16) indicates, the ratio $\lambda_c/\lambda$ depends on the reliability of the commutator $\overline{R}_c$. On the other hand, $\overline{R}_c$ cannot take an arbitrary value but is determined by specific reliability requirements.
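Once a required value of the synthetic reliability is fixed, the admissible commutator failure rate follows directly. The sketch below assumes the relation SRC = `lam / (lam + lam_c)` for exponential lifetimes and simply inverts it; the function name and the numerical targets are hypothetical.

```python
def commutator_rate_for_target(lam, src_target):
    """Largest commutator failure rate that still achieves the target SRC."""
    if not 0.0 < src_target <= 1.0:
        raise ValueError("src_target must lie in (0, 1]")
    return lam * (1.0 - src_target) / src_target


if __name__ == "__main__":
    lam = 1e-4   # hypothetical failure rate of PD and SD (per hour)
    for target in (0.90, 0.99, 0.999):
        lam_c = commutator_rate_for_target(lam, target)
        print(f"target SRC={target}: lam_c <= {lam_c:.3e} per hour "
              f"(lam_c / lam = {lam_c / lam:.4f})")
```

A design choice worth noting is that the function returns an upper bound: any commutator with a smaller failure rate than the returned value meets the target with margin.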
In fact, the general shape of the failure rate function of any technical device follows a bathtub curve, which can be primarily divided into three different intervals:
– Infant mortality interval corresponds to failure of “weak” items,
– Useful life interval corresponds to externally induced failures and
– Wear-out interval corresponds to wear out failures of “good” items.
Figure 4 Normalized bathtub curve: typical shape of the failure rate.
The period in which the failure rate starts to increase is known as the wear-out period. Users of such systems can mitigate the increase in failure rate through systematic controls and the implementation of regular preventive maintenance policies. By doing so, they can prolong the useful life of the system and maintain a relatively constant failure rate.
The period of infant mortality is characterized by a decreasing hazard rate. In most cases, this decrease is rapid and stabilizes to a relatively constant failure rate. This is often observed in electronic components and systems. Manufacturers of electronic systems typically eliminate this rapid decrease in failure rate by subjecting the components to rigorous testing involving extreme conditions of temperature, pressure, humidity, and vibrations. This process helps eliminate weak components before they reach the market, resulting in products that exhibit a relatively constant failure rate over a long period of operation.
On the other hand, obtaining reliability characteristics for standby redundancy systems presents analytical challenges (see Equations (4), (9), (10), and (13)), especially when the constituent components have failure rates that deviate from the exponential model and instead follow the Weibull distribution with a shape parameter “b”.
$$R(t) = e^{-(\lambda t)^{b}} \tag{17}$$
A value of $b < 1$ corresponds to a decreasing failure rate, a value of $b > 1$ corresponds to an increasing failure rate, and $b = 1$ corresponds to a constant failure rate.
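The three regimes of the shape parameter can be made concrete with a short sketch. It assumes one common parameterization of the Weibull model, with survivor function `exp(-(lam * t) ** b)` and hence hazard rate `b * lam * (lam * t) ** (b - 1)`; the paper's exact parameterization may differ, and the function name and sample values are hypothetical.

```python
def weibull_hazard(t, b, lam):
    """Hazard rate of a Weibull lifetime with shape b and rate-like parameter lam (t > 0)."""
    return b * lam * (lam * t) ** (b - 1.0)


if __name__ == "__main__":
    lam = 1e-3   # hypothetical rate-like parameter (per hour)
    times = (10.0, 100.0, 1000.0, 10000.0)
    for b, label in ((0.5, "infant mortality"), (1.0, "useful life"), (2.0, "wear-out")):
        rates = ", ".join(f"{weibull_hazard(t, b, lam):.2e}" for t in times)
        print(f"b={b} ({label}): {rates}")
```

With `b = 1` the hazard reduces to the constant `lam`, which recovers the exponential model used for the closed-form results.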
At this stage, we can consider the principal device and the standby device as obeying the Weibull distribution, while the switch and failure detector follow the exponential distribution. If this is not the case, analytical resolution of the standby redundancy problem becomes infeasible. In such situations, statistical simulation methods like Monte Carlo simulation can be employed.
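A Monte Carlo estimate of the MTTF can be sketched as follows. This is an illustrative model rather than the authors' procedure, and it assumes: Weibull lifetimes for PD and SD with the same shape and scale, an exponential detector lifetime, a constant switching probability `r_c`, a cold standby device, and a missed detection whenever the detector has failed before PD does. All names and parameter values are hypothetical.

```python
import random


def simulate_mttf(shape_b, scale, lam_d, r_c, runs=200000, seed=1):
    """Monte Carlo estimate of the system MTTF under the stated assumptions."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(runs):
        t_pd = rng.weibullvariate(scale, shape_b)   # lifetime of the principal device
        t_det = rng.expovariate(lam_d)              # lifetime of the failure detector
        life = t_pd
        # SD takes over only if the detector is still alive when PD fails
        # and the commutator switches successfully on demand.
        if t_det > t_pd and rng.random() < r_c:
            life += rng.weibullvariate(scale, shape_b)   # SD lifetime after switch-on
        total += life
    return total / runs


if __name__ == "__main__":
    # shape 1.0 reduces to the exponential case with lam = 1 / scale
    print("b = 1.0:", simulate_mttf(1.0, 1000.0, 1e-4, 0.9))
    print("b = 2.0:", simulate_mttf(2.0, 1000.0, 1e-4, 0.9))
```

For `shape_b = 1.0` the estimate can be compared with the exponential closed form `T * (1 + r_c * k / (1 + k))`, where `T = scale` and `k = 1 / (scale * lam_d)`, which gives a simple check of the simulation before it is used for other shape values.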
This article presents a comprehensive investigation into the implications of non-ideal components within standby redundancy systems. The study focuses on the challenges encountered during the standby and switching phases, including switching delays, transient periods, switching errors, false alarms, and missed failures.
In terms of reliability requirements expressed as the mean time to failure, the research findings highlight the significance of parameter selection for the failure detector and commutator in achieving optimal system performance. Specifically, a failure detector with a failure rate ($\lambda_d$) substantially lower than the primary device failure rate ($\lambda$) is essential. Empirically, a detector failure rate at least 100 times smaller than that of the primary device is recommended for the standby redundancy system to approach ideal behavior.
Furthermore, the concept of Synthetic Reliability of the Commutator is introduced as a metric for evaluating the performance of real commutators. The analysis reveals that a commutator can be considered to operate with near-perfection when the ratio of the commutator failure rate ($\lambda_c$) to the principal device failure rate ($\lambda$) is significantly smaller than 1.
Overall, these research findings contribute to a deeper understanding of the implications of non-ideal components in standby redundancy systems. The insights gained from this study provide valuable guidelines for parameter selection and introduce a novel metric to assess the performance of commutators in real-world scenarios.
As future work, we can consider studying the general case where the failure rate of the switch, the failure detector, and the primary and standby devices is not constant but follows the Weibull model. This model, with its shape parameter “b”, makes it possible to distinguish between the infant mortality period ($b < 1$) and the wear-out period ($b > 1$). This approach is close to reality and could extend this paper to the general case where the failure rate of every device in the redundancy system is no longer constant but varies over time.
[1] Gregory Levitin, Liudong Xing, Yuanshun Dai, “Reliability Versus Expected Mission Cost and Uncompleted Work in Heterogeneous Warm Standby Multiphase Systems”, IEEE Transactions on Systems, Man, and Cybernetics: Systems, vol. 47, no. 3, pp. 462–473, 2017.
[2] P. O’Connor, “Practical Reliability Engineering”, Wiley, 1986.
[3] R. Billinton, R. N. Allan, “Reliability Evaluation of Engineering Systems”, Plenum Press, 1983.
[4] Julian Salomon, Niklas Winnewisser, Pengfei Wei, Matteo Broggi, Michael Beer, Efficient reliability analysis of complex systems in consideration of imprecision, Reliability Engineering & System Safety, Volume 216, 2021.
[5] Bhatti, Jasdev, Mohit Kumar Kakkar, Nitin Bhardwaj, Manpreet Kaur, and G. Deepika. “Reliability analysis to industrial active standby redundant system.” Malaysian Journal of Science (2020): 74–84.
[6] Martyushev, Nikita V., Boris V. Malozyomov, Svetlana N. Sorokova, Egor A. Efremenkov, Denis V. Valuev, and Mengxu Qi. “Review models and methods for determining and predicting the reliability of technical systems and transport.” Mathematics 11, no. 15 (2023): 3317.
[7] Sharifi, Mani, Pedram Pourkarim Guilani, Arash Zaretalab, and Abdolreza Abhari. “Reliability evaluation of a system with active redundancy strategy and load-sharing time-dependent failure rate components using Markov process.” Communications in Statistics-Theory and Methods 52, no. 13 (2023): 4514–4533.
[8] Chao Gao, Xing-Min Chen, “Stability analysis of a standby system with an unreliable server and switching failure”, IMA Journal of Applied Mathematics, 2022.
[9] Dong-Yuh Yang, Chia-Huang Wu, “Evaluation of the availability and reliability of a standby repairable system incorporating imperfect switchovers and working breakdowns”, Reliability Engineering & System Safety, vol. 207, pp. 107366, 2021.
[10] Gao, Chao, Yongjin Guo, Mingjun Zhong, Xiaofeng Liang, Hongdong Wang, and Hong Yi. “Reliability analysis based on dynamic Bayesian networks: A case study of an unmanned surface vessel.” Ocean Engineering 240 (2021): 109970.
[11] Oszczypała, Mateusz, Jakub Konwerski, Jarosław Ziółkowski, and Jerzy Małachowski. “Reliability analysis and redundancy optimization of k-out-of-n systems with random variable k using continuous time Markov chain and Monte Carlo simulation.” Reliability Engineering & System Safety 242 (2024): 109780.
[12] Guo, Linhan, Ruiyang Li, Yu Wang, Jun Yang, Yu Liu, Yiming Chen, and Jianguo Zhang. “Availability for multi-component k-out-of-n: G warm-standby system in series with shut-off rule of suspended animation.” Reliability Engineering & System Safety 233 (2023): 109106.
[13] Dui, Hongyan, Huiting Xu, and Yun-An Zhang. “Reliability Analysis and Redundancy Optimization of a Command Post Phased-Mission System.” Mathematics 10, no. 22 (2022): 4180.
[14] Ma, Xiaoyang, Bin Liu, Li Yang, Rui Peng, and Xiaodong Zhang. “Reliability analysis and condition-based maintenance optimization for a warm standby cooling system.” Reliability Engineering & System Safety 193 (2020): 106588.
[15] Mohamed Kayid, Mashael A. Alshehri, “Stochastic Comparisons of Lifetimes of Used Standby Systems”, Mathematics, vol. 11, no. 14, pp. 3042, 2023.
[16] Malhotra, Reetu. “Reliability and availability analysis of a standby system with activation time and varying demand.” In Engineering Reliability and Risk Assessment, pp. 35–51. Elsevier, 2023.
[17] Hu, Linmin, Zhuoxin Bai, Xiangfeng Yang, and Mingjia Li. “Reliability modeling and evaluation of uncertain random cold standby k-out-of-m+ n: G systems.” Journal of Ambient Intelligence and Humanized Computing 14, no. 10 (2023): 13833–13846.
[18] Chandra Shekhar, Amit Kumar, Shreekant Varshney, Sherif I. Ammar, “Fault-tolerant redundant repairable system with different failures and delays”, Engineering Computations, vol. ahead-of-print, no. ahead-of-print, 2019.
[19] Ze Wang, Ying Chen, Weiyang Men, “Failure Behavior Analysis of 1-Out-of-N Hot-Standby System with Imperfect Switch”, 2018 12th International Conference on Reliability, Maintainability, and Safety (ICRMS), pp. 355–362, 2018.
[20] Wang, Kuo-Hsiung, Chia-Huang Wu, and Tseng-Chang Yen. “Reliability analysis of redundant retrial machining system subject to standby switching failure.” Quality Technology & Quantitative Management 20, no. 5 (2023): 561–576.
[21] Ruiz, Cesar, Edward A. Pohl, and Haitao Liao. “Selective maintenance modeling and analysis of a complex system with dependent failure modes.” Quality Engineering 32, no. 3, 2020.
[22] Q. Qiu, L. Cui, and D. Kong, “Availability and maintenance modeling for a two-component system with dependent failures over a finite time horizon,” Proceedings of the Institution of Mechanical Engineers, Part O: Journal of Risk and Reliability, vol. 233, no. 2, pp. 200–210, 2019.
[23] Wu, Hao, Yanwen Xu, Zheng Liu, and Pingfeng Wang. “Mean Time to Failure Prediction for Complex Systems With Adaptive Surrogate Modeling.” In International Design Engineering Technical Conferences and Computers and Information in Engineering Conference, vol. 87301, p. V03AT03A051. American Society of Mechanical Engineers, 2023.
[24] Yang, Chen, Wanze Lu, and Yuanqing Xia. “Reliability-constrained optimal attitude-vibration control for rigid-flexible coupling satellite using interval dimension-wise analysis.” Reliability Engineering & System Safety 237, 2023.
[25] Ying-Lin Hsu, Jau-Chuan Ke, Ssu-Lang Lee, “On a redundant repairable system with switching failure: Bayesian approach”, Journal of Statistical Computation and Simulation, vol. 78, no. 12, pp. 1163, 2008.
[26] Wang, Chaonan, Xiaolei Wang, Liudong Xing, Quanlong Guan, Chunhui Yang, and Min Yu. “Efficient reliability approximation for large k-out-of-n cold standby systems with position-dependent component lifetime distributions.” Reliability Engineering & System Safety 240 (2023): 109548.
[27] Shaoxuan Wang, Yuantao Yao, Daochuan Ge, Zhixian Lin, Jie Wu, Jie Yu, “Reliability evaluation of standby redundant systems based on the survival signatures methods” Reliability Engineering & System Safety, Volume 239, 2023, 109509, ISSN 0951-8320.
[28] Malhotra, R., Alamri, F.S., Khalifa, H.A.E.-W. Novel Analysis between Two-Unit Hot and Cold Standby Redundant Systems with Varied Demand. Symmetry 2023, 15, 1220.
[29] Monika Manglik, Mangey Ram, “Reliability analysis of a two unit cold standby system using Markov process”, Journal of Reliability and Statistical Studies, ISSN (Print): 0974-8024, (Online): 2229-5666, Vol. 6, Issue 2 (2013): 65–80.
Hedi A. Guesmi received the B.Sc./M.Sc. and Ph.D. degrees in Biomedical Engineering from the Technical University of Gdansk, Faculty of Electronics, Poland, in 1990 and 1994, respectively. From 2002 to 2003, he was an Assistant Professor in the Biomedical Engineering Department, Higher Institute of Medical Technologies (ISTMT), Tunis, Tunisia. From 2003 to 2007, he was an Assistant Professor in the Electronics Department, Higher Institute of Applied Sciences and Technology (ISSAT), Mateur, Tunisia. From 2007 to 2018, he was an Assistant Professor in the Department of Medical Equipment Technology, College of Applied Medical Sciences, Majmaah University, Saudi Arabia. From 2018 to 2021, he was an Assistant Professor in the Biomedical Department, College of Applied Health Sciences in Arrass, Qassim University, Saudi Arabia. Since 2021, he has been an Assistant Professor in the Electrical Engineering Department, College of Engineering, Qassim University, Saudi Arabia. His major research interests include reliability, safety, biomedical engineering and electronics.
Sayed O. Madbouly received the B.Sc., M.Sc., and Ph.D. degrees in Electrical Engineering from Ain Shams University, Egypt, in 1998, 2004, and 2010, respectively. From 2010 to 2013, he was an Assistant Professor in the Electrical Power and Machines Department, Faculty of Engineering, Ain Shams University, Egypt. Since 2013, he has been an Assistant Professor in the Electrical Engineering Department, College of Engineering, Qassim University, Saudi Arabia. His major research interests include electrical machines, renewable energy, fuzzy logic control and vector control of electrical machines.
Journal of Reliability and Statistical Studies, Vol. 16, Issue 2 (2023), 357–372.
doi: 10.13052/jrss0974-8024.1629
© 2024 River Publishers