Root Cause Failure Analysis
DEFINITION
Root Cause Analysis (RCA) is the investigative process used to determine the underlying event(s) responsible for failure(s). Failures can involve part integrity, the proper functioning of a complete system, or the execution of an engineering process. They are most often classified as mechanical, electrical, or software in nature. The underlying “root cause” event can be associated with design, manufacturing, end-usage conditions (e.g. human factors), or other elements of a system’s design.
SITUATION
Failures prevent reliability goals from being realized. Furthermore, these failures can be unique in nature, requiring specialized tools and methodology to properly diagnose them and understand the root cause. Often, an organization’s internal resources are limited in scope and not equipped to address every failure its products may be experiencing. OPS A La Carte is continuously building resources, both internally and with our affiliates, in order to solve a broad range of problems quickly and effectively.
OBJECTIVE
The objective of RCA is to identify the root cause of all failures so that corrective action can be initiated. It is an essential element of any reliability improvement program. RCA can be pursued by making the best use of a company’s internal resources together with the specialized skills and methodology offered by OPS A La Carte. Additionally, OPS A La Carte can support a company’s efforts to incorporate elements of an RCA program (e.g. FEA, fatigue methodology, statistical data analysis, etc.) into its existing engineering capabilities.
VALUE TO YOUR ORGANIZATION
The value of OPS A La Carte’s RCA capabilities lies in supporting a company’s efforts to rectify failures and in enabling the ongoing use of RCA tools and methods, either independently or in a cooperative effort. Some specific benefits of OPS A La Carte’s RCA capabilities are listed here:
- Implement a multi-disciplinary approach capable of solving a broad range of problems quickly and effectively, saving both time and money
- Partner with over 50 other specialized firms to continue expanding our RCA expertise
- Work independently or in cooperation with other teams
- Draw on broad experience with many types of problems, gained through routine engagements
- Identify, test and verify potential design changes
- Provide independent and unbiased RCA services capable of satisfying any and all customer inquiries
- Identify the appropriate reliability metrics to measure the resulting improvements to product reliability
RELIABILITY INTEGRATION
RCA is the foundation for every effective reliability program. It is leveraged both in the development of new products and in the reliability improvement programs associated with an existing product base. The FMEA and FTA reliability modeling tools, as well as the ALT and HALT testing procedures, are all designed to identify (potential) failures. RCA is the essential follow-up activity that must be effective in order to resolve and eliminate those failures. Similarly, for existing products, the FRACAS process identifies and prioritizes field failures that ultimately require an effective RCA program. Effective RCA programs not only improve reliability performance but also reduce warranty costs, which improves an organization’s profitability.
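As an illustration of how an FTA model quantifies a potential failure, the sketch below combines basic-event probabilities through AND and OR gates. It is a minimal example, not a description of our tooling: it assumes the basic events are statistically independent, and the event names and probability values are hypothetical.

```python
from math import prod


def and_gate(probabilities):
    """Probability that all independent input events occur."""
    return prod(probabilities)


def or_gate(probabilities):
    """Probability that at least one independent input event occurs."""
    return 1.0 - prod(1.0 - p for p in probabilities)


# Hypothetical basic-event probabilities (illustrative values only).
p_fan_stalls = 0.02
p_vent_blocked = 0.05
p_thermal_cutoff_fails = 0.01

# Intermediate event: cooling is lost if the fan stalls OR the vent is blocked.
p_cooling_lost = or_gate([p_fan_stalls, p_vent_blocked])

# Top event: overheating requires cooling loss AND a failed thermal cutoff.
p_overheat = and_gate([p_cooling_lost, p_thermal_cutoff_fails])

print(f"P(cooling lost) = {p_cooling_lost:.4f}")
print(f"P(overheat)     = {p_overheat:.6f}")
```

A tree like this shows which basic events dominate the top-event probability, which in turn indicates where RCA effort is likely to matter most.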
METHODOLOGY
We use the classic 7-step process as follows:
- Define/Identify the problem.
- Analyze and gather data/evidence.
- Determine Root Cause: Ask why and identify the causal relationships associated with the defined problem. Identify which causes, if removed or changed, will prevent recurrence.
- Choose solutions and action plan: Identify effective solutions that prevent recurrence, are within your control, meet your goals and objectives, and do not cause other problems. The solution should not only address the problem at hand but also ensure that similar problems do not occur.
- Implement solution(s) and the recommendations.
- Evaluate: Observe the recommended solutions to ensure effectiveness.
- Act/report on what was learned.
Note that as part of documenting the issue and entering it into the Closed-Loop Corrective Action (CLCA) System, we must also determine the priority and relevance of the issue. These will play a part in how we approach the failure analysis. The above steps assume that the CLCA System requested that failure analysis be pursued.
EXPERIENCE
OPS A La Carte has expertise at both the component and system level in a wide variety of areas, including electrical, mechanical, materials, chemical, optical, and software. These capabilities are summarized below by engineering discipline.
Electrical
OPS A La Carte has expertise with Systems, Printed Circuit Boards, Interconnects, Parts and Die Level issues. Specifically:
- PCBs: Manufacturing defects, conductive anodic filaments (CAF), plated-through-hole fatigue, electrochemical migration, handling issues, and ESD effects.
- Interconnects: Solderability issues, overstress, intermetallic formation, and wearout failures due to stresses such as thermal cycling or vibration.
- Die-level issues: Passivation cracking, die cracking, ESD/EOS, electromigration, dielectric breakdown, hot carrier injection, and MMIC and Hybrid Processes.
During the process, various non-destructive and destructive failure analysis tools and techniques are used, including:
- Radiography
- Cross-sectioning
- Decapsulation
- Optical microscopy
- Electron microscopy
- Ion chromatography
- Surface analysis techniques such as FTIR, EDS, and XRF
- Material analysis techniques such as DSC, TMA, and TGA
- Mechanical analysis techniques such as micro-testing, bend testing, and pull testing.
Many of the failure analysis tools listed above for electrical RCA are also used in RCA of mechanical components.
Mechanical
OPS A La Carte consultants are experts in the following areas:
- Stress Analysis
- Fatigue and Fracture Mechanics
- Creep degradation
- Nonlinear finite element analysis (FEA)
- Computational fluid dynamics (CFD)
- Probabilistic evaluations
Our areas of focus include:
- Biomedical devices
- Shape memory metals
- Bioabsorbable polymers
- MEMS, electronic and miniature components
- Wireless sensors, telecommunications, and applications of nanotechnology
- Optics and lasers
We also have expertise in mechanical root cause analysis for semiconductor equipment, automotive, aerospace, petrochemical, nuclear power, wind turbine, solar, and many other industries.
Software
Software RCA has both similarities to and differences from Hardware RCA. It differs in that finding and fixing the bug (analogous to finding a hardware failure and redesigning the part) is typically much easier than in Hardware RCA, and therefore most companies do not need support at this level. It is similar in that, once the bug is fixed, new processes need to be developed (code reviews, software FMEAs and FTAs, design rules, phase containment metric tracking, etc.) to prevent this particular class of bugs from reappearing. This mirrors the corrective action portion of Hardware RCA, where we not only fix the problem and prevent that particular problem from recurring, but also fix the process that caused the problem in order to show continual improvement.
OPS A La Carte consultants are experts in:
- Software Failure Modes and Effects Analysis (FMEA)
- Software Fault Tree Analysis (FTA)
- Software Fault Tolerance
- Facilitation of Code Reviews
- Software Robustness and Coverage Testing Techniques
- Usage Profile-based Testing
- Best Practices Implementation
Our areas of focus are primarily embedded systems applications, drivers, and firmware – any applications that require high availability.
We will analyze your processes to determine what caused the bug to occur, and we will work with your team to modify the process to prevent recurrence. Some of these process changes are highlighted in our Software Reliability seminar. You can also refer to “Software Failure Analysis,” a presentation we developed for the June 2008 Applied Reliability Symposium.
CASE STUDIES
The following case studies provide example approaches. We shall tailor our approach to meet your specific situation.
Mechanical RCA: Client with a sleep-aid device had an issue with a blower assembly getting noisy over time.
Client Attempts to Fix Problem: The client tried using HALT but could not replicate the problem. The blower would fail catastrophically before showing any signs of wear. The issue would show up within 6-9 months in the field. The client tried several iterations of the design, but the problem kept reappearing.
Ops Root Cause Analysis: Using Design of Experiments (DOE), we identified the 8 variables that most affected the performance of the blower assembly and had the supplier build different versions of the blower with different settings of these variables. We then put the equipment under an accelerated temperature test and monitored acoustic noise output as well as vibration from the motor.
Tools Used: DOE, Accelerated Temperature Cycling, Acoustic instruments, Accelerometers
Resolution: We were able to identify the blower assembly with the optimal design parameters within 30 days of accelerated testing. The client made the changes and put the product into production, and it has been running for over 3 years with no issues.
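To illustrate how an 8-variable DOE can be laid out without building every combination, the sketch below generates a two-level 2^(8-4) fractional factorial design (16 builds instead of 256) and estimates main effects. This is an assumption for illustration only: the case study does not state which design was used, the factor names A to H are placeholders, and the response values are synthetic. The generators E=BCD, F=ACD, G=ABC, H=ABD are a commonly tabulated choice that keeps main effects from being aliased with two-factor interactions.

```python
from itertools import product

# Placeholder factor names; the real blower variables are not given in the text.
FACTORS = ["A", "B", "C", "D", "E", "F", "G", "H"]


def fractional_factorial_2_8_4():
    """Return the 16 runs of a two-level 2^(8-4) design as tuples of -1/+1."""
    runs = []
    for a, b, c, d in product((-1, 1), repeat=4):
        # Generators: E = BCD, F = ACD, G = ABC, H = ABD.
        runs.append((a, b, c, d, b * c * d, a * c * d, a * b * c, a * b * d))
    return runs


def main_effects(runs, responses):
    """Main effect of each factor: mean response at +1 minus mean at -1."""
    effects = {}
    for j, name in enumerate(FACTORS):
        high = [y for run, y in zip(runs, responses) if run[j] == 1]
        low = [y for run, y in zip(runs, responses) if run[j] == -1]
        effects[name] = sum(high) / len(high) - sum(low) / len(low)
    return effects


runs = fractional_factorial_2_8_4()

# Synthetic noise readings (hypothetical): only factor A influences the result.
responses = [40.0 + 3.0 * run[0] for run in runs]

effects = main_effects(runs, responses)
for name in FACTORS:
    print(f"{name}: {effects[name]:+.1f}")
```

In practice the responses would be the measured acoustic noise or vibration for each build, and the factors with the largest effects would be the ones to optimize.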
Electrical RCA: Client experiencing failed electrolytic capacitors at vastly different rates in power supplies from two different vendors
Client Attempts to Fix Problem: The client analyzed the data using various analytical techniques and concluded that there was an issue with the capacitor. They contacted us to perform an Accelerated Life Test to prove that the capacitors from Vendor B were inferior to those of Vendor A.
Ops Root Cause Analysis: First, we identified the characteristics of the two capacitors. They appeared to have very similar life characteristics. Then we took thermal measurements of the capacitors and their adjacent components. We noticed a significant difference in the case temperature of each capacitor. Through further measurements, we determined that the source of the problem was that the capacitors on Vendor B’s power supply were mounted much closer to the main transformer, which heated them.
Tools Used: Thermal Analysis, Thermal Measurements, Capacitor Lifetime Calculations
Resolution: We identified the problem as the layout of the power supply. Since our client does not have control over the layout of their vendor’s power supply, they decided to discontinue using this power supply and look for alternate sources. They now also have a test they can use as part of the qualification process when looking for a second source.
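To show why a difference in case temperature matters so much in a capacitor lifetime calculation, the sketch below applies the widely quoted rule of thumb that the life of an aluminum electrolytic capacitor roughly doubles for every 10 °C reduction in temperature. This is a simplified assumption: manufacturers’ lifetime formulas typically add ripple-current and voltage terms, and the rated life and temperatures below are hypothetical, not measurements from this case.

```python
def estimated_life_hours(rated_life_hours, rated_temp_c, actual_temp_c):
    """Rule-of-thumb estimate: life doubles for each 10 degC below rated temperature."""
    return rated_life_hours * 2.0 ** ((rated_temp_c - actual_temp_c) / 10.0)


# Hypothetical datasheet rating shared by both capacitors.
RATED_LIFE_HOURS = 2000.0
RATED_TEMP_C = 105.0

# Hypothetical temperatures: Vendor B's capacitor runs 20 degC hotter
# because it sits next to the transformer.
life_vendor_a = estimated_life_hours(RATED_LIFE_HOURS, RATED_TEMP_C, 65.0)
life_vendor_b = estimated_life_hours(RATED_LIFE_HOURS, RATED_TEMP_C, 85.0)

print(f"Vendor A estimate: {life_vendor_a:.0f} h")
print(f"Vendor B estimate: {life_vendor_b:.0f} h")
print(f"Life ratio A/B:    {life_vendor_a / life_vendor_b:.1f}")
```

Under this rule, two capacitors with identical ratings differ in expected life by a factor of four for a 20 °C difference, which is consistent with a layout-driven rather than a component-driven failure rate.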
Structural RCA: Client had a fracture of a mechanical member of a structure. Client needed to determine if it was related to design, material, manufacturing process, or potentially even the environment (high winds could have possibly added to the stress).
Client Attempts to Fix Problem: Client reviewed wind reports and various other reports and suspected that either the wind analysis was performed incorrectly or that inferior steel was used during manufacturing. This was a high-visibility issue, and the client called us in right away.
Ops Root Cause Analysis: We visited the site in Asia, took lots of photos, and took back failed pieces. We put these samples through material analysis, and we calculated the loads and reviewed the wind report.
Tools Used: Material Analysis, Finite Element Analysis, Reviewed Weather Reports.
Resolution: The observed damage pattern in conjunction with finite element analysis and structural testing indicated a very high probability of damage due to tightening during installation. We also discovered a design change that was made after the wind analysis was performed, and this change weakened the structure.
Laser RCA: Client reported finding purple residue and particles pervading the interior cavity of Hybrid Laser Modulator Drivers and Amplifiers on work in progress that had been stored in dry nitrogen for several months.
Client Attempts to Fix Problem: None.
Ops Root Cause Analysis: The purple residue was found not only on the package bottom, but also on many die surfaces and bonding wires. Several factors combined to bring this condition about. Die bonding was performed with a silver-loaded epoxy that had a propensity for separation while awaiting cure. These devices utilized an unusual assembly sequence that required two die attach and two wire bonding steps. An argon plasma clean performed immediately prior to the second die attach activated surface states in the package, increasing crawl of the epoxy from its intended sites. An aggressive argon plasma clean at the end of assembly caused silver flakes within the epoxy to become loose and exposed. The purple color was tarnish on the silver flakes. Although storage was in dry nitrogen, the parts were packaged with their paper status documents, which were held to the parts carriers with rubber bands.
Tools Used: Optical Microscopy, Scanning Electron Microscopy, Mechanical Micro Tools.
Resolution: Replace the first plasma clean with a chemical clean and a time delay, replace the silver-loaded epoxy with a more stable material from another supplier, develop a less aggressive second plasma clean process, and eliminate the paper and rubber bands (both sources of sulfur) from the storage cabinets. Only parts and sulfur-free materials were to be stored in the cabinets.
Electrical RCA: Client reported low to no output power from GaAs microwave amplifier dice after system burn-in.
Client Attempts to Fix Problem: None.
Ops Root Cause Analysis: Technicians, ignoring failure reporting procedures, had adjusted the amplifier bias jumpers for higher power output, leading to later failures. Optical Microscopy showed a burned resistor in parallel with a spiral inductor (the inductor should be a dead short, so the resistor “could not” burn) and other minor metallization anomalies. Scanning Electron Microscopy showed discontinuities in the gold metallization at bonding pads and element contact areas. Reverse Engineering determined that the die held two amplifiers that used a splitter and combiner to parallel the amplifiers. The Reverse Engineered schematic was used to show how the discontinuous metallization affected the output power on each die. The discontinuous metallization was caused by photoresist shadowing during deposition of the metal layers. The location and extent of the shadowing were a function of the position of the die on the wafer and, to a greater extent, the position of the wafer on the stationary platen in the deposition chamber. The process was thoroughly reviewed with the device manufacturer.
Tools Used: Electrical test equipment, mechanical decapsulation tools, Optical Microscopy, Scanning Electron Microscopy, Reverse Engineering, Manufacturer Interaction.
Resolution: Reeducate the technicians on their responsibilities for failure reporting. The manufacturer of the device must install process monitoring on the metallization process and either reduce the number of wafers metallized in a given run or change to a planetary evaporator. A screening procedure must be established to locate good die within the existing lots so that production can run until replacements are available.