This is an unusual accident case study for several reasons, each of which bears further examination: (1) There was no pilot error; in fact, it was the pilot who saved the day. (2) The accident was caused by a design flaw in the aircraft itself. (3) The accident report got it wrong.

— James Albright

Updated: 2021-09-12

Image: Captain Kevin Sullivan, from The Sydney Morning Herald

This case study provides lessons on several fronts. First, it shows how a pilot with a level head and the right experience base can turn a potentially disastrous circumstance into a safe emergency landing.

Secondly, it further illustrates how computer-driven aircraft are susceptible to software glitches and that pilots should be given the tools to quickly diagnose, and ultimately overrule, those computers. (The classic "Airbus versus Boeing" debate.)

Finally, it shows how an accident board can be corrupted into ignoring root causes for political expedience. On that last point, the Australian Transport Safety Bureau has an exceptional record and I've always admired their reports. But this is the first I've read from them dealing with Qantas and I suspect there was pressure to be kind to Airbus, for unknown political reasons.

1 — Accident report

2 — Narrative

3 — Analysis

4 — Cause

5 — Postscript

I mentioned that I think the Australian Transport Safety Bureau got this one wrong, but they got it right when it came to Captain Kevin Sullivan. He was the right man for this job. This is from a story in The Sydney Morning Herald, which picks up aboard Qantas Flight 72, minutes after the aircraft's autopilot disconnects, as the stall and overspeed warnings go off simultaneously and a loud crashing sound tears through the cabin:

  • In the cockpit, Sullivan instinctively grabs the control stick the moment he feels the plane's nose pitch down violently at 12.42pm (Western Australia time). The former US Navy fighter pilot pulls back on the stick to thwart the jet's rapid descent, bracing himself against an instrument panel shade. Nothing happens. So he lets go. Pulling back on the stick does not halt the plunge. If the plane suddenly returns control, pulling back might worsen their situation by pitching the nose up and causing a dangerous stall.
  • Within two seconds, the plane dives 150 feet. In a gut-wrenching moment, all the two pilots can see through the cockpit window is the blue of the Indian Ocean. "Is my life going to end here today?" Sullivan asks himself. His heart is thumping. Those on board QF72 are in dire trouble. There are no ejection seats like the combat jets Sullivan flew in the US Navy. He has no control over this plane.
  • "It's the worst thing that can happen when you are in an aeroplane – when you are not in control," he recalls. "And you have a choice. You can either succumb to that or you fight it. I was fighting that outcome – and have been ever since."
  • Eight years after QF72 dived towards the ocean, the Top Gun pilot nicknamed "Sully" since his teens is breaking his silence. "We're in an out-of-control aeroplane, we're all juiced up by our own bodies because, we thought, we are in a near-death situation, and we've got to be rocket scientists to figure out how we can go in there and land the plane outside of any established procedures," he says.
  • "We were never given any hint during our conversion course to fly this aeroplane that this could happen. And even, I think, the manufacturer felt this could never happen. It's not their intention to build an aeroplane that is going to go completely haywire and try and kill you."
  • The events of October 7, 2008, are not merely about how three Qantas pilots found themselves fighting to save a passenger plane from itself. It serves as a cautionary tale as society accelerates towards a world of automation and artificial intelligence.

Source: The Sydney Morning Herald

1

Accident report

  • Date: 7 October 2008
  • Time: 1242 local (0442 UTC)
  • Type: Airbus A330-303
  • Operator: Qantas
  • Registration: VH-QPA
  • Fatalities: 0 of 12 crew, 0 of 303 passengers
  • Aircraft damage: Minor
  • Phase: En route
  • Airport (departure): Singapore-Changi International Airport (SIN / WSSS)
  • Airport (arrival): Perth Airport, WA (PER / YPPH)

2

Narrative

Most of this comes from the Australian Transport Safety Bureau Report. It is fairly detailed and can get confusing. I've highlighted the important bits in amber and made several comments. The two pitch downs might seem relatively mild when compared to what they could have been, but there were a lot of unbelted passengers and crew who sustained considerable injuries. At least 110 of the 303 passengers and nine of the 12 crew were injured, 12 seriously.

At 0132 Universal Time Coordinated (UTC) (0932 local time) on 7 October 2008, an Airbus A330-303 aircraft, registered VH-QPA (QPA), departed Singapore on a scheduled passenger transport service to Perth, Western Australia. The captain was the pilot flying. By 0201, the aircraft was in cruise with autopilot 1 engaged and maintaining the aircraft’s altitude at flight level (FL) 370. The autothrust was engaged and managing engine thrust to maintain a cruising speed of Mach 0.82. The flight crew stated that the weather was fine and clear and there had been no turbulence during the flight.

Source: ATSB Report, ¶1.1.1

The captain was in the left seat, the second officer was in the right seat, and the first officer was on a rest break in the cabin.

At 0440:26, one of the aircraft’s three air data inertial reference units (ADIRU 1) started providing incorrect data to other aircraft systems. At 0440:28, the autopilot automatically disconnected, and the captain took manual control of the aircraft.

Source: ATSB Report, ¶1.1.2

Within 5 seconds of the autopilot disconnecting, a series of caution messages began appearing on the aircraft’s electronic centralized aircraft monitor (ECAM), each associated with a master caution chime. The crew also started receiving aural stall warnings and overspeed warnings, although each warning was only annunciated briefly. These cautions and warnings occurred frequently, and continued for the remainder of the flight.

Source: ATSB Report, ¶1.1.2

The crew canceled the autopilot disconnection warning message (AUTO FLT AP OFF) on the ECAM and then engaged autopilot 2. After canceling the autopilot message, they noticed a NAV IR 1 FAULT caution message on the ECAM, with an associated IR 1 fault light on the overhead panel. The ECAM was also displaying other caution messages at this time.

Source: ATSB Report, ¶1.1.2

In addition to the warnings and cautions, the crew reported that the airspeed and altitude indications on the captain’s primary flight display (PFD) were fluctuating. No such fluctuations were occurring on the first officer’s PFD or the standby flight instruments. The fluctuations on the captain’s PFD appeared to be based on unreliable information, as there was no other indication that the aircraft was actually near a stall or overspeed condition. Because the captain was unsure of the veracity of the information on his PFD, he used the standby instruments and the first officer’s PFD when flying the aircraft.

Source: ATSB Report, ¶1.1.2

For us non-Airbus pilots it would seem the problem is solved now. The problem appears to be with NAV IR 1 and we are on autopilot 2. But, as we shall see, the Airbus makes things much more complicated.

Data from the flight data recorder (FDR) showed that autopilot 2 was engaged for 15 seconds before being disconnected by the crew. (The captain reported that he disconnected autopilot 2 as it was a required action in the event of an unreliable airspeed situation. Although the captain’s airspeed values were fluctuating, there was no evidence of any problems with the other two airspeed sources during the flight.) The FDR also showed that, during the period between the initial autopilot 1 disconnection and when autopilot 2 was engaged, the aircraft’s altitude increased to 37,180 ft. During the short period when autopilot 2 was engaged, the aircraft started to return to the assigned level. Although the crew received numerous ECAM caution messages, none of them required urgent action, and none of them indicated any potential problems with the aircraft’s flight control system. However, the captain was not satisfied with the information that the aircraft systems were providing, and he asked the second officer to call the first officer back to the flight deck to help them diagnose and manage the problems.

Source: ATSB Report, ¶1.1.2

At 0442:27 (1242:27 local time), while the second officer was asking the cabin services manager (CSM) via the cabin interphone to send the first officer to the flight deck, the aircraft abruptly pitched nose down. The FDR showed that the pitch-down movement was due to a sudden change in the position of the aircraft’s elevators, and that the aircraft reached a maximum nose-down pitch angle of 8.4°. The flight crew described the pitch-down movement as very abrupt, but smooth. It did not have the characteristics of a turbulence-related event and the aircraft’s movement was solely in the pitching plane.

Source: ATSB Report, ¶1.1.3

There are some news reports that say this pitch-down was caused by the autopilot; that is not true. The autopilots were not engaged. The aircraft's flight control system caused the pitch-down and overruled the pilot's inputs for a few seconds. Note also the "very abrupt, but smooth" description. It was anything but smooth in the cabin:

Passengers and crew reported that the first upset occurred without any warning. They first noticed a sudden movement of the aircraft, generally described as a ‘drop’ or a ‘pitch-down’. A significant number of passengers and crew were not seated or were seated without seat belts fastened, and they were thrown upwards. There was a loud bang as most of these occupants hit the ceiling, followed by the sound of many passengers screaming. Many of the occupants were injured when they hit the ceiling, or when they landed back on the floor or seats. Several of the overhead storage compartments opened, and there were some reports that bags fell out. Loose objects were also thrown upwards and around the cabin.

Source: ATSB Report, ¶4.2.2

The FDR showed that the captain immediately applied back pressure on his sidestick to arrest the pitch-down movement. The aircraft’s flight control system did not initially respond to the captain’s sidestick input, but after about 2 seconds the aircraft responded normally and the captain commenced recovery to the assigned altitude. During this 2-second period the aircraft descended about 150 ft. Overall, the aircraft descended 690 ft over 23 seconds before returning to FL370.

Source: ATSB Report, ¶1.1.3

After returning the aircraft to FL370, the flight crew commenced actions to deal with multiple ECAM caution messages. These were:

  • NAV IR 1 FAULT: the required action presented on the ECAM was to switch the ATT HDG (attitude heading) switch from the NORM position to the CAPT ON 3 position. The crew completed the required action and cleared the message.
  • F/CTL PRIM 3 FAULT: the required action was to select the flight control primary computer 3 (FCPC 3) OFF and then select it back ON. The crew completed the required action and cleared the message.
  • NAV IR 1 FAULT: this message had reappeared at the top of the ECAM with no additional required actions, and was canceled by the crew.
  • F/CTL PRIM 1 PITCH FAULT: there was no required action associated with this message. The crew confirmed that there was no associated fault light on the overhead panel and then started reviewing the flight control page on the ECAM’s system display.

Source: ATSB Report, ¶1.1.3

Note that the Airbus procedure leaves everything except the captain's attitude heading system where it was when the first upset occurred. Specifically, flight control primary computer 3 (PRIM 3) was turned off and then on again. (A computer reset.)

At 0445:08 (1245:08 local time), while the crew were responding to the ECAM messages, the aircraft commenced a second pitch-down movement, reaching a maximum pitch angle of about 3.5° nose down. The flight crew described the event as being similar in nature to the first event but less severe. The captain promptly applied back pressure on his sidestick to arrest the pitch-down movement. He said that, consistent with the first event, this action initially had no effect, but soon after the aircraft responded normally. FDR data showed that the flight control system did not respond to flight crew inputs for at least 2 seconds, and that the aircraft descended 400 ft over 15 seconds before returning to FL370.

Source: ATSB Report, ¶1.1.4
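
To put the two upsets in perspective, the altitude and time figures quoted above can be turned into average descent rates. This is only back-of-the-envelope arithmetic on the ATSB numbers; the function name is illustrative, and an average says nothing about the peak rate or the g-loading that threw people into the ceiling.

```python
def avg_descent_rate_fpm(feet_lost, seconds):
    """Average descent rate in feet per minute over the given interval."""
    return feet_lost / seconds * 60


# Figures quoted from the ATSB report excerpts above.
first_two_seconds = avg_descent_rate_fpm(150, 2)    # sidestick ignored
first_upset = avg_descent_rate_fpm(690, 23)         # whole first descent
second_upset = avg_descent_rate_fpm(400, 15)        # whole second descent

for label, rate in [("first upset, initial 2 s", first_two_seconds),
                    ("first upset, overall", first_upset),
                    ("second upset, overall", second_upset)]:
    print(f"{label}: about {rate:.0f} ft/min")
```

The 2 seconds during which the sidestick had no effect work out to an average of 4,500 ft/min, well over twice the average for the first descent as a whole.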

Following the second in-flight upset, the crew continued to review the ECAM messages and other indications. The first ECAM message they noticed was F/CTL ALTN LAW (PROT LOST). The next messages were recurrences of the NAV IR 1 FAULT and F/CTL PRIM 3 FAULT messages. The crew reported that the IR1 FAULT light and the PRIM 3 FAULT light on the overhead panel were illuminated. No other fault lights were illuminated.

Source: ATSB Report, ¶1.1.4

The crew reported that by this time ECAM messages were frequently scrolling, with each new caution message being placed at the top of the list. The NAV IR 1 FAULT message kept recurring, together with several other messages, such as NAV GPS FAULT, and they could not effectively interact with the ECAM to action and/or clear the messages. Master caution chimes associated with the ECAM messages were frequently occurring, together with aural stall warnings and overspeed warnings. The crew stated that these constant aural alerts, and the inability to silence them, were a significant source of distraction.

Source: ATSB Report, ¶1.1.4

At 0446:10, the captain made a public announcement to the cabin, advising that the crew were dealing with flight control problems, and telling everyone to remain seated with their seat belts fastened. The second officer contacted the cabin again by interphone to ask a flight attendant to send the first officer to the flight deck.

Source: ATSB Report, ¶1.1.5

The captain reported that, after the second upset event, he observed that the automatic pitch trim (autotrim) was not functioning and he began trimming the aircraft manually. The crew advised that, because the autotrim was not working, they thought the flight control system was in direct law.

Source: ATSB Report, ¶1.1.5

It is unclear if the aircraft was indeed in direct law, described here: Airbus Control Laws. That would have been good news, meaning the pilot's control inputs would take primacy over those of the flight control computers.

With the exception of the loss of autotrim, the captain reported that the aircraft was flying normally. At 0447:25, he disconnected the autothrust to minimize any potential problems associated with the erroneous air data information affecting the electronic engine control units. He then flew the aircraft without the autopilot or autothrust engaged, and using the standby instruments, for the remainder of the flight.

Source: ATSB Report, ¶1.1.5

The first officer returned to the flight deck at 0447:40, taking over from the second officer in the right control seat while the second officer moved to the third occupant seat. The crew discussed the situation, and the captain stated that he would continue flying the aircraft.

Source: ATSB Report, ¶1.1.5

The crew decided that they needed to land the aircraft as soon as possible. They were concerned that further pitch-down movements could occur, and they were aware that Learmonth was relatively close and that it was a suitable destination for an A330 landing. In addition, when the first officer returned to the flight deck, he advised the other flight crew that there had been some injuries in the cabin.

Source: ATSB Report, ¶1.1.5

At 0449:05, the first officer made a PAN broadcast to air traffic control, stating that they had experienced ‘flight control computer problems’ and that some of the aircraft’s occupants had been injured. He requested a clearance to divert to and track direct to Learmonth. The controller cleared the crew to descend to FL350.

Source: ATSB Report, ¶1.1.5

The captain told the second officer to obtain further information from the cabin while the captain and first officer started preparing for the descent. At 0450:40, the second officer contacted the flight attendant at the Left 1 door position in the cabin (section 4.1.1) to get further information on the extent of the injuries.

Source: ATSB Report, ¶1.1.5

At 0451:25, the first officer requested further descent from air traffic control. The controller cleared the crew to leave controlled airspace and proceed direct to Learmonth.

Source: ATSB Report, ¶1.1.5

After receiving advice from the cabin of several serious injuries, the captain asked the first officer to declare a MAYDAY. At 0454:25, the first officer declared the MAYDAY and advised air traffic control that they had multiple injuries on board.

Source: ATSB Report, ¶1.1.5

During the process of organizing the diversion to Learmonth, the flight crew again reviewed what had happened, their current situation, and the ECAM messages. They noted that the NAV IR 1 FAULT and F/CTL PRIM 3 FAULT messages were still occurring, together with several other caution messages. They concluded that the ECAM was not providing them with useful information or recommended actions. Consequently, at 0456:05, the first officer contacted the operator’s maintenance watch unit, located in Sydney, New South Wales, by a satellite communications system (SATPHONE) to brief them on the situation and to seek assistance.

Source: ATSB Report, ¶1.1.6

There were subsequently several communications between the flight crew and maintenance watch about the fault messages and other flight deck indications. In one discussion at 0510, maintenance watch advised the flight crew that ‘ADIRU 1’ appeared to be common to the fault messages being displayed, but that there was also some conflicting information regarding elevator control. They provided no recommended actions at that stage. In a subsequent discussion, maintenance watch recommended that, at the crew’s discretion, they could select PRIM 3 (or flight control primary computer 3) OFF. The crew discussed this recommendation and, at 0520, switched that computer OFF. This action had no effect on the scrolling ECAM messages, stall warnings or overspeed warnings.

Source: ATSB Report, ¶1.1.6

During the descent, the flight crew also had several communications with air traffic control regarding descent procedures, whether the approach and landing would be normal, whether there were any dangerous goods on board, and the availability of emergency services at Learmonth. The flight crew also had multiple communications with the cabin services manager, and made public announcements to the cabin regarding the situation and the diversion to Learmonth (section 4.2).

Source: ATSB Report, ¶1.1.6

In order to lose altitude for landing, the captain conducted a series of wide left orbits to maintain the aircraft’s speed below 330 kts (maximum operating speed). He reported that he descended cautiously in order to prevent any potential problems associated with another unexpected pitch-down event.

Source: ATSB Report, ¶1.1.6

The accident report severely downplays what I think is a crucial decision and a potential lesson for any pilots in similar situations. The Sydney Morning Herald article captured it:

  • But before they can land, they have to check whether their flight control system is working properly. Flying over Learmonth, the wing flaps are extended as the pilots conduct two S-turns to confirm they are OK, and the landing gear is lowered. It is enough for Sullivan. He is desperate to get the plane on the ground. The extent of injuries will not be known until emergency services are on board.

Source: The Sydney Morning Herald

The "controllability check" is a staple of military aviation, designed to ensure the aircraft is maneuverable in the landing configuration at a safe altitude, before attempting it where the margin for error is reduced. The primary advantage is that if aircraft control cannot be maintained at normal approach speeds, pilots can determine how slow and how fully configured the aircraft can be while maintaining control. In Qantas 72's case, the aircraft was controllable but the exercise gave the pilots the confidence needed to make a normal approach.

The crew completed the approach checklist and conducted a flight control check above 10,000 ft. After further descent, the aircraft was positioned at about 15 NM (28 km) for a straight-in visual approach to runway 36. The precision approach path indicator was sighted at about 10 NM (19 km), and the aircraft landed at Learmonth at 0532 (1332 local time).

Source: ATSB Report, ¶1.1.6


3

Analysis

For those of us flying aircraft where the pilot has primacy over the automation, the Airbus electrical flight control system can seem nonsensical. It is important to keep in mind that there are situations where the pilot's inputs are ignored and that has to be a major consideration for any Airbus pilot trying to troubleshoot a flight control problem. The wording of the accident report can be misleading too. The report appears to bend over backwards when excusing the aircraft designers.

The A330’s electrical flight control system (EFCS) was a ‘fly-by-wire’ system. That is, there was no direct mechanical linkage between most of the flight crew’s controls and the flight control surfaces. Flight control computers sent movement commands via electrical signals to hydraulic actuators that were connected to the control surfaces. The computers sensed the response of the control surfaces to these commands, and adjusted the commands as required.

Source: ATSB Report, ¶1.6.3

The A330 EFCS had three flight control primary computers (FCPCs, commonly known as PRIMs) and two flight control secondary computers (FCSCs, commonly known as SECs). One of the FCPCs (normally FCPC 1) acted as the ‘master’ FCPC. It computed the appropriate control orders, and sent these orders to the other computers to action.

Source: ATSB Report, ¶1.6.3

The master FCPC computed the control orders according to a ‘control law’, with different functionality provided depending on the law being used. There were three levels of control law, and each level provided different functionality as follows:

  • Normal law. The EFCS detected when the aircraft was approaching the limits of certain flight parameters, and commanded control surface movements to prevent the aircraft from exceeding these limits (that is, it prevented the aircraft from exceeding a predefined safe flight envelope). Automatic flight-envelope protections included high AOA protection, load factor limitation, pitch attitude protection, roll attitude protection, and high speed protection.
  • Alternate law. The EFCS switched to alternate law if there were certain types or combinations of failures within the flight control system or related systems. Some types of protection, such as high AOA protection, were not provided, and others were provided using alternate logic.
  • Direct law. The EFCS switched to direct law in situations where there were more failures of relevant, redundant systems in addition to those that led to the reversion to alternate law. No flight-envelope protections were provided, and control surface deflection was proportional to sidestick and rudder pedal movement by the flight crew.

Source: ATSB Report, ¶1.6.3
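
The three control laws can be summarized as a small lookup. This is only a sketch of the excerpt above, not Airbus logic: the names are illustrative, and because the excerpt does not list which protections survive in alternate law, the code records only what is actually stated.

```python
from enum import Enum


class ControlLaw(Enum):
    NORMAL = "normal"
    ALTERNATE = "alternate"
    DIRECT = "direct"


# Protections the ATSB excerpt lists for normal law.
NORMAL_LAW_PROTECTIONS = frozenset({
    "high_aoa", "load_factor", "pitch_attitude", "roll_attitude", "high_speed",
})


def high_aoa_protection_active(law: ControlLaw) -> bool:
    """High AOA protection is described only for normal law."""
    return law is ControlLaw.NORMAL


def has_envelope_protection(law: ControlLaw) -> bool:
    """Direct law provides no flight-envelope protections at all."""
    return law is not ControlLaw.DIRECT


for law in ControlLaw:
    print(law.value,
          "high AOA protection:", high_aoa_protection_active(law),
          "| any protection:", has_envelope_protection(law))
```

The point for troubleshooting is in the first function: the high AOA protection exists only in normal law.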

For more about this, see: Airbus Control Laws.

The ADIRUs and Angle of Attack

The air data and inertial reference system (ADIRS) provided important information about the outside environment (such as air pressure and temperature), the aircraft’s state relative to the outside air (such as airspeed, altitude and angle of attack), and the aircraft’s state relative to the Earth (position, motion and orientation).

Source: ATSB Report, ¶1.6.4

To provide redundancy, the ADIRS included three air data inertial reference units (ADIRU 1, ADIRU 2, and ADIRU 3).

Source: ATSB Report, ¶1.6.4

Each ADIRU had two parts, an air data reference (ADR) part and an inertial reference (IR) part, which were integrated into a single unit. The two parts shared some common modules, such as the central processing unit module. In most cases, if one of the two parts failed, the other could still operate.

Source: ATSB Report, ¶1.6.4

Each ADIRU had its own, independent sensors. An AOA sensor and a total air temperature (TAT) probe provided data via analogue electrical signals directly to the ADIRU. In addition, a pitot probe and two static ports provided data to the ADIRU via air data modules (ADMs), which converted air pressure signals to digital signals.

Source: ATSB Report, ¶1.6.4

The ADR parameters used instantaneous measurements; that is, each measurement was completely independent of previous measurements. As a result, any corruption of the data did not have an ongoing effect on subsequent calculations.

Source: ATSB Report, ¶1.6.4

Image: AOA Inputs to ADIRUs and FCPCs, ATSB, figure 15.

  • Each AOA sensor utilized two independent outputs (‘A’ and ‘B’). The relevant ADIRU compared the A and B signals and, if they agreed, processed the data.
  • To calculate ‘indicated AOA’, the ADIRU corrected the AOA vane angle for a 25° offset. To derive the ‘corrected AOA’, the ADIRU adjusted indicated AOA for the aircraft’s configuration (such as slats/flaps position). The corrected AOA value was passed on to other aircraft systems, and the output range could vary from -40 to +90°.
  • Each ADIRU sent its AOA outputs to all three FCPCs.
  • Under normal law on the A330, the EFCS protected the aircraft against a stall condition. If the EFCS detected that the aircraft’s AOA exceeded a threshold value, it would command a pitch-down movement using the aircraft’s elevators.
  • The aircraft’s flight warning system (FWS) also monitored the AOA data from the three ADIRUs. If it detected that the AOA was above a threshold value (which was different to that used by the EFCS), it would trigger an aural stall warning. In normal law, the threshold value of the warning was set at a high level of 23° to prevent unwarranted activations. In alternate or direct law, high AOA protection was lost and the stall warning was triggered when the highest of the valid AOA values exceeded the threshold for the flight conditions at the time.
  • In common with most aircraft types, AOA was not displayed to the flight crew.

Source: ATSB Report, ¶1.6.5
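
The stall warning logic helps explain why a single misbehaving ADIRU could fill the cockpit with spurious warnings. The sketch below is my reading of the excerpt, not the actual flight warning system: it assumes the warning keys off the highest valid AOA value in every law (the excerpt states this only for alternate and direct law), and the function name and the use of None for an invalid value are illustrative.

```python
NORMAL_LAW_STALL_THRESHOLD_DEG = 23.0  # normal-law value quoted by the ATSB


def stall_warning(aoa_values_deg, threshold_deg=NORMAL_LAW_STALL_THRESHOLD_DEG):
    """Return True if the highest valid AOA exceeds the threshold.

    aoa_values_deg: one value per ADIRU, or None if that value is invalid.
    Assumption: the highest valid value drives the warning in every law.
    """
    valid = [aoa for aoa in aoa_values_deg if aoa is not None]
    return bool(valid) and max(valid) > threshold_deg


print(stall_warning([2.1, 2.1, 2.1]))      # all three ADIRUs agree
print(stall_warning([50.625, 2.1, 2.1]))   # one ADIRU outputs a spike
```

Under that assumption, one 50.625° spike from one ADIRU is enough to sound the warning even against the deliberately high 23° threshold.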

I'm not so sure how "common" it is to deprive the flight crew of AOA. In the hands of the naval aviator in the left seat, this information could have been critical.

What Went Wrong

  • AOA is a critically important flight parameter, and an aircraft with a full-authority flight control system (such as that on the A330 and A340) needs to be designed so that it obtains and uses accurate AOA information. The primary means of defense against an ADIRU providing incorrect AOA data to the FCPCs was the ADIRU itself, but this was not effective on the occurrence flight.
  • However, aircraft systems are designed with the expectation that technical faults will occasionally occur. Accordingly, the aircraft had three ADIRUs to provide redundancy and fault tolerance. Using the median of three values for a parameter as the system input is a common and generally robust algorithm, and the A330/A340 EFCS used this approach for most parameters. However, in order to address aerodynamic issues associated with the locations of the three AOA sensors, the FCPCs based the system input on the average value of AOA 1 and AOA 2. Nevertheless, they still used all three AOA values to check for consistency, as a basis for filtering out deviating values of AOA 1 and AOA 2, and for triggering a 1.2-second memorization period using the previous value if an errant value of AOA 1 or AOA 2 was detected.
  • The FCPC algorithm was generally very effective, and could deal with almost all possible situations involving incorrect AOA data being provided by one ADIRU. It could manage step-changes, runaways, single spikes, and most situations involving multiple spikes or intermittently incorrect data. For example, the ADIRU data-spike failure mode occurred on 12 September 2006 with spurious stall warnings (and therefore AOA spikes) occurring over a 30-minute period with no reported effect on the aircraft’s flightpath. On the 7 October 2008 flight, there were a large number of AOA spikes transmitted by ADIRU 1, and almost all of these were effectively filtered by the FCPCs.

Source: ATSB Report, ¶5.2.1
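
The difference between the two voting schemes is easy to see with numbers. This is a minimal sketch, not the FCPC code: the function names are illustrative, the 50.625° spike is the value the FDR recorded for AOA 1, and the 2.1° used for AOA 2 and AOA 3 is an assumption for illustration.

```python
from statistics import median


def median_of_three(aoa1, aoa2, aoa3):
    """The common approach: one wild value cannot move the result."""
    return median([aoa1, aoa2, aoa3])


def average_of_one_and_two(aoa1, aoa2):
    """The approach used for AOA: a wild value shifts the result by half."""
    return (aoa1 + aoa2) / 2


aoa1, aoa2, aoa3 = 50.625, 2.1, 2.1   # spike on AOA 1 only

print("median of three:", median_of_three(aoa1, aoa2, aoa3))
print("average of 1 and 2:", round(average_of_one_and_two(aoa1, aoa2), 2))
```

The median simply ignores the spike. The average does not, which is why everything depends on the consistency check and memorization logic catching the spike first.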

The accident report uses the term "almost all" as if to say this is okay. If we were talking about a system that simply reported AOA to the pilot, I suppose that would be true. But we are talking about a system that has priority over the pilot. If you said a pilot was safe "almost all" of the time, you might think twice before allowing that pilot to fly your passengers.

  • Nevertheless, the FCPC’s AOA algorithm could not effectively manage a scenario where there were multiple spikes such that one triggered a memorization period and another was present 1.2 seconds later. The problem was that, if a 1.2-second memorization period was triggered, the FCPCs accepted the next values of AOA 1 and AOA 2 after the end of the memorization period as valid. In other words, the algorithm did not effectively handle the transition from the end of a memorization period back to the normal operating mode when a second data spike was present.

Source: ATSB Report, ¶5.2.1

Image: QAR plot showing key ADIRU parameters, ATSB, figure 21.

As best I can decipher this: the FCPCs average AOA 1 and AOA 2 and compare all three AOA values as a way of filtering out deviating values. When they detect a data spike, the FCPCs also hold ("memorize") the previous value for 1.2 seconds rather than use the bad data. But there is a problem. If the spike is gone by the time the 1.2 seconds are up, no worries. But if another data spike is present at the moment the memorization period ends, the system is fooled into thinking the data spike is actually good data.
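
The handoff problem can be reproduced in a toy model. This is a sketch under loud assumptions, not the FCPC algorithm: the 0.1-second tick, the 5° deviation limit, the use of the median as the reference for the consistency check, and all names are hypothetical. Only the 1.2-second memorization period, the averaging of AOA 1 and AOA 2, and the unchecked acceptance of the first values after the period ends come from the report.

```python
from statistics import median

TICK_S = 0.1                 # assumed sample period (not from the report)
MEMORIZATION_TICKS = 12      # 1.2 s memorization period / 0.1 s per tick
DEVIATION_LIMIT_DEG = 5.0    # hypothetical spike-detection threshold


def fcpc_aoa_input(samples):
    """Return the AOA value used at each tick.

    samples: list of (aoa1, aoa2, aoa3) tuples in degrees.
    Assumes the first sample is good data.
    """
    used = []
    last_good = None
    hold_remaining = 0       # ticks left in the memorization period
    just_released = False    # True on the first tick after a hold ends

    for aoa1, aoa2, aoa3 in samples:
        average = (aoa1 + aoa2) / 2

        if hold_remaining > 0:
            # Inside the memorization period: keep using the stored value.
            used.append(last_good)
            hold_remaining -= 1
            just_released = hold_remaining == 0
            continue

        middle = median([aoa1, aoa2, aoa3])
        deviates = (abs(aoa1 - middle) > DEVIATION_LIMIT_DEG
                    or abs(aoa2 - middle) > DEVIATION_LIMIT_DEG)

        if deviates and not just_released:
            # Spike caught: start a 1.2 s hold on the last good value.
            used.append(last_good)
            hold_remaining = MEMORIZATION_TICKS - 1
        else:
            # The flaw: the first sample after a hold is accepted unchecked.
            last_good = average
            used.append(average)
        just_released = False

    return used


def flight(spike_ticks, length=40, normal=2.1, spike=50.625):
    """Build samples where only AOA 1 spikes, at the given ticks."""
    return [(spike if t in spike_ticks else normal, normal, normal)
            for t in range(length)]


single = fcpc_aoa_input(flight({5}))
spaced = fcpc_aoa_input(flight({5, 25}))
unlucky = fcpc_aoa_input(flight({5, 17}))   # second spike 1.2 s after the first

print(max(single), max(spaced), round(max(unlucky), 2))
```

In this model the single spike and the two well-spaced spikes never reach the output. Only the second spike that lands exactly as the hold expires gets through, as the average of 50.625° and 2.1°, or roughly 26°.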

  • A subsequent review of the FCPC algorithm for processing AOA data identified a very specific (and unintended) scenario in which incorrect AOA data from only one of the aircraft’s three air data inertial reference units (ADIRUs) could trigger a pitch-down command. The scenario required two AOA spikes, with the second being present 1.2 seconds after the start of the first.
  • Two minutes before the first pitch-down, ADIRU 1 started outputting spikes in AOA data (and other ADIRU parameters), and the spikes were present at the time of both pitch-downs.
  • Simulations by the aircraft manufacturer confirmed that AOA spikes of the magnitudes recorded during the flight could initiate the elevator movements observed during the pitch-downs.

Source: ATSB Report, ¶5.1

Image: FDR plot showing recorded AOA spikes for both pitch-downs, ATSB, figure 23.

  • As the captain’s AIR DATA switch remained set to the NORM position, the source of AOA 1 data was always ADR 1. Key results for the AOA 1 data were as follows:
    • The FDR recorded 42 AOA 1 spikes during the period from 0440:26 until the aircraft landed at Learmonth.
    • The first AOA 1 spike occurred at 0440:34. AOA 1 values changed from 2.1° to 50.625° and back to 2.1° over three successive (1-second) samples.
    • One of the recorded AOA spikes occurred at 0442:26, immediately prior to the first pitch-down (0442:27). The AOA value was 50.625°.
    • Another of the recorded AOA spikes occurred at 0445:08, immediately prior to the second pitch-down (0445:09). The AOA value was 50.625°.
    • Most of the AOA spikes that were recorded by the FDR were 50.625° in magnitude and occurred during the initial 12 minutes after the first spike. Two other values (16.875° and 5.625°) were recorded later in the flight.
  • Only a small number of situations could trigger a master warning, including an involuntary (uncommanded) autopilot disconnection, stall, overspeed, engine fire, or a high cabin altitude. The first master warning on the occurrence flight was due to the involuntary autopilot disconnection while the later master warnings were due to either stall warnings or overspeed warnings. Although there was a general correlation between the times of the master warnings and the stall or overspeed warnings, it was not exact. This result was explained by the sampling rates and the computation time of the FWS, the sampling rates of the acquisition unit for the FDR, and the brevity of the stall and overspeed warnings.

Source: ATSB Report, ¶1.11.4
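
The sampling point in that last bullet is worth a quick illustration. The sketch below is my own, not anything from the report: it assumes a recorder that samples once per second on whole-second boundaries, and the function name and example start times are hypothetical. It simply checks whether a short-lived event is ever seen by such a recorder. The 0.4-second duration is borrowed from the manufacturer's simulation estimate of the first spike, quoted later in this section.

```python
import math

def captured_by_recorder(start_s, duration_s, sample_period_s=1.0):
    """Return True if a sample instant (0, T, 2T, ...) falls inside the event.

    The event is assumed to occupy the interval [start_s, start_s + duration_s).
    """
    first_sample = math.ceil(start_s / sample_period_s) * sample_period_s
    return first_sample < start_s + duration_s

# A 0.4-second event that begins just after a sample instant is missed...
print(captured_by_recorder(start_s=10.2, duration_s=0.4))
# ...while the same event beginning shortly before a sample is recorded.
print(captured_by_recorder(start_s=10.7, duration_s=0.4))
```

Under these assumptions, anything shorter than one sample period can slip between samples, which is one reason a once-per-second trace and the warnings it is compared against need not line up exactly.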

The FDR data showed problems with AOA 1 (as well as other data from ADIRU 1). Accordingly, the AOA 1 sensor (serial number 0861ED-972) was removed from the aircraft following the occurrence and tested by an authorized agency. No fault was found with the sensor and all test parameters were within limits. The wiring between the AOA 1 sensor and the ADIRU was also tested, and no problems were identified.

Source: ATSB Report, ¶1.12.10

It appears the AOA sensor and ADIRU were working as designed.

Other ADIRU data-spike occurrences

  • [A] 12 September 2006 occurrence involved the same aircraft (QPA) and the same ADIRU (4167) as the 7 October 2008 occurrence. No recorded data was available for that flight other than the PFR. However, the PFR and the flight crew’s description of numerous warning and caution messages provided sufficient evidence to conclude that the occurrence involved similar ADIRU data output anomalies as that which occurred on the 7 October 2008 flight. Following the 2006 flight, line testing of the ADIRU was conducted in accordance with the manufacturer’s procedures; no fault was found and the aircraft was returned to service.
  • [A] 27 December 2008 occurrence involved another of the operator’s A330-303 aircraft, registered VH-QPG (QPG), and a different LTN-101 ADIRU (serial number 4122). In that occurrence, the crew reported receiving numerous caution messages and that the messages on the ECAM were constantly changing. The crew followed a new procedure that was introduced after the 7 October 2008 occurrence, which was successful in shutting down the ADR part 28 seconds after the failure mode started (and 24 seconds after the autopilot disconnected). The new procedure was not effective in shutting down the IR part. The FDR and QAR data showed evidence of data spikes on all IR parameters and some of the ADR parameters (in the 28 seconds during which it operated in the failure mode). General observations about the spike timing and magnitudes were similar to those for the 7 October 2008 event. In addition to data spikes, the IR data for the 27 December 2008 event also showed similar patterns as the IR data on the 7 October 2008 flight. ADIRU 4122 was removed from the aircraft and sent to the manufacturer’s facilities for examination and testing and to download the BITE data. The testing did not identify any faults. The BITE data was very similar to that recovered from unit 4167 after the 7 October 2008 occurrence. That is, there were no faults recorded during the occurrence flight, and several routine messages normally stored in BITE were not recorded.
  • In both the 12 September 2006 and the 27 December 2008 occurrences there was no effect on the aircraft’s flight controls, and consequently there were no FCPC (PRIM) faults or PITCH faults. The flight crew of the 12 September 2006 flight discussed their event with maintenance watch, and completed a technical log entry. At the time, the event appeared to be primarily associated with nuisance ECAM messages. There was no autopilot disconnection, and no effect on the flight control system. The crew turned the ADIRU off in flight, and the anomalous behavior ceased. Line maintenance personnel realigned the unit and conducted a system test, and no further problems were identified.

Source: ATSB Report, ¶1.16.2

The fact that Airbus released a new procedure to shut down the ADR part of the ADIRU on 15 October 2008, just a week after Qantas 72, reveals there was indeed a problem with the design of the electrical flight control system. The accident report's pronouncement that "there was no effect on the aircraft's flight controls" is misleading, in that without the change in procedure there could have been a significant effect.

In theory, the 12 September 2006 occurrence provided an opportunity for the ADIRU data-spike failure mode and the design limitation of the A330/A340 flight control system to be identified before the 7 October 2008 occurrence. In reality, based on the available information, there were good reasons for not conducting any further investigation at the time.

Source: ATSB Report, ¶5.5.1.

There have been no other accidents on A330/A340 aircraft associated with the flight control system providing pitch-down commands in response to incorrect ADIRU data.

Source: ATSB Report, ¶1.16.3

  • The FCPC algorithm for processing AOA data was able to detect and manage almost all situations involving incorrect or inconsistent AOA data being sent from the ADIRUs as ‘valid’ data.
  • A significant step-change of either AOA 1 or AOA 2 would trigger a 1.2-second memorization period and have no effect on AOAFCPC input. If the change lasted for more than 1 second, the FCPC would reject the relevant ADR and, following the 1.2-second memorization period, AOAFCPC input would be based on the average of AOA 3 and the remaining AOA value.
  • A step-change less than the monitoring threshold would have a constant but very minor effect on AOAFCPC input computations.
  • A single, short-duration spike in AOA 1 or AOA 2 would trigger a 1.2-second memorization period, with the last valid AOAFCPC input used during that period. Following the memorization period, AOAFCPC input would again be based on current values of AOA 1 and AOA 2.
  • The occurrence of any other spikes during the 1.2-second memorization period would not trigger another memorization period. The occurrence of a spike after the end of the memorization period would trigger a new memorization period.
  • Following the 7 October 2008 occurrence, the aircraft manufacturer identified a scenario in which deviations in only one of the three AOA values could significantly influence AOAFCPC input. The scenario involved two spikes in AOA 1 (or AOA 2) with the following properties:
    • The first spike was different to the median of the three AOA values, triggering a 1.2-second memorization period. This spike lasted for less than 1 second.
    • The second spike was present at the end of the memorization period (1.2 seconds after the start of the first spike).
  • If this scenario occurred, the FCPC would treat the second AOA spike as valid data after the end of the memorization period.

Source: ATSB Report, ¶2.1.6
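
To convince myself I understood the scenario in those bullets, I put together a minimal sketch of the logic. This is a simplified model under stated assumptions, not Airbus's algorithm: the 0.1-second time step and the 5° monitoring threshold are hypothetical, AOA 2 and AOA 3 are assumed to sit steady at 2.1°, and I assume a memorization period is triggered only at the onset of a deviation. The sketch also omits the rejection of an ADR after more than 1 second of deviation. Only the 1.2-second period and the 2.1° and 50.625° values come from the report.

```python
from statistics import median

STEP_S = 0.1           # hypothetical simulation time step
MEMORIZE_STEPS = 12    # the report's 1.2-second memorization period, in steps
THRESHOLD_DEG = 5.0    # hypothetical monitoring threshold
NORMAL_DEG = 2.1       # pre-spike AOA 1 value recorded by the FDR
SPIKE_DEG = 50.625     # recorded spike value

def aoa1_trace(n_steps, spikes):
    """Build an AOA 1 trace; spikes is a list of (first_step, last_step) pairs."""
    trace = [NORMAL_DEG] * n_steps
    for first, last in spikes:
        for i in range(first, last + 1):
            trace[i] = SPIKE_DEG
    return trace

def fcpc_aoa_input(aoa1, aoa2, aoa3):
    """Simplified model of the AOA value handed to the control laws."""
    output = []
    last_valid = (aoa1[0] + aoa2[0]) / 2
    hold_left = 0            # steps remaining in the memorization period
    was_deviating = False
    for a1, a2, a3 in zip(aoa1, aoa2, aoa3):
        mid = median([a1, a2, a3])
        deviating = (abs(a1 - mid) > THRESHOLD_DEG
                     or abs(a2 - mid) > THRESHOLD_DEG)
        onset = deviating and not was_deviating   # ASSUMPTION: edge-triggered
        was_deviating = deviating
        if onset and hold_left == 0:
            hold_left = MEMORIZE_STEPS
        if hold_left > 0:
            hold_left -= 1                  # memorizing: reuse last valid value
        else:
            last_valid = (a1 + a2) / 2      # normal: average of AOA 1 and AOA 2
        output.append(last_valid)
    return output

def peak_input(spikes, n_steps=40):
    """Highest value reaching the control laws for a given set of AOA 1 spikes."""
    steady = [NORMAL_DEG] * n_steps
    return max(fcpc_aoa_input(aoa1_trace(n_steps, spikes), steady, steady))

single = peak_input([(5, 8)])                   # one 0.4-second spike
late_second = peak_input([(5, 8), (20, 23)])    # second spike after the period ends
at_period_end = peak_input([(5, 8), (15, 20)])  # second spike present at 1.2 seconds
print(single, late_second, at_period_end)
```

In this simplified model the single spike and the late second spike never reach the output, while a spike that is already in progress when the memorization period ends is averaged straight in.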

Two of the EFCS’s flight envelope mechanisms could respond to high AOAFCPC input values and initiate a nose-down elevator command: high AOA protection and anti pitch-up compensation. If both corrective mechanisms were triggered at the same time, their contributions were added.

Source: ATSB Report, ¶2.1.7

  • The aircraft manufacturer conducted a series of simulations to determine the nature of the factors that could have contributed to the pitch-downs on the 7 October 2008 flight. These activities were performed using an engineering simulation tool developed during the aircraft development process.
  • This study confirmed that the recorded elevator deflections during the first pitch-down could be produced using a 26° value of AOAFCPC input for a duration of about 400 msec at the end of the memorization period (that is, the value of X was 400 msec). Based on this simulation, it could be concluded that the AOA 1 spike of 50.6° lasted at least 400 msec but less than 1 second.
  • The same process was used to determine the AOA values required to replicate the second pitch-down (at 0445:08). A single memorization period with different spike values and durations did not match the recorded elevator movements. Further studies identified a more complex scenario that did match these movements, and this scenario involved at least four spikes in succession and triggering two memorization periods in AOAFCPC input.
  • The simulation studies showed that spikes in AOA 1 values 1.2 seconds apart could lead to the FCPCs sending pitch-down commands to the elevators, and that these commands were consistent with the elevator deflections observed during the two pitch-downs. In addition, the studies confirmed that flight crew inputs and turbulence did not contribute to the pitch-down commands.

Source: ATSB Report, ¶2.2.1
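
The 26° figure is not arbitrary, at least as I read it. If the FCPC input is the average of AOA 1 and AOA 2, as the report's description of the algorithm suggests, a spiked AOA 1 averaged with a normal AOA 2 lands almost exactly there. The check below assumes AOA 2 was sitting near the 2.1° that AOA 1 showed just before its first recorded spike. That value for AOA 2 is my assumption, not a figure from the report.

```python
AOA1_SPIKE_DEG = 50.625    # spike value recorded by the FDR
AOA2_ASSUMED_DEG = 2.1     # ASSUMPTION: AOA 2 near AOA 1's pre-spike value

# Average of one spiked and one normal AOA value
averaged = (AOA1_SPIKE_DEG + AOA2_ASSUMED_DEG) / 2
print(f"Averaged AOA input: {averaged:.1f} degrees")
```

The result comes to a little over 26°, consistent with the value the manufacturer needed in its simulation to reproduce the first pitch-down.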

  • In terms of reasons why the failure scenario was not identified, the aircraft manufacturer noted that the FMEA supplied by the LTN-101 ADIRU manufacturer did not identify the type of ADIRU failure mode that occurred on the 7 October 2008 flight (that is, multiple, undetected spikes in AOA and other ADR parameters).
  • The manufacturer also stated that, prior to and during the design process in 1991-1992, it was aware that AOA spikes could occur on many flights, but in its experience only a very small number of spikes (if any) occurred on any particular flight. Prior to the 7 October 2008 occurrence, it was not aware of any events involving the ADIRU data-spike failure mode or similar ADIRU behavior. Between 1992 and 2009, A330/A340 aircraft conducted over 28 million flying hours and the 7 October 2008 occurrence was the only known example where multiple AOA spikes from one ADIRU had led to an undesired pitch-down command.

Source: ATSB Report, ¶2.5.4

The manufacturer was aware that AOA spikes could occur but considered multiple spikes on a single flight rare enough an occurrence to ignore. But they failed to provide pilots with a warning of the problem or a way to avoid an inflight upset in the event the problem did happen.

  • There was a 2-minute period between the commencement of the ADIRU failure and the flight control system’s first pitch-down command. The only flight crew action that would have prevented this pitch-down command was to select the ADR part of ADIRU 1 OFF.

Source: ATSB Report, ¶5.5.3

This procedure was mandated just a week after the Qantas 72 episode.

  • With the information available to the flight crew at the time of the occurrence, it was not reasonable to expect that they, or most crews in the same situation, would have made that selection in the available time. With very few exceptions, abnormal and emergency procedures on the A330 were carried out in response to ECAM messages. There was no ECAM message advising the flight crew of a NAV ADR 1 FAULT, or otherwise requiring the crew to select the ADR OFF. There were also no other procedures available for the situation they were experiencing.
  • If no specific action was recommended by the ECAM, or provided in other procedures, then the crew needed to evaluate the situation carefully before responding. The autopilot had disconnected, and there were nuisance stall and overspeed warnings, numerous caution messages on the ECAM, and problems with the air data information that were being provided on the captain’s primary flight display. The crew had not encountered such a situation before, and the underlying source of all the problems was not obvious. More importantly, based on their available systems knowledge, they had no reason to suspect that any of the problems they were experiencing were associated with a condition that could adversely influence the flight control system.
  • The flight crew on the 12 September 2006 flight did select the ADR OFF. However, they only took this action after 30 minutes, and only after they detected an intermittent ADR 1 fault light on the overhead panel. On the 7 October 2008 flight, no fault light was illuminated. The post-flight report did indicate that a NAV ADR 1 FAULT was recorded 33 minutes after the ADIRU failure, but the amount of time that it was displayed on the ECAM was not clear. More importantly, it was not present prior to the two pitch-down commands.
  • The flight crew of the 27 December 2008 flight involving similar, anomalous ADIRU behavior had new procedures available from the aircraft manufacturer as a result of the 7 October 2008 occurrence. When presented with the ADIRU failure mode, the crew promptly executed the procedures and selected the ADR OFF after 28 seconds, and this removed the potential for an inadvertent pitch-down associated with the incorrect ADIRU data.

Source: ATSB Report, ¶5.5.3

  • The captain’s sidestick responses to both pitch-downs were prompt. The FCPC pitch-down commands were each present for about 2 seconds, and the captain’s response to both occurred prior to the FCPCs being able to respond to these inputs. Given that the situation was sudden and unexpected, there was a risk that the flying pilot could have over corrected (that is, provided an excessive sidestick response), which would have led to more severe vertical accelerations during the recovery. However, in addition to being timely, the captain’s sidestick responses were also of the appropriate magnitude.
  • After the first pitch-down, the flight crew were presented with a situation that was even more confusing and much more serious, with several aircraft systems not functioning correctly. They continued to evaluate the available information and followed the recommended ECAM actions, but none of these actions were effective in preventing the second pitch-down.
  • Given the nature of their situation, the crew’s decision to divert to Learmonth was appropriate. They had good reasons to consider that further undesired pitch-down commands could occur, and it was therefore safest to get the aircraft on the ground as soon as possible. The diversion was also the quickest way to get medical assistance to the seriously injured passengers and crew. Learmonth was relatively close to the aircraft’s position at the time, and it was a suitable aerodrome for an A330 landing.
  • Following the decision to divert, the crew continued their attempts to diagnose the problems, and also requested assistance from the operator’s maintenance watch. These efforts did not resolve the situation and, as the flight progressed, they needed to focus more of their efforts on identifying and managing the logistical issues and threats associated with the diversion and landing. They also needed to maintain communications with the cabin, air traffic control and maintenance watch. These tasks were performed with a high degree of coordination and effectiveness by the flight crew.
  • Although the decision to divert was made at 0447, the aircraft did not land until 0532. This period of time was consistent with the time needed to cautiously descend the aircraft to minimize problems associated with another pitch-down.

Source: ATSB Report, ¶5.5.4


4

Cause

In my view, the accident report got the causes completely wrong. The only "significant safety issue" it identified was the faulty algorithm. But it wasn't the software that nearly killed 315 people. I don't think you've come to the right conclusion in an aircraft accident investigation until you find the root cause, and to do that, you have to use the word "fail" next to a human element. A good way to decipher an accident report that makes this error is to look at the recommendations, which I've also listed here. Based on these, I would rewrite the list of causes as follows:

  1. Airbus failed to fully test the electrical flight control system to anticipate spurious data inputs to the flight control system that could result in a critical flight attitude.
  2. Airbus failed to provide pilots with the necessary indications to diagnose a faulty flight control system or the procedures to deal with it.

As the big airplane world gravitates to a binary choice of Airbus and Boeing aircraft, finding a design flaw in one will necessarily tip the balance to the other. You will be hard pressed, for example, to find deliberate discussion about the battery fire problem on some Boeings. But you are even less likely to find anyone willing to look at the very fundamental issue of computer primacy over pilots at the very core of Airbus design. It is a heated battle, to be sure. But it seems to me there is enough evidence to at least bring up the question.

  • There was a limitation in the algorithm used by the A330/A340 flight control primary computers for processing angle of attack (AOA) data. This limitation meant that, in a very specific situation, multiple AOA spikes from only one of the three air data inertial reference units could result in a nose-down elevator command. [Significant safety issue]
  • When developing the A330/A340 flight control primary computer software in the early 1990s, the aircraft manufacturer’s system safety assessment and other development processes did not fully consider the potential effects of frequent spikes in the data from an air data inertial reference unit. [Minor safety issue]

Source: ATSB Report, ¶6.1

  • On 15 October 2008, the aircraft manufacturer issued Operations Engineering Bulletin (OEB) OEB-A330-74-1, which was applicable to all A330 aircraft fitted with Northrop Grumman ADIRUs. The OEB stated that, in the event of a NAV IR [1, 2 or 3] FAULT (or an ATT red flag being displayed on either the captain’s or first officer’s primary flight display), the flight crew were required to select the air data reference (ADR) part of the relevant ADIRU OFF and then select the relevant inertial reference (IR) part of the relevant ADIRU OFF. The problem was described as a ‘significant operational issue’ and operators were advised to inform their pilots of the OEB without delay and insert the procedure in the Flight Crew Operations Manual. A compatible temporary revision was issued to the Minimum Master Equipment List at the same time.
  • The OEB procedure was subsequently amended in December 2008 to cater for a situation where the IR and ADR pushbuttons were selected OFF and the OFF lights did not illuminate. The new OEB (A330-74-3) required crews to select the IR mode rotary selector to the OFF position if the lights did not illuminate.
  • Following the 27 December 2008 occurrence, the aircraft manufacturer issued another OEB (A330-74-4, 4 January 2009). This OEB provided a revised procedure for responding to a similar ADIRU-related event to ensure incorrect data would not be used by other aircraft systems. The procedure required the crew to select the relevant IR OFF, select the relevant ADR OFF, and then turn the IR mode rotary selector to the OFF position.

Source: ATSB Report, ¶7.1.1, Procedural changes issued by Airbus


5

Postscript

In the years since this incident, the possibility has emerged that the ADIRU problem was caused by a "Cosmic Bit Flip." The following video does a great job describing the issue and gets into Qantas 72 very briefly starting at 15:11, but the entire video is worth watching:

References

(Source material)

In-flight upset 154 km west of Learmonth, WA, 7 October 2008, VH-QPA, Airbus A330-303, Australian Transport Safety Bureau (ATSB) AO-2008-070, December 2011.

O'Sullivan, Matt, The untold story of QF72: What happens when 'psycho' automation leaves pilots powerless?, The Sydney Morning Herald.