Power8 System Firmware

Applies to:   9119-MHE and 9119-MME.

This document provides information about the installation of Licensed Machine or Licensed Internal Code, which is sometimes referred to generically as microcode or firmware.


1.0 Systems Affected

This package provides firmware for Power System E880 (9119-MHE) and Power System E870 (9119-MME) servers only.

The firmware level in this package is SC820_099 / FW820.40.

1.1 Minimum HMC Code Level

This section describes the "Minimum HMC Code Level" required by the System Firmware to complete the firmware installation process. The HMC level must be equal to or higher than the "Minimum HMC Code Level" before the system firmware update is started.  If the HMC managing the server targeted for the System Firmware update is running a code level lower than the "Minimum HMC Code Level", the firmware update will not proceed.

The Minimum HMC Code level for this firmware is:  HMC V8 R8.2.0 Service Pack 1  (PTF MH01455).

Although the Minimum HMC Code level for this firmware is listed above,  updating or upgrading to one of the following levels is recommended.

                   HMC V8 R8.2.0 Service Pack 2 (PTF MH01488) with fix (PTF MH01624) or higher.
                   -OR-
                   HMC V8 R8.3.0 Service Pack 2 (PTF MH01584) with fix (PTF MH01625) or higher.

Note: Updating the HMC to V8 R8.2.0 Service Pack 1 is required prior to installing this firmware.  Details on this requirement can be found in the firmware information description table.
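
As a pre-check before starting the update, the comparison against the Minimum HMC Code Level can be sketched in Python. This is only a minimal sketch: the helper names parse_hmc_level and meets_minimum are hypothetical, and the string layout they accept ("V8 R8.2.0 Service Pack 1") simply mirrors how levels are written in this document. It is not an HMC interface, and it does not look at individual PTF fixes.

```python
import re

# Hypothetical helper: parses level strings written the way this document
# writes them, e.g. "HMC V8 R8.2.0 Service Pack 1". Not an HMC API.
_LEVEL = re.compile(r"V(\d+)\s+R(\d+)\.(\d+)\.(\d+)(?:\s+Service Pack\s+(\d+))?")


def parse_hmc_level(text):
    """Return (version, release, modification, fix, service_pack) as ints."""
    match = _LEVEL.search(text)
    if match is None:
        raise ValueError(f"unrecognized HMC level: {text!r}")
    # A missing "Service Pack N" suffix is treated as service pack 0.
    return tuple(int(part) if part else 0 for part in match.groups())


def meets_minimum(current, minimum="HMC V8 R8.2.0 Service Pack 1"):
    """True when the current level is equal to or higher than the minimum."""
    return parse_hmc_level(current) >= parse_hmc_level(minimum)
```

For example, meets_minimum("HMC V8 R8.3.0 Service Pack 2") is true, while meets_minimum("HMC V8 R8.2.0") is false because the required Service Pack 1 is missing.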

For information concerning HMC releases and the latest PTFs,  go to the following URL to access Fix Central:
http://www-933.ibm.com/support/fixcentral/

For specific fix level information on key components of IBM Power Systems running the AIX, IBM i and Linux operating systems, we suggest using the Fix Level Recommendation Tool (FLRT):
http://www14.software.ibm.com/webapp/set2/flrt/home

NOTES:
                - You must be logged in as hscroot in order for the firmware installation to complete correctly.
                - Systems Director Management Console (SDMC) does not support this System Firmware level.

1.2 AIX iFix Required

For IBM Power System servers with the PCIe 2-port Async EIA-232 Adapter installed on AIX partitions, an AIX fix resolving the async port interrupt handling (APAR IV77596) must be installed before updating to the SC820_091 (FW820.30) or later level of firmware.  The ports on the adapter (feature code EN27/EN28, CCIN 57D4) may become unusable with the installation of that firmware level due to an issue with how interrupts are handled.  Many JAS_RTS error log entries are written to the error log due to this issue.

Prior to this APAR shipping in a future Service Pack, AIX intends to publish ifixes for the latest Service Packs on all active Technology Levels on the AIX FTP server at ftp://aix.software.ibm.com/aix/ifixes/iv77596/ on or before Oct 13, 2015.  If you need an ifix other than the ones on this server, contact IBM support to request one for your specific situation.

The procedure is intended to be performed by the customer.  If you have questions or concerns about the procedure, contact IBM Support:
US Support: 1.800.IBM.SERV
WW Support (select your country):  http://www.ibm.com/planetwide/

2.0 Important Information

Recently, several enhancements were released to improve the reliability and function of new and existing adapters used on Power8 systems. To ensure the highest level of availability and performance, it is important that the following System Firmware, IO, AIX & VIOS maintenance is performed.  For efficiency, IBM recommends that all applicable System Firmware, IO, AIX & VIOS maintenance is consolidated and performed during the same session to reduce the number of scheduled maintenance windows.

System F/W: SC820_048 / FW820.02 (or higher)
- For systems in PowerVM mode, a problem was fixed for unresponsive PCIe adapters after a partition power off or a partition reboot.

I/O:
- Device: PCIe2 4-Port (10GbE SFP+ & 1GbE RJ45) Adapter
   Feature Codes: EN0S EN0T EN0U EN0V
   Version: 30090140 (or higher)
   An enhancement was added to support Network Installation on 1GB speed switch ports.

- Device: PCIe2 2-Port 10GbE Base-T Adapter
   Feature Codes: EN0W EN0X
   Version: 20110140 (or higher)
   Fixes a Network Installation issue seen with 1GB speed switch port setting.

AIX/VIOS:
- VIOS 2233/61 TL09 SP3: IV63449
- AIX 71 TL03 SP03: IV63680

For Power8 systems using NIC adapter Feature Codes (FC) EN0U, EN0V, EN0S, EN0T, EL3Z, EN0W, EN0X which translate to:
PCIe2 4-Port Adapter (10GbE SFP+)
PCIe2 4-Port Adapter (1GbE RJ45)
PCIe2 2-Port 10GbE Base-T Adapter

These APARs correct a problem that occurs when promiscuous mode is not set when the adapter gets reset (e.g. when the adapter becomes the backup in SEA failover mode or encounters a transmit error). This would cause the adapter to transmit packets but not receive packets.

Downgrading firmware from any given release level to an earlier release level is not recommended.

If you feel that it is necessary to downgrade the firmware on your system to an earlier release level, please contact your next level of support.

IPv6 Support and Limitations

IPv6 (Internet Protocol version 6) is supported in the System Management Services (SMS) in this level of system firmware. There are several limitations that should be considered.

When configuring a network interface card (NIC) for remote IPL, only the most recently configured protocol (IPv4 or IPv6) is retained. For example, if the network interface card was previously configured with IPv4 information and is now being configured with IPv6 information, the IPv4 configuration information is discarded.

A single network interface card may only be chosen once for the boot device list. In other words, the interface cannot be configured for the IPv6 protocol and for the IPv4 protocol at the same time.

Concurrent Firmware Updates

Concurrent system firmware update is supported on HMC Managed Systems only.

Memory Considerations for Firmware Upgrades

Firmware Release Level upgrades and Service Pack updates may consume additional system memory.
Server firmware requires memory to support the logical partitions on the server. The amount of memory required by the server firmware varies according to several factors.
Factors influencing server firmware memory requirements include the following:
- Number of logical partitions
- Partition environments of the logical partitions
- Number of physical and virtual I/O devices used by the logical partitions
- Maximum memory values given to the logical partitions

Generally, you can estimate the amount of memory required by server firmware to be approximately 8% of the system installed memory. The actual amount required will generally be less than 8%. However, there are some server models that require an absolute minimum amount of memory for server firmware, regardless of the previously mentioned considerations.
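
The 8% rule of thumb can be turned into a quick planning figure. The following is a minimal sketch; the function name estimate_firmware_memory_gb and the 1024 GB system in the usage note are hypothetical, and the result is only an upper estimate that ignores any model-specific absolute minimum.

```python
def estimate_firmware_memory_gb(installed_gb, fraction=0.08):
    """Rough upper estimate of memory used by server firmware, in GB.

    `fraction` defaults to the approximate 8% figure quoted in this
    document; the actual amount is generally lower.
    """
    if installed_gb < 0:
        raise ValueError("installed memory cannot be negative")
    return installed_gb * fraction
```

For a hypothetical system with 1024 GB installed, the estimate is about 82 GB set aside for server firmware.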

Additional information can be found at:
http://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8hat/p8hat_lparmemory.htm


3.0 Firmware Information

Use the following examples as a reference to determine whether your installation will be concurrent or disruptive.

For systems that are not managed by an HMC, the installation of system firmware is always disruptive.

Note: The concurrent levels of system firmware may, on occasion, contain fixes that are known as Deferred and/or Partition-Deferred. Deferred fixes can be installed concurrently, but will not be activated until the next IPL. Partition-Deferred fixes can be installed concurrently, but will not be activated until a partition reactivate is performed. Deferred and/or Partition-Deferred fixes, if any, will be identified in the "Firmware Update Descriptions" table of this document. For these types of fixes (Deferred and/or Partition-Deferred) within a service pack, only the fixes in the service pack which cannot be concurrently activated are deferred.

Note: The file names and service pack levels used in the following examples are for clarification only, and are not necessarily levels that have been, or will be released.

System firmware file naming convention:

01SCxxx_yyy_zzz

xxx is the release level
yyy is the service pack level
zzz is the last disruptive service pack level

NOTE: Values of service pack and last disruptive service pack level (yyy and zzz) are only unique within a release level (xxx). For example, 01SC820_040_040 and 01SC830_040_040 are different service packs.

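An installation is disruptive if:

The release levels (xxx) are different.
            Example: Currently installed release is 01SC820_040_040, new release is 01SC830_050_050.

The service pack level (yyy) and the last disruptive service pack level (zzz) are equal.
            Example: SC820_040_040 is disruptive, no matter what level of SC820 is currently installed on the system.

The service pack level (yyy) currently installed on the system is lower than the last disruptive service pack level (zzz) of the service pack to be installed.
            Example: Currently installed service pack is SC820_040_040 and new service pack is SC820_050_045.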

An installation is concurrent if:

The release level (xxx) is the same, and
The service pack level (yyy) currently installed on the system is the same or higher than the last disruptive service pack level (zzz) of the service pack to be installed.

Example: Currently installed service pack is SC820_040_040, new service pack is SC820_071_040.
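
The rules and examples above can be checked mechanically. The following Python sketch is an illustration only: parse_level and is_concurrent are hypothetical names, and the logic is inferred from the naming convention and examples in this section rather than taken from the HMC.

```python
import re

# Accepts names with or without the leading "01", e.g. "01SC820_071_040".
_NAME = re.compile(r"^(?:01)?SC(\d{3})_(\d{3})_(\d{3})$")


def parse_level(name):
    """Split a level name into (xxx, yyy, zzz) integers."""
    match = _NAME.match(name)
    if match is None:
        raise ValueError(f"unrecognized firmware level: {name!r}")
    return tuple(int(field) for field in match.groups())


def is_concurrent(installed, new, hmc_managed=True):
    """Return True if installing `new` over `installed` is concurrent."""
    if not hmc_managed:
        return False  # without an HMC the installation is always disruptive
    inst_release, inst_sp, _ = parse_level(installed)
    new_release, new_sp, new_last_disruptive = parse_level(new)
    if inst_release != new_release:
        return False  # release levels (xxx) differ
    if new_sp == new_last_disruptive:
        return False  # the new service pack is itself a disruptive level
    # Concurrent only if the installed yyy has reached the new zzz.
    return inst_sp >= new_last_disruptive
```

With the examples in this section, is_concurrent("SC820_040_040", "SC820_071_040") is true, while is_concurrent("SC820_040_040", "SC820_050_045") is false.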

3.1 Firmware Information and Description

 
Filename               Size        Checksum
01SC820_099_047.rpm    73168049    53811

Note: The Checksum can be found by running the AIX sum command against the rpm file (only the first 5 digits are listed).
For example: sum 01SC820_099_047.rpm
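
Where the AIX sum command is not at hand, the size and checksum can be cross-checked with a short script. This sketch assumes that sum without flags uses the byte-by-byte 16-bit rotating (BSD-style) algorithm; that assumption should be confirmed against the real sum output before relying on the result. The function names bsd_sum and verify_package are hypothetical.

```python
import os


def bsd_sum(path, chunk_size=1 << 20):
    """16-bit rotating checksum (BSD-style algorithm), read in chunks."""
    checksum = 0
    with open(path, "rb") as handle:
        while chunk := handle.read(chunk_size):
            for byte in chunk:
                # Rotate right by one bit within 16 bits, then add the byte.
                checksum = (checksum >> 1) | ((checksum & 1) << 15)
                checksum = (checksum + byte) & 0xFFFF
    return checksum


def verify_package(path, expected_size, expected_checksum):
    """Compare a downloaded file against the published size and checksum."""
    return (os.path.getsize(path) == expected_size
            and bsd_sum(path) == expected_checksum)
```

For this package the call would be verify_package("01SC820_099_047.rpm", 73168049, 53811). The pure-Python loop is slow on a file of this size, so the native sum command remains the practical choice when it is available.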

SC820
For Impact, Severity, and other firmware definitions, refer to the 'Glossary of firmware terms' at the following URL:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs

The complete Firmware Fix History for this Release Level can be reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html

SC820_099_047 / FW820.40

05/04/16
Impact:  Availability      Severity:  SPE

New Features and Functions

  • Support was added for the Stevens6+ option of the internal tray loading DVD-ROM drive with F/C #EU13.  This is an 8X/24X(max) Slimline SATA DVD-ROM Drive.  The Stevens6+ option is a FRU hardware replacement for the Stevens3+.  MTM 7226-1U3 (Oliver)  FC 5757/5762/5763 attaches to IBM Power Systems and lists Stevens6+ as optional for Stevens3+.  If the Stevens6+ DVD drive is installed on the system without the required firmware support, the boot of an AIX partition will fail when the DVD is used as the load source.  Also, an IBM i partition cannot consistently boot from the DVD drive using D-mode IPL.  An SRC C2004130 may be logged for the load source not found error.

System firmware changes that affect all systems

  • A problem was fixed for a system IPL hang at C100C1B0 with SRC 1100D001 when the power supplies have failed to supply the necessary 12-volt output for the system.   The 1100D001 SRC was calling out the planar when it should have called out the power supplies.  With the fix, the system will terminate as needed and call out the power supply for replacement.  One mode of power supply failure that could trigger the hang is sync-FET failures that disrupt the 12-volt output.
  • A problem was fixed for the callout of a VPD collection fault and system termination with SRC 11008402 to include the 1.2vcs VRM FRU.  The power good fault for the 1.2 volts would be a primary cause of this error.  Without the fix, the VRM is missing in the callout list and only has the VPDPART isolation procedure.
  • On multi-node systems with a power fault, a problem was fixed for On-Chip Controller errors caused by the power fault being reported as predictive errors for SRC B1602ACB.  These have been corrected to be informational error logs.  If running without the fix, the predictive and unrecoverable errors logged for the OCC on loss of power to the node can be ignored.
  • A problem was fixed for excessive logging of the SRC 11002610 on a power good (pgood) fault when detected by the Digital Power Subsystem Sweep (DPSS).  Multiple pgood interrupts are signaled by the DPSS in the interval between the first pgood failure and the node power down.  A threshold was added to limit the number of error logs for the condition.
  • A problem was fixed for redundant logging of the SRC B1504804 for a fan failure, once every five seconds.  With the fix, the failure is logged only at the initial time of failure in the IPL.
  • A problem was fixed for a false unrecoverable error (UE) logged for B1822713 when an invalid cooling zone is found during the adjustment of the system fan speeds.  This error can be ignored as it does not represent a problem with the fans.
  • On a multi-node system,  a problem was fixed for a power fault with SRC 11002610 having incorrect FRU callouts.  The wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node system.  Instead of calling out the processor FRU, the enclosure FRU is called out.  The first FRU callout is correct.
  • A problem was fixed for a processor clock failover error with SRC B158CC62 calling out all processors instead of isolating to the suspect processor.  The callout priority correctly has a clock and a procedure callout as the highest priority, and these should be performed first to resolve the problem before moving on to the processors.
  • A problem was fixed for a system checkstop caused by an L2 cache least-recently used (LRU) error that should have been a recoverable error for the processor and the cache.  The cache error should not have caused an L2 HW CTL error checkstop.
  • A problem was fixed for priority callouts for system clock card errors with SRC B158CC62.  These errors had high priority callouts for the system clock card and medium callouts for FRUs in the clock path.  With the fix, all callouts are set to medium priority as the clock card is not the most probable FRU to have failed but is just a candidate among the many FRUs along the clock path.
  • A problem was fixed for PCIe switch recovery to prevent a partition switch failure during the IPL with error logs for SRC B7006A22 and B7006971 reported.  This problem can occur when doing recovery for an informational error on the switch.  If this problem occurs, the partition must be restarted to recover the affected I/O adapters.
  • A problem was fixed to correct the error messages for early failures in the Live Partition Mobility (LPM) migration of a partition.  The management console might report an unrelated error such as  "HSCLA27E The operation to lock the physical device location for target adapter" when the actual error might be not enough available memory on the target CEC to run the migration.  With the fix, the correct error code is returned so there is enough information to correct the error and retry the migration.
  • A problem was fixed for a hypervisor task hang during a FRU exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the entire drawer to power off and power on again.  The activation phase for the power on may never complete if a very rare sequence of events occurs during the power on step.  The FRUs to exchange that would cause the expansion drawer to power off  and power on are the following:  midplane, I/O module, I/O module VRM, chassis management card (CMC), cable card, and active optical cable.
  • A problem was fixed for PCIe adapter hangs and network traffic error recovery during Live Partition Mobility (LPM) and SR-IOV vNIC (virtual ethernet adapter)  operations.  An error in the PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all subsequent network traffic through the PHB.  The PHB  error recovery was enhanced to flush the PHB L3 cache to allow network traffic to resume.
  • A problem was fixed for a Qualys network scan for security vulnerabilities causing a core dump in the Intelligent Platform Management Interface (IPMI)  process on the service processor with SRC B181EF88.  The error occurs anytime the Qualys scan is run because it sends an invalid IPMI session id that should have been handled and discarded without a core dump.
  • A problem was fixed for error recovery from failed Live Partition Mobility (LPM) migrations.  The recovery error is caused by a partition reset that leaves the partition in an unclean state with the following consequences:  1) A retry on the migration for the failed source partition may not be allowed; and 2) With enough failed migration recovery errors, it is possible that any new migration attempts for any partition will be denied.  This error condition can be cleared by a re-IPL of the system. The partition recovery error after a failed migration is much more likely to occur for partitions managed by the Integrated Virtualization Manager (IVM) but it is still possible to occur for Hardware Management Console (HMC) managed partitions.
  • A problem was fixed for an L2 cache error on the service processor that caused the service processor to reset or go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the L2 cache error is handled as a corrected single-bit error with no error raised to the service processor, so it can continue normal processing.  The L2 cache data error that causes this failure is infrequent, and the service processor fails only after exceeding its limit of three resets in fifteen minutes, so the service processor failure rate for this problem is low.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification. The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • A problem was fixed for a hypervisor adjunct partition that failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter configuration.  Without the fix, the system must be re-IPLed to correct the adjunct error.  This error is infrequent and can only occur if an adapter port configuration is being changed at the same time that error recovery is occurring for the adapter.
  • A problem was fixed for a missing error log when a clock card fails over to the backup clock card.  This problem causes loss of redundancy on the clock cards without a callout notification that there is a problem with the FRU.  If the fix is applied to a system that had a failed clock, that condition will not be known until the system is IPLed again, when an error log and callout of the clock card will occur if it is in a persisted failed state.
  • A problem was fixed for the service processor going to the reset state instead of the termination state when the anchor card is missing or broken.  At the termination state, the Advanced System Management Interface (ASMI) can be used to collect failure data and debug the problem with the anchor card.

System firmware changes that affect certain systems

  • On systems with AIX or Linux encapsulated state partitions, a problem was fixed for a Live Partition Mobility migration failure for the encapsulated state partitions.  The migration fails on the target CEC when the associated paging space needed to support the encapsulated state is not available.  Removing the "Encapsulated State" attribute from the partition would allow the migration to succeed.  However, removing this attribute can only be accomplished if the partition is in the powered off state.  Encapsulated State partitions are needed for the remote restart feature.  An encapsulated state partition is a partition in which the configuration information and the persistent data are stored external to the server on persistent storage.  A partition that supports remote restart can be restarted remotely.  For more information on the remote restart feature, refer to this IBM Knowledge Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
  • For Integrated Virtualization Manager (IVM) managed systems with more than 64 active partitions, a problem was fixed for recovery from Live Partition Mobility (LPM) errors.  Without the fix, the IVM  managed system partition can appear to still be running LPM after LPM has aborted, preventing retries of the LPM operation.  In this case, the partition must be stopped and restarted to clear the LPM error state.  The problem is not frequent because it requires a failed LPM on a partition with a partition ID that is greater than 64.
  • On systems with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Run-Time Abstraction Services (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.   This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side.  Without the fix,  all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • A problem was fixed for an incorrect date in partitions created with a Simplified Remote Restart-Capable (SRR) attribute where the date is created as Epoch 01/01/1970 (MM/DD/YYYY).  Without the fix, the user must change the partition time of day when starting the partition for the first time to make it correct.  This problem only occurs with SRR partitions.
  • On systems using PowerVM firmware with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.

SC820_091_047 / FW820.30

11/18/15
Impact:  Availability      Severity:  HIPER

New Features and Functions

  • The firmware code update process was enhanced with a feature to block a firmware "downgrade" to a level that is below the system's manufactured code level.
  • Support was added to the Advanced System Management Interface (ASMI) to be able to add an IPv4 static route definition for each ethernet interface on the service processor.  Using a static route definition,  a Hardware Management Console (HMC) configured on a private subnet that is different from the service processor subnet is now able to connect to the service processor and manage the CEC.  A static route persists until it is deleted or until the service processor settings are restored to manufacturing defaults.  The static route is managed with the ASMI panel "Network Services/Network Configuration/Static Route Configuration" IPv4 radio button.  The "Add" button is used to add a static route (only one is allowed for each ethernet interface) and the "Delete" button is used to delete the static route.

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for recovering from embedded MultiMediaCard (eMMC) flash NAND errors and three other low-level boot errors that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  Other low-level boot errors included in this fix:
    1) A system reset to clear the boot registers may be erroneously handled as a chip reset causing the service processor to enter a stopped state and become unresponsive.
    2) Recovery was improved for a defective file system partition table that causes the service processor to lose the ability to perform a P and T (Permanent and Temporary) side switch.
    3) A dump partition full condition no longer causes a failure, as this is normal when a service processor has a maximum number of service processor dumps active.
    For each of these issues, on systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.
  • HIPER/Non-Pervasive: A problem associated with workloads using transactional memory on PowerVM was discovered and is fixed in this service pack. The effect of the problem is non-deterministic but may include undetected corruption of data.
  • HIPER/Non-Pervasive:  A problem was fixed for recovery from PNOR flash memory corruption that causes the IPL to fail with SRC D143900C.  This is very rare and only has happened in IBM internal labs.  Without the fix, the service processor cannot correct the corruption in the PNOR.  If a system has the problem SRC and  cannot IPL,  then that system must be disruptively firmware updated to apply the fix to be able to IPL again.
  • DEFERRED:  A problem was fixed for memory on-die termination (ODT) settings to improve the signal integrity of the memory channel.
  • DEFERRED:  A problem was fixed for a TCP/IP performance degradation on PCIe ethernet adapters with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).  By adjusting the system memory caching, a significant improvement was made to the data throughput speed to restore performance to expected levels.  This fix requires a system re-IPL to take effect.
  • DEFERRED:  A problem was fixed for a hang in the processor and cache memory that causes a system checkstop with SRC B181E540 logged with a processor FRU callout.  The error log details include  "Description:  Runtime diagnostics has detected a problem on a memory bus" and "Signature Description:  mcs(n0p0c6) (MCIFIR[40]) CHANNEL TIMEOUT ERROR" and "Multi-Signature List:  ex(n0p0c14) (L3FIR[24]) L3 Hw Control Error".  The trigger for the hang error is speculative DMA partial writes into cache and the frequency of the error varies with the workload, but may happen several times a month.  A re-IPL of the system is needed for this fix to take effect after a concurrent firmware update of the service pack.
  • A problem was fixed for certain error logs not being reported to the OS.  The error occurs when the hypervisor is not ready to receive an error log message and rejects it.  The error log handler on the service processor was not retrying until the error log was successfully delivered.  Until the fix is applied, there will be a small loss of error logs when the hypervisor is initializing during the IPL as these will get discarded until the hypervisor is ready.  The missing error logs may be viewed from the service processor using the Advanced System Management Interface (ASMI) or may be viewed as serviceable events on the management console if there is one attached.
  • A problem was fixed for the error reporting of multiple AC power losses so that all occurrences of the power losses are logged.  With the problem, only the first AC power loss for SRC 10001510 is reported, with subsequent power faults not being reported.  Until the fix is applied, a re-IPL of the CEC will re-enable power supply problem reporting.
  • A problem was fixed for an SRC 11002613 logged during a concurrent repair of a power supply.  This SRC was erroneously logged and did not represent a real problem.
  • A problem was fixed for an intermittent SRC B1504804 logged on a re-IPL of the CEC but that did not result in an IPL failure.  This problem is an inability of the service processor to do a read from the IIC bus resulting from incorrect device lock management.  This problem has no adverse impact on the system other than a predictive error log and can be ignored until the fix is applied.
  • A problem was fixed for a bad Time of Day (TOD) battery with SRC B15A3305 calling out the P1 Backplane instead of the P1-E2 Battery.  This occurs whenever the TOD battery becomes bad.  Until the fix is applied, always replace the battery FRU for this SRC as the first repair action.
  • A problem was fixed for the capture of the registers for the Hostboot Self-Boot Engine (SBE) for SBE failures.  These registers had been missing from failure data for SBE failures, making these problems more difficult to debug.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error message of "Error in function 'connect", error code 111" when a browser attempted to connect before the service processor was ready.  The browser connection through the web server is now held off until the ASMI process is ready after a reset of the service processor or an AC power cycle of the system.  Until the fix is applied, the ASMI user can wait one or two minutes and then retry the operation.
  • A problem was fixed for an incorrect call home for SRC B1818A0F.  This call home can be ignored.  It occurs rarely only in the case of dynamic IP configuration for the service processor when it fails to acquire an IP address from the Dynamic Host Configuration Protocol (DHCP) server.  Until the fix is applied, use the information from the SRC and network topology to understand why the DHCP client cannot acquire an IP address as this is normally a network configuration error.
  • A problem was fixed for a system dump re-IPL that failed with SRC B1818601 and B181460B after processor core checkstops had terminated the system.  The failed processor cores created a complex condition that prevented a successful dump collection of all the hardware objects.  Until the fix is applied, the checkstop processor problems will have to be debugged with partial data from the degraded dump collections that have the failure SRCs.
  • A problem was fixed for an infrequent service processor database corruption during concurrent firmware update that caused the system to terminate with a UIRA impact to the customer.  The cause of the database corruption is undetermined but the problem is resolved by the service processor making a backup of the data that can be restored, if needed, to allow the firmware updates to complete successfully.
  • A problem was fixed for Advanced System Management Interface (ASMI) TTY to allow "admin" passwords to be greater than eight characters in length to be consistent with prior generations of the product.  The ASMI web interface works correctly for user "admin" passwords with no truncation in the length of the passwords.
  • A problem was fixed for a local clock card (LCC)  failure with SRC 11001515 that was missing a part number and location code.  This information has been added for LCC faults so the FRU to replace is properly identified.
  • A problem was fixed for a defective PCI oscillator in the local clock card (LCC) with SRC BC58090F that caused an IPL failure for the node instead of failing over to the redundant LCC.  For a multi-node system,  the failure is isolated to the node with the bad LCC and the other nodes are able to IPL.
  • A problem was fixed for a service processor dump with error logs  B181E911 and B181D172 during an IPL.  The error logs were for the detection of defunct processes but otherwise the IPL was successful.
  • A problem was fixed for missing Keyword (KW) and Resource ID (RID) for SRC B181A40F.
  • A problem was fixed for an I2C bus lock error during a CEC power off that caused a ten-minute delay for the power off and error log SRCs B1561314 and B1814803 with error number (errno) 3E.
  • A problem was fixed for Advanced System Management Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity" being missing with the system IPLed and partitions running.  The help text, shown below, is now available for the system in the powered on state as well as in the powered off state.
    "I/O Adapter Enlarged Capacity
    This option controls the size of PCI memory space allocated to each PCI slot.
    When enabled, the selected number of PCI slots, including those in external I/O subsystems, receive the larger DMA and memory mapped address space.
    Some PCI adapters may require this additional DMA or memory space, per the adapter specification.
    This option increases system mainstore allocation to these selected PCI slots.
    Enabling this option may result in some PCI host bridges and slots not being configured because the installed mainstore is insufficient to configure all installed PCI slots."
  • A problem was fixed for recovering from a misplug of the service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the plug locations are reversed from what would be a proper connection.  Without the fix, the bad FSI connections cause the service processors to go to the service processor stop state.  With the fix applied, the error logs call out the bad cables so they can be repaired and the service processor remains in a working state.
  • For a partition that has been migrated with Live Partition Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a Main Storage Dump (MSD) IPL failing with SRC B2006008.  The MSD IPL can happen after a system failure and is used to collect failure data.  If the partition is rebooted anytime after the migration, the problem cannot happen.  The potential for the problem existed between the active migration and a partition reboot.
  • A problem was fixed for partial loss of Entitlement for On/Off Memory Capacity On Demand (also called Elastic COD).  Users with large amounts of Entitlement on the system of greater than "65535 GB * Days" could have had a truncation of the Entitlement value on a re-IPL of the system.  To recover lost Entitlement, the customer can request another On/Off Enablement Code from IBM support to "re-fill" their entitlement.
  • A problem was fixed for a management console command line failure with a return code 0x40000147 (invalid lock state) when trying to delete SR-IOV shared mode configurations.  This could have occurred if the adapter slot had been re-purposed without involvement of the management console and was owned and operational at the time of the requested delete.  With the fix, the current ownership of the slot is honored and only the SR-IOV shared mode configuration data is deleted on the force delete.
  • A problem was fixed for an incorrect restriction on the amount of "Unreturned"  resources allowed for a Power Enterprise Pool (PEP).  PEP allows for logical moving of resources (processors and memory) from one server to another.  Part of this is 'borrowing' resources from one server to move to another. This may result in "Unreturned" resources on the source server. The management console controls how many total "Unreturned" PEP resources can exist.  For this problem,  the user had some "Unreturned" PEP memory and asked to borrow more but this request was incorrectly refused by the hypervisor.
  • On systems where memory relocation (as done by using Live Partition Mobility (LPM)) and a partition reboot are occurring simultaneously, a problem for a system termination was fixed.  The potential for the problem existed between the active migration and the partition reboot.
  • A problem was fixed that was corrupting the Update Access Key (UAK) date with a corrupted date of "1900".   The user should correct the UAK date, if needed, to allow the firmware update to proceed, by using the original UAK key for the system.  On the Management Console,  enter the original update access key via the "Enter COD Code" panel. Or on the Advanced System Management Interface (ASMI),  enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for recovery from unaligned addresses for MSI interrupts from PCIe adapters.  The recovery prevents an adapter timeout caused by resource exhaustion.  With the fix, the resources for each bad interrupt are returned, allowing the PCIe adapter to continue to run for the normal traffic.
  • A problem was fixed for a machine check incorrectly issued to an IBM i partition running 7.2 or later with 4K sector disks.
  • A problem was fixed for an extraneous PCIe switch SRC B7006A22 being called out when there is a valid PCIe expansion drawer cable problem with SRC B7006A88 reported.  The callout for SRC B7006A22 should be ignored as the PCIe switch hardware is working for this case.
  • A problem was fixed for a Network boot/install failure using bootp in a network with switches using the Spanning Tree Protocol (STP).  A Network boot/install using lpar_netboot on the management console was enhanced to allow the number of retries to be increased.  If the user is not using lpar_netboot, the number of bootp retries can be increased using the SMS menus.  If the SMS menus are not an option, the STP in the switch can be set up to allow packets to pass through while the switch is learning the network configuration.
  • A problem was fixed for PCIe3 adapters failing when requesting more than 32 Message Signaled Interrupts (MSI-X).  The adapter may fail to ping or cause OS tasks to hang that are using the adapter.  This problem was found specifically on the 10 Gb Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and #5769 and on the 56 Gb Infiniband (IB) Fourteen Data Rate (FDR) adapter with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.  However, other PCIe adapters may also be affected.
  • A security problem was fixed for an OpenSSL specially crafted X.509 certificate that could cause the service processor to reset in a denial-of-service (DOS) attack.  The Common Vulnerabilities and Exposures issue number is CVE-2015-1789.
  • A problem was fixed for false errors reported with SRC B1812663 for the On-Chip Controller (OCC).  These error logs can be ignored as these are caused by a prior error log using a buffer that is not properly sized for the log data.
  • A problem was fixed to prevent recoverable power faults of short duration from causing the system to lose power supply redundancy.  Without the fix, the faulted state persisted for the recovered power fault, causing a problem with a system power off if other power supplies were lost at a later time.
  • A problem was fixed to guard a failed processor during an IPL instead of hanging with SRC B1813450 reported to the error log.
  • A problem was fixed for an intermittent PSI link error with SRC B15CDA27 after a firmware update or reset/reload of the service processor.
  • A problem was fixed for hardware system dump collection after a hardware checkstop that was missing scan ring data.  This is a very infrequent problem caused by an error with timing in the multi-threaded dump collection process.  Until this fix is applied, the debug of some hardware dump problems may require doing multiple dump collections to get all the data.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error that occurred when trying to display detail on a deconfigured Anchor Card VPD.  If the error log for the selected deconfiguration record had been deleted, it caused ASMI to core dump.  With the fix,  if the error log for the deconfiguration record is missing, the error log details such as failing SRC for the deconfiguration record are returned as blank.
  • A problem was fixed for an Operations Panel SRC of B1504804 with no FRU callout.  A callout of the failed hardware has been added.
  • A problem was fixed for guarding failed hardware dynamically during the IPL to prevent the IPL from terminating.  Without the fix,  certain hardware failures will not be called out to be handled by the reconfiguration loop.  Until the fix is applied, multiple IPL attempts may be needed if hardware is failing.
  • A problem was fixed for a processor error causing a Hostboot terminate instead of a deconfiguration of the bad hardware and continuation of the IPL.  The state of the processors was synchronized between the service processor and the Hostboot process to correct the error.
  • A problem was fixed for the recovery of a failing PCI clock so that a failover to the backup PCI clock occurs without a node failing and being deconfigured.  Without the fix, the PCI clock does not behave as a redundant FRU and faults on it will cause the CEC to terminate.  A re-IPL of the CEC recovers it from the PCI clock error with the bad clock guarded so that the other PCI clock is used.
  • A problem was fixed so that error logs are now generated for thermal errors detected by the service processor.  Without the fix, thermal errors such as a temperature over the threshold will not get reported in the error log but higher fan speeds will be present as an indicator of the thermal problem.  Until the fix is applied, the error log and call home mechanism cannot be relied on to monitor for system thermal problems.
  • A problem was fixed for processor core checkstops that cause an LPAR outage but do not create hardware errors and service events.  The processor core is deconfigured correctly for the error.  This can happen if the hypervisor forces processor checkstops in response to excessive processor recovery.
  • A problem was fixed for recovery from a processor local bus (PLB) hang on the service processor.  The errant PLB hang recovery would be seen in concurrent firmware updates that, on rare occasions, fail to do a side switch to activate the new level of firmware.  On the management console, the error message would be "HSCF010180E Operation failed ... E302F873 is the error code."  Other than the failed code level activation, the firmware update is successful.  If this problem occurs, the system can be set to the new firmware level by doing a power off from the management console and then doing a power on with side switch selected in the advanced properties.
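The Elastic COD Entitlement item above mentions values greater than "65535 GB * Days" being truncated on a re-IPL.  65535 is the largest value an unsigned 16-bit field can hold, so a loss of this kind would be consistent with the value passing through a 16-bit field.  The release note does not state the actual cause; the Python sketch below is only an illustration under that assumption, and its function names are hypothetical.

```python
# Illustration only. Assumes (not confirmed by the release note) that the
# Entitlement value was squeezed into an unsigned 16-bit field.
UINT16_MAX = 0xFFFF  # 65535, the threshold quoted in the fix description


def store_in_uint16(gb_days: int) -> int:
    """Keep only the low 16 bits, as a too-narrow field would."""
    return gb_days & UINT16_MAX


def entitlement_lost(gb_days: int) -> int:
    """GB * Days that would vanish after the hypothetical truncation."""
    return gb_days - store_in_uint16(gb_days)


if __name__ == "__main__":
    for value in (65535, 65536, 70000):
        print(value, "->", store_in_uint16(value), "lost:", entitlement_lost(value))
```

Values at or below 65535 pass through unchanged in this sketch, which matches the note that only users with larger amounts of Entitlement were affected.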
System firmware changes that affect certain systems
  • On a system with redundant service processors where redundancy is disabled, a problem was fixed for an unrecoverable (UE) SRC B181DA19 being logged on a re-IPL after a checkstop error.  The error log did not interfere with the re-IPL which was successful.  The error log is for an active Processor Support Interface (PSI) link not being found for the backup service processor.  This is a correct condition when redundancy is disabled, so the error log should not have been generated.  Until the fix is applied, the error code can be ignored.
  • On multiple-node systems, a problem was fixed for extraneous error logs after a 12V power fault with SRC 11002610.  After system termination, there are additional 110026B0 and 110026B3 error log entries that can be ignored.
  • On a system with redundant service processors, a problem was fixed for the isolation procedures for an Anchor card error and system VPD collection failure with termination SRC B181A40F.  FSPSP04 and FSPSP06 are no longer called out as part of reporting the VPD collection failure.  FSPSP30 has been updated with isolation steps for this problem, is now called out, and should be used for the problem isolation.  Retain tip H213935 also provides the FRU isolation steps.  Procedure FSPSP30 tries to replace the service processor first.  If that does not work, then the procedure has the Anchor card replaced.
  • On a system with redundant service processors, a problem was fixed for failovers to the backup service processor that caused an On-Chip Controller (OCC) abort.  This placed the CEC in a "safe" mode where it ran at reduced processor clock frequencies to prevent exceeding the power limits while not under OCC control.
  • On a system with an IBM i partition using Active Memory Sharing (AMS),  a problem was fixed for internal memory management errors caused by deleting an IBM i partition that had been powered off in the middle of a Main Storage Dump (MSD).  Until the fix is installed, if an MSD is interrupted for an IBM i partition that has AMS, the partition should be powered on and powered off normally before a delete of the partition is done to prevent errors with unpredictable effects.
  • On systems using PCIe adapters in SR-IOV mode, a problem was fixed for occasional B200F011 and B2009008 SRCs that can occur during an IPL, moving an adapter into SR-IOV mode, or with SR-IOV link up/down activity.
  • On systems using PCIe adapters in SR-IOV mode,  the following problems were addressed with an Avago Technologies adapter firmware update to 10.2.252.1905:  1) Eliminating virtual function (VF) transmit errors during VF resets and 2) Preventing  loss of legacy flow control when an adapter port is connected to a priority flow control (PFC) capable switch.
  • On a system with redundant service processors, a problem was fixed for a firmware update causing an error log server dump with SRC B1818601.  The error log server restarted automatically to recover from the error and the firmware update was successful.
  • On a system with an AIX partition and a Linux partition, a problem was fixed for dynamically moving an adapter that uses DMA from the Linux partition to the AIX partition that caused AIX to fail by going into KDB mode (0c20 crash).  The management console showed the following message for the partition operation:  "Dynamic move of I/O resources failed.  The I/O slot dynamic partitioning operation failed."  The error was caused by Linux using 64K mappings for the DMA window and AIX using 4K mappings for the DMA window, causing incorrect calculations on AIX when it received the adapter.  Until the fix is applied, the adapters that use DMA should only be moved from Linux to AIX when the partitions are powered off.
  • On a system with redundant service processors, a problem was fixed for an IPL failure for a bad service processor cable on the primary service processor with SRCs B1504904 and B18ABAAB logged.  The system should have done an error failover to the backup service processor and continued the IPL to get the partitions running.
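The Linux-to-AIX adapter move item above attributes the failure to the two operating systems mapping the DMA window with different page sizes (64K versus 4K).  The Python sketch below uses a hypothetical 2 GiB window to show how far a size calculation drifts when an entry count produced with one page size is interpreted with the other.  It is an illustration of the arithmetic only, not the firmware or operating system code, and all names in it are assumptions.

```python
KIB = 1024
LINUX_PAGE = 64 * KIB  # 64K mappings, per the fix description
AIX_PAGE = 4 * KIB     # 4K mappings, per the fix description


def mapping_entries(window_size: int, page_size: int) -> int:
    """Number of page mappings needed to cover a DMA window."""
    return -(-window_size // page_size)  # ceiling division


def covered_bytes(entries: int, page_size: int) -> int:
    """Bytes covered by a given number of mappings at a given page size."""
    return entries * page_size


window = 2 * 1024**3  # hypothetical 2 GiB window, not from the release note
entries_64k = mapping_entries(window, LINUX_PAGE)
entries_4k = mapping_entries(window, AIX_PAGE)
# Reading the 64K entry count as if each entry were a 4K page:
misread_size = covered_bytes(entries_64k, AIX_PAGE)

if __name__ == "__main__":
    print("entries at 64K:", entries_64k)
    print("entries at 4K: ", entries_4k)
    print("window seen with the wrong page size:", misread_size, "bytes")
```

In this sketch the two views of the same window differ by a factor of 16, the ratio of the two page sizes.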
SC820_087_047 / FW820.21

09/24/15
Impact:  Performance    Severity:  HIPER

System firmware changes that affect certain systems

  • HIPER/Pervasive:  On systems using PowerVM with shared processor partitions that are configured as capped or in a shared processor pool, there was a problem found that delayed the dispatching of the virtual processors which caused performance to be degraded in some situations.  Partitions with dedicated processors are not affected.   The problem is rare and can be mitigated, until the service pack is applied, by creating a new shared processor AIX or Linux partition and booting it to the SMS prompt; there is no need to install an operating system on this partition.  Refer to help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863 for additional details.
SC820_085_047 / FW820.20

07/16/15
Impact:  Availability      Severity:  SPE

New Features and Functions

  • Support was added to the Advanced System Management Interface (ASMI) to display Anchor card VPD failures in the "Deconfigurations records" menu.

System firmware changes that affect all systems

  • DEFERRED: A problem was fixed for the fabric bus to allow a processor clock failover to be completed without a checkstop of the CEC.   A skew between the primary and secondary processor clock signal was eliminated to fix the problem.
  • DEFERRED: On systems with memory mirroring enabled, a problem was fixed for PowerVM over-estimating its memory needs.  The fix allows more memory to be used by the partitions.  To free up the memory for the partitions that the hypervisor does not need, the CEC must be re-IPLed after the fix is applied.
  • DEFERRED: A problem was fixed for the hypervisor being unable to make a partition configuration change when all licensed memory is in use by the partitions. An insufficient storage error is returned to the management console and the management console may go to the incomplete state for the CEC.  The hypervisor management of memory fragments has been improved so that partition configuration changes can be made when all licensed memory is in use.  To make this additional memory available for the partition changes,  the CEC must be re-IPLed after the fix is applied.
  • A problem was fixed for a missing SRC if the operations panel failed while the system was running.  A B156A023 SRC is now logged if the operations panel fails or is removed while the system is running.
  • A problem was fixed that prevented a second management console from being added to the CEC.  In some cases, network outages caused defunct management console connection entries to remain in the service processor connection table,  making connection slots unavailable for new management consoles.  A reset of the service processor could be used to remove the defunct entries.
  • A problem was fixed for a missing SRC when a Universal Power Interconnect Cable (UPIC) to the system control unit (SCU) failed or became loose while the system was running.  Up to four hot pluggable UPIC cables (#ECCA and #ECCB) provide redundant power to the SCU but only one is needed for operation.  When a UPIC cable fails now, an SRC 11008802 is logged and calls out the loss of one of the redundant power cables.
  • A problem was fixed for a false guarding and call out of a PSI link with SRC B15CDA27.  This failure is very infrequent but sometimes seen after the reset/reload of the service processor during a concurrent firmware update.   Since there is no actual hardware failure, a manual unguarding of the PSI link allows it to be reused.
  • A problem was fixed for the LED lights being interchanged for the Universal Power Interconnect Cable (UPIC) and the GFSP interface card FRUs on the system node.  The GFSP interface card has CCIN 6B2E and part number 00E2598 with location codes of Un-P1-C9-T2 and Un-P1-C10-T2.  The UPIC cables have part numbers 00FX185 and 00FX186 with location codes Un-P1-C9-T1 and Un-P1-C10-T1.
  • A problem was fixed for a CEC power off error with SRC B1818903 logged.  The error causes a dump and reset of the service processor that allows the power off operation to complete.
  • A problem was fixed for a two to four minute delay that could occur when performing an Administrative Failover (AFO) of the service processor.  An On-Chip Controller (OCC) deadlock was occurring in the service processor, leaving both service processors in the backup role.   This error state is automatically corrected by the hypervisor with a host-initiated reset/reload when it cannot find a service processor in the primary role after the delay time-out period.
  • A problem was fixed for losing power capping capability in the On-Chip Controllers (OCCs) after a service processor failover.  When this occurs, a UE B1702A03 SRC is logged by the OCC.  To restore power capping,  shut down all partitions and power off the CEC.  IPL the CEC again to restore power capping.
  • A problem was fixed for the error handling of a Local Clock and Control (LCC) card failure in a system node that triggers a flood of FDAL informational SRCs of B1504800 to the error log, causing the service processor to run out of memory and reset with a failover to the backup service processor.  The LCC has CCIN 682D and part number 00E2394 with location codes Un-P1-C11 and Un-P1-C12 as it is redundant in each system node.
  • A problem was fixed for an IPL failure with SRC B181BC04 when a system node was added to the CEC at service processor standby.  The new system node hardware was not added correctly to the hardware scan ring and an AC power cycle of the CEC was needed to fix the error.
  • A problem was fixed for missing hardware data in system dumps created for hardware checkstops.  A certain class of hardware scan rings were being skipped during the dump collection and these are now included so that all the hardware data is available for problem debug.
  • A problem was fixed for missing "fastarray" data in hardware dump type HWPROC.  The "fastarray" contains debug information for the processor cores.
  • A problem was fixed for the Advanced System Management Interface (ASMI) to allow removal of Hardware Management Console (HMC) connections that have been temporarily disconnected.  In some instances, the ASMI "System Configuration/Hardware Management Consoles" button for  "Remove Connection"  was not being shown.
  • A problem was fixed for the Advanced System Management Interface (ASMI) IPv4 Network Configuration where the IP address was being overwritten by the value in the subnet mask field for the initial values of the panel.  If the network configuration was saved without fixing the IP address, the wrong IP address was also saved.
  • A problem was fixed for missing call outs when having multiple "Memory Card/FRU" failures with SRC B124E504.  There is a call out for the first memory FRU of the failures but any other memory FRUs failing at the same time were not reported.
  • A problem was fixed for Administrative Failover (AFO) having error log SRC B1818601.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.
  • A problem was fixed for an intermittent problem in a CEC IPL where an On-Chip Controller was stuck in a reset loop, logging repeated SRCs for B1702A17, and eventually placed the CEC in safe mode, running at minimum processor clock frequencies.
  • A problem was fixed for errors during a CEC power off with SRCs B1812616 and B1812601.  These occurred if the CEC was powered off immediately after a power on such that the On-Chip Controllers (OCCs) had to shut down during their initialization.
  • A problem was fixed for a highly intermittent IPL failure with SRC B18187D9 caused by a defunct attention handler process.  Without this fix, the IPL will continue to fail until the service processor is reset.
  • A problem was fixed to add the callouts for the fan FRUs for system fan faults with SRCs 11007610, 11007620, and 11007630.  The fan FRU with CCIN 6B42, part number 00E9335, and location code Un-A1 is now included as needed.
  • A problem was fixed for an Administrative Failover (AFO) having error log SRC B185270E.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.   The error log has been made informational as it is a normal occurrence when fan speeds are adjusted.
  • A problem was fixed to allow adding a system node with only one working Local Clock and Control (LCC) card and being able to IPL the system node.  The LCC is redundant, so a broken or missing LCC should not cause an IPL to fail.  The problem can be circumvented by using the Advanced System Management Interface (ASMI) command line on the primary service processor to run this command "rmgrcmd --primary-lcc force-init" and then do the IPL.
  • A problem was fixed for finding the path to the second Local Clock and Control (LCC) card when an LCC card has failed to ensure proper redundancy for the LCC and the system node.
  • A problem was fixed for incorrect FRU callouts for Power Line Disturbance (PLD) and Processor clock errors.
  • A problem was fixed for extra FRU callouts being listed for SRCs with multiple FRU callouts.  The extra callouts are from previous SRCs and should not have been listed for the current error log entry.
  • A problem was fixed for the Advanced System Management Interface (ASMI) being allowed to deconfigure a node in a single-node system.  A safeguard was added so that ASMI can only deconfigure nodes in multi-node CECs.
  • A problem was fixed to include PCIe clocks as part of the minimum hardware check during an IPL.  Previously, no error was logged when a system had no functional PCIe clocks, causing run-time failures for PCIe I/O operations in partitions.
  • A problem was fixed for missing FRU information in SRC 11001515.   SRC 11001515 was logged indicating replacement of power supply hardware, but did not include the location code, the part number, the CCIN, or the serial number.
  • A problem was fixed for concurrent firmware update after concurrent PCIe adapter maintenance (add, remove, exchange, etc.) causing the CEC to enter safe mode with its reduced performance.  In safe mode, the processor voltage/frequency is reduced to a "safe" level where thermal monitoring is not required.  Recovery from safe mode requires a system re-IPL.
  • A problem was fixed for an Administrative Failover (AFO) failing with the backup service processor terminating with UE SRCs B15738FD and B1573838.  This failure  was caused by an intermittent error with the operations panel presence detection during failover.
  • A problem was fixed for an Administrative Failover (AFO) having error log SRC B1814616 and a fwdbserver core dump.  This error did not prevent the AFO from completing as the backup service processor became the primary service processor.
  • A problem was fixed for a hypervisor deadlock that results in the system being in an "Incomplete state" as seen on the management console.  This deadlock is the result of two hypervisor tasks using the same locking mechanism for handling requests between the partitions and the management console.  Except for the loss of the management console control of the system, the system is operating normally when the "Incomplete state" occurs.
  • A problem was fixed for Live Partition Mobility (LPM) migrations of Linux partitions running in P8 compatibility mode.  After an active migration, the resumed partition may experience performance degradation.
  • A problem was fixed for a false error message with error code 0x8006 when creating a virtual ethernet adapter with the Integrated Virtualization Manager (IVM).  The error message can be ignored as the virtual ethernet slot is fully functional.
  • A problem was fixed for the recovery of PCIe adapters for a device outage occurring on the PCIe3 6-slot fanout module from the PCIe3  I/O expansion drawer (#EMX0).  One or more of the adapters on the fanout module failed to recover with SRC BA188002.
  • A problem was fixed for an unexpected interrupt from a PCIe adapter that causes the AIX OS to abend.  The extra interrupt comes in from the adapter before it has been enabled for interrupts, after it has reached End of Information (EOI) for its previous session.  The double interrupt from the adapter has been corrected.
  • On systems using PowerVM, a problem was fixed for the handling of the error of multiple cache hits in the instruction effective-to-real address translation cache (IERAT).  A multi-hit IERAT error was causing system termination with SRC B700F105.  The multi-hit IERAT is now recognized by the hypervisor and reported to the OS where it is handled.
  • A problem was fixed for an MDC D-mode IPL that failed if the MDC load source slots were unoccupied.
  • A problem was fixed for systems with a corrupted date of "1900" showing for the Update Access Key (UAK).  The firmware update is allowed to proceed on systems with a bad UAK date because the override is set for the service pack.  After the fix is installed, the user should correct the UAK date, if needed, by using the original UAK key for the system.  On the Management Console,  enter the original update access key via the "Enter COD Code" panel. Or on the Advanced System Management Interface (ASMI),  enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for a hang during a Dynamic Platform Optimizer (DPO) operation. A system re-IPL was needed to end the DPO operation.
  • A problem was fixed for concurrent firmware updates to a system that needed to be re-IPLed after getting a B113E504 SRC during activation of the new firmware level on the hypervisor.  The code update activate failed if the Sleep Winkle (SLW) images were significantly different between the firmware levels.  The SLW contains the state of the processor and cache so it can be restored after sleep or power saving operations.
  • Support was added for USB 2.0 HUBs so that a keyboard plugged into the USB 2.0 HUB will work correctly at the SMS menus.  Previously, a keyboard plugged into a USB 2.0 HUB was not a recognized device.
  • A problem was fixed for Live Partition Mobility (LPM) to prevent a system failure with SRC B700F103 during LPM operations.  When data is moved by LPM, the underlying firmware code requires that the buffers be 4K aligned; otherwise a system failure could result.  The fixes made now force the buffers to be 4K aligned and if there is still an alignment issue, the LPM operation will fail without impacting the system.
  • A problem was fixed in the run-time abstraction services (RTAS) extended error handling (EEH) recovery for EEH events for SR-IOV Virtual Functions (VFs) to fully reconfigure the VF devices after an EEH event.  Since the physical adapter does recover from the EEH event itself, and there are no error logs generated, it might not be immediately apparent that the VF did not fully reconfigure.  This prevents certain PCIe settings from being established for interrupts and performance settings, leading to unexpected adapter behavior and errors in the partition.
  • A security problem was fixed in OpenSSL where a remote attacker could crash the service processor with a specially crafted X.509 certificate that causes an invalid pointer or an out-of-bounds write. The Common Vulnerabilities and Exposures issue numbers are CVE-2015-0286 and CVE-2015-0287.
  • A problem was fixed for an error log SRC B15738B0 with no FRU callout for an FSI bus error.
  • A problem was fixed for an error log SRC B1504803 with no FRU callout for an IIC bus error.
  • A problem was fixed for a memory error that prevented the CEC from doing an IPL.  The failing DIMM is now deconfigured during the HostBoot part of the IPL and the failing section of the boot is retried to get a successful IPL.
  • A problem was fixed for a checkstop that occurred for a failed Local Clock and Control (LCC) card instead of a failover to the backup LCC card.   The fabric bus erroneously detected a TOD step error during the failover and triggered the checkstop.
  • A problem was fixed for an On-Chip Controller (OCC) failure after a system dump with SRCs B18B2616 and BC822024 reported.  This resulted in the system running with reduced performance in safe mode, where processor clock frequencies are lowered to minimum levels to avoid hardware errors since the OCC is not available to monitor the system.   A re-IPL of the system would resolve the problem.
  • A problem was fixed for new service processor error logs not getting created if too many old error logs exist.  This problem can occur if a large number of small error logs get created and use up all the available inodes (directory entries) for the file system.  The error log garbage collector was not checking the available number of inodes correctly, so it was not always deleting old error logs before attempting to create a new error log.   Without the fix,  this problem will continue until some error logs are purged.
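The LPM item above says the fix forces buffers to be 4K aligned and, if an alignment issue remains, fails the LPM operation rather than the system.  A generic alignment check and round-up can be sketched in Python as follows.  The helper names are hypothetical and this shows the general technique only, not the firmware code.

```python
ALIGNMENT = 4096  # 4K, per the fix description


def is_4k_aligned(addr: int) -> bool:
    """True when the low 12 bits of the address are all zero."""
    return addr & (ALIGNMENT - 1) == 0


def align_up_4k(addr: int) -> int:
    """Round an address up to the next 4K boundary (no change if aligned)."""
    return (addr + ALIGNMENT - 1) & ~(ALIGNMENT - 1)


def checked_buffer(addr: int) -> int:
    """Reject a misaligned buffer so only the one operation fails."""
    if not is_4k_aligned(addr):
        raise ValueError(f"buffer 0x{addr:x} is not 4K aligned")
    return addr


if __name__ == "__main__":
    for addr in (0x2000, 0x2001):
        print(hex(addr), is_4k_aligned(addr), hex(align_up_4k(addr)))
```

Raising an error for the single request, instead of continuing with a misaligned address, mirrors the behavior the item describes: the operation fails and the rest of the system is unaffected.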
SC820_075_047 / FW820.12

05/18/15
Impact: Function         Severity:  ATT

System firmware changes that affect all systems

  • A problem was fixed for the clearing of all guard records associated with one error log entry.  If a FRU is replaced for any of the related guard records, all the related guard records are cleared.  Previously, only the guard record for the replaced FRU was cleared and the association was lost.
  • A fix was made to prevent processor speculative memory loads from the service processor mailbox Direct Memory Access (DMA) area in the CEC memory.  The speculative loads caused memory cache faults and system checkstops with SRC B181E540.
  • A problem was fixed to reduce switching noise on the memory address bus for DIMMs.  Noise on the bus could cause a failure for a marginal DIMM, so this fix has the effect of potentially improving the reliability of the memory.
SC820_070_047 / FW820.11

04/03/15
Impact: Function         Severity:  SPE

System firmware changes that affect certain systems

  • On systems with a large number of memory DIMMs (64 or more) and redundant service processors, a problem was fixed for a firmware update failure with SRC E302F966 when a failover was attempted as part of the firmware update, but the service processors did not change roles.  This also fixes failing Administrative Failovers (AFOs) for systems with large memory.  The performance of the CEC memory initialization was improved to prevent the hypervisor time-outs for service processor failovers.
SC820_067_047 / FW820.10

03/12/15
Impact:  Security      Severity:  HIPER

New Features and Functions

  • Support for setting Power Management Tuning Parameters from the management console (Fixed Maximum Frequency (FMF), Idle Power Save, and DPS Tunables) without needing to use the Advanced System Management Interface (ASMI) on the service processor.  This allows FMF mode to be set by default without having to modify any tunable parameters using ASMI.
  • Support for SSLv3 has been discontinued to reduce security vulnerabilities in the secured connections to the service processor.
  • Support was added for Single Root I/O Virtualization (SR-IOV) that enables the hypervisor to share a SR-IOV-capable PCI-Express adapter across multiple partitions. Two Ethernet adapters are supported with the SR-IOV NIC capability, when placed in the Power E880/E870:
    •    PCIe2 LP 4-port (10Gb FCoE and 1GbE) SR&RJ45 Adapter (#EN0L)
    •    PCIe2 LP 4-port (10Gb FCoE and 1GbE) SFP+Copper and RJ45 Adapter (#EN0J)
    These adapters each have four ports, and all four ports are enabled with SR-IOV function. The entire adapter (all four ports) is configured for SR-IOV or none of the ports is.
    System firmware updates the adapter firmware level on these adapters to 10.2.252.16 when a supported adapter is placed into SR-IOV mode.
    Support for SR-IOV adapter sharing is not yet available for adapters in a PCIe Gen3 I/O Expansion Drawer.
    SR-IOV NIC on the Power E870/E880 is supported by:
    •    AIX 6.1 TL9 SP4 and APAR IV63331, or later
    •    AIX 7.1 TL3 SP4 and APAR IV63332, or later
    •    IBM i 7.1 TR9, or later
    •    IBM i 7.2 TR1, or later
    •    Red Hat Enterprise Linux 6.5, or later
    •    Red Hat Enterprise Linux 7, or later
    •    SUSE Linux Enterprise Server 11 SP3, or later
    •    VIOS 2.2.3.4 with interim fix IV63331, or later

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for a processor clock failover with SRC B158CC62 that caused a system checkstop when the backup clock oscillator did not initialize fast enough.
  • A problem was fixed for the iptables process consuming all available memory, causing an out of memory dump and reset/reload of the service processor.
  • A problem was fixed for a PowerVM hypervisor hang after a processor core and system checkstop.  The failed processor core was not put into a guarded state and the hypervisor hung when it tried to use the failed core.
  • A problem was fixed for an oscillator error caused by a power line disturbance that logged a UE SRC B150CC62 with no FRU call outs.  The error SRC was changed from unrecoverable to informational as no service action is required.
  • A problem was fixed for the NEBS DC power supply showing up in the part inventories for the CEC as "IBM AC PS".  The description string has been changed to "IBM PS" as power supplies can be of DC or AC type.
  • A problem was fixed for the power supplies to add a monitor process for the second rotor in each power supply that was not being monitored.  This will improve fault isolation for power supply problems.  A fix for the second rotor in an earlier service pack release provided the monitor infrastructure but was missing the monitor process.
  • A problem was fixed for a FSI link heartbeat surveillance fault with SRC B1504813 logged that has no FRU call outs.  The FRU call outs have been added.
  • A problem was fixed with the Advanced System Management Interface (ASMI) VPD menu where the Generic External Connector (GC) FRU was displayed as an unknown FRU type.  The "Unknown" has been replaced with "Generic External Connector".
  • A problem was fixed for a system fan identify LED not being able to light after a Digital Power Systems Sweep (DPSS) chip failover.  The fan LED ownership was not transferred to the new primary DPSS chip, so it was unable to light the LED under fan fault conditions.
  • A problem was fixed for SRC B1104800 having duplicate FRU call outs for the PNOR flash FRU.
  • A problem was fixed to prevent the Advanced System Management Interface (ASMI) "System Service Aids/Factory Configuration" panel option from restoring to factory configuration for FSP or ALL if one boot side of the service processor is marked invalid.  The following informational message is issued:  "The request cannot be performed because a firmware boot side is marked invalid.  This state may have been caused by a previous firmware update failure."
  • A problem was fixed for an error log with SRC B150DA19, created on the backup service processor for a PSI link failure detected on the primary, not being visible in the error logs on the primary service processor.
  • A problem was fixed in the hardware server to prevent a UE B181BA07 abort when a host boot dump collection is in progress.
  • A problem was fixed for an LED fault with SRC B181A734 that occurred during a normal rebuild of the LED tables, resulting in the LED not being lit.  The problem has been fixed using retries for LEDs that are in a busy state.
  • A problem was fixed for a PSI link failure with SRC B1517212 that resulted in a service processor stop state.  The correct state for a system with broken PSI links is the terminate state so the problem can be resolved with a call home service event.
  • A problem was fixed to prevent false oscillator error logs of SRC B150CC62 for errors unrelated to clock failures.
  • A security problem was fixed in OpenSSL for padding-oracle attacks known as Padding Oracle On Downgraded Legacy Encryption (POODLE).  This attack allows a man-in-the-middle attacker to obtain a plain text version of the encrypted session data. The Common Vulnerabilities and Exposures issue number is CVE-2014-3566.  The service processor POODLE fix is implemented by disabling SSL protocol SSLv3 and requiring TLSv1.2 protocol on all secured connections.  The Hardware Management Console (HMC) also requires a POODLE fix for APAR MB03867(FIX FOR CVE-2014-3566 FOR HMC V8 R8.2.0 SP1 with PTF MH01455).  This HMC minimum requirement is enforced by the firmware update process for this defect.
  • A problem was fixed for firmware updates that caused the primary service processor to be guarded and SRC B152E6D0 and SRCs of form B181XXXX to be logged.
  • A problem was fixed for intermittent firmware database errors that logged a UE SRC of B1818611 and had a fwdbServer core dump.
  • A problem was fixed to enable the redundant Vital Product Data (VPD) SEEPROM for processors and voltage regulator modules (VRMs).  Previously, only the primary SEEPROM was programmed with the FRU data with no backup protection.
  • A problem was fixed for vague error text for SRC B1504922 for a bad SMP cable.  It was made more specific to state that an incorrect cable length was detected.
  • A problem was fixed for an intermittent reset/reload of the service processor during the early part of an IPL with SRC B1814616 logged.
  • A problem was fixed for hardware presence detection and local clock card (LCC) failover.  The system could not detect critical system hardware with the default LCC missing, causing an error when failing over to the backup LCC.
  • A problem was fixed for non-optimal voltage levels from the power supplies.  Having the power supply output voltages meet the exact specifications will help prevent stress-related hardware failures.
  • A problem was fixed for an error in the "Enlarged IO Capacity Slot Count" that caused more memory than expected to be consumed by the hypervisor.  If the "Enlarged IO Capacity Slot Count" was not a "1", it was wrongly changed to an "8" by the IPL process, increasing the amount of memory that needs to be reserved for I/O buffers.  Retain tip H213684 tells how to reduce the hypervisor memory consumption when this problem happens as the fix will not change the value automatically:
    With the system at the "Power Off" state, take the following actions to free up some memory from the hypervisor:
    - Log into ASMI and then select "System Configuration" menu    
    - Select  "I/O Adapter Enlarged Capacity" option                
    - Use the pulldown to select "1" as the new value for all nodes
    - After changing the value click on the "Save" setting. The change will be active on the next IPL of the system.
  • A problem was fixed for the PCIe reset line (PERST) to keep it active during the IPL until both system power and clocks are stable.  Keeping the PCIe devices in reset until the environment is stable prevents PCIe device lockup.
  • A problem was fixed to prevent a hypervisor task failure if multiple resource dumps running concurrently run out of dump buffer space.  The failed hypervisor task could prevent basic logical partition operations from working.
  • On systems using the Virtual I/O Server (VIOS) to share physical I/O resources among client logical partitions, a problem was fixed for memory relocation errors during page migrations for the virtual control blocks.  These errors caused a CEC termination with SRC B700F103.  The memory relocation could be part of the processing for the Dynamic Platform Optimizer (DPO), Active Memory Sharing (AMS) between partitions, mirrored memory defragmentation, or a concurrent FRU repair.
  • A problem was fixed that could result in unpredictable behavior if a memory UE is encountered while relocating the contents of a logical memory block during one of these operations:
    - Reducing the size of an Active Memory Sharing (AMS) pool.
    - On systems using mirrored memory, using the memory mirroring optimization tool.
    - Performing a Dynamic Platform Optimizer (DPO) operation.
  • A problem was fixed for PCIe link width faults on the  I/O expansion drawer (F/C #EMX0) to only log the SRC B7006A8B once for each FRU instead of having multiple SRCs and call outs for the same part.
  • A problem was fixed for a wrong state for the PCIe link LEDs (lit when link has failed) to the I/O expansion drawer with feature code #EMX0.  The fix ensures that the link operational LEDs are not lit when the link to the I/O drawer has failed.
  • A problem was fixed for an incorrect SRC of B7006A9F logged for I/O drawer VPD mismatch during an enclosure serial number update of the I/O drawer (F/C #EMX0).  The incorrect SRC was logged if the non-primary service path module (right bay) was in a failed state.
  • A problem was fixed for an SRC B7006A84 PCIe link down event not being reported as a failed link for the I/O expansion drawer (F/C #EMX0) in the PCIe topology status in the Advanced System Management Interface (ASMI) or on the management console.
  • A problem was fixed for the Live Partition Mobility (LPM) migration of virtual devices to a Power8 system to update each virtual device location code correctly to reflect the location code in the target system instead of the location code in the source system.  This problem prevented the management console from being able to look up AIX Object Data Manager (ODM) names for the virtual devices, so operations such as remove could not be performed on the device.
  • A problem was fixed for PCIe adapters requesting PCI I/O space that triggers a SRC BA1800007 error log.  This SRC should not have been logged since PCI I/O spaces are not supported by Power8 systems.  The SRC log is now suppressed.
  • A problem was fixed for a processor core unit being deconfigured but not guarded for a SRC B113E504 processor error in host boot with fault isolation register (FIR) code "RC_PMPROC_CHKSLW_NOT_IN_ETR" that caused the CEC to go to termination.  By guarding the failed processor core, the fix ensures the core is not used on the re-IPL of the CEC.
  • A security problem was fixed in OpenSSL for memory leaks that allowed remote attackers to cause a denial of service (out of memory on the service processor). The Common Vulnerabilities and Exposures issue numbers are CVE-2014-3513 and CVE-2014-3567.
  • A security problem in GNU Bash was fixed to prevent arbitrary commands hidden in environment variables from being run during the start of a Bash shell.  Although GNU Bash is not actively used on the service processor, it does exist in a library so it has been fixed.  This is IBM Product Security Incident Response Team (PSIRT) issue #2211.  The Common Vulnerabilities and Exposures issue numbers for this problem are CVE-2014-6271, CVE-2014-7169, CVE-2014-7186, and CVE-2014-7187.
  • A problem was fixed to add failure recovery in the early boot of the service processor so that the boot is retried on failure instead of the service processor going unresponsive with SRC B1817212 on the operations panel.
  • A problem was fixed for isolating and repairing DIMM memory failures at the byte level without affecting other ranks of memory.  This fix substantially reduces the FRU call outs of DIMMs for memory problems.
  • A security problem was fixed in OpenSSL where the service processor would, under certain conditions, accept Diffie-Hellman client certificates without the use of a private key, allowing a user to falsely authenticate.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0205.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could exhaust all available memory and cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0206.
  • A security problem was fixed in OpenSSL to prevent a denial of service when handling certain Datagram Transport Layer Security (DTLS) messages.  A specially crafted DTLS message could trigger a null pointer dereference and cause the service processor to reset.  The Common Vulnerabilities and Exposures issue number is CVE-2014-3571.
  • A security problem was fixed in OpenSSL to fix multiple flaws in the parsing of X.509 certificates.  These flaws could be used to modify an X.509 certificate to produce a certificate with a different fingerprint without invalidating its signature, and possibly bypass fingerprint-based blacklisting.  The Common Vulnerabilities and Exposures issue number is CVE-2014-8275.
  • A security vulnerability, commonly referred to as GHOST, was fixed in the service processor glibc functions gethostbyname() and gethostbyname2() that allowed remote users of the functions to cause a buffer overflow and execute arbitrary code with the permissions of the server application.  There is no way to exploit this vulnerability on the service processor but it has been fixed to remove the vulnerability from the firmware.  The Common Vulnerabilities and Exposures issue number is CVE-2015-0235.
  • A problem was fixed for an incorrect SRC logged for an unplugged cable to the PCIe I/O expansion drawer (F/C #EMX0).  A B7006A88 SRC was errantly logged that calls out the cable as bad hardware that needs to be replaced.  This is replaced with SRC B7006A82 that says a cable is unplugged to a PCIe FanOut module in the IO expansion drawer.
  • A problem was fixed for missing dump data for cores and L3 cache memory when there is core checkstop and deconfiguration of the core.
  • A problem was fixed for a false power supply fan failure with SRC 1100152F.  If the AC was interrupted to the power supply, the SRC 11001525 would have been logged for a bad fan with a call out of the power supply for replacement.
  • A problem was fixed for a partition deletion error on the management console with error code 0x4000E002 and message "...insufficient memory for PHYP".  The partition delete operation has been adjusted to accommodate the temporary increase in memory usage caused by memory fragmentation, allowing the delete operation to be successful.
  • A problem was fixed for disruptive firmware update to prevent false reference clock failures with SRC B1814805 and a hang in the IPL for the CEC.
  • A problem was fixed for a memory leak associated with the logging of SRC B1561311 for a bad voltage regulator module (VRM).
  • A problem was fixed for the processor module replacement process to prevent VPD corruption on the primary and redundant VPD chips on the new processor module.  This corruption resulted in the processor being unusable with HostBoot failing with unrecoverable errors (UEs) of SRCs BC8A090F and BC8A1701.

System firmware changes that affect certain systems

  • HIPER/Pervasive:Deferred:  On a system configured for a large number of PCIe adapters across multiple PCIe I/O expansion drawers (F/C #EMX0), a problem was fixed so that the PCIe adapters worked correctly in the system.  Previously, the PCIe interrupt servicing could deadlock, causing the PCIe adapter cards to become unresponsive.
  • For a system with Virtual Trusted Platform Module (VTPM) partitions, a problem was fixed for a management console error that occurred while restoring a backup profile that caused the system to go to the management console "Incomplete" state.  The failed system had a suspended VTPM partition and a B7000602 SRC logged.
  • For systems with IBM i partitions, a problem was fixed for the "5250 Application Capable" capability so it is passed to the IBM i partition as "True" if purchased.  Previously, the capability was not sent to the partition, which could cause extra performance to be missing for the "Fast Green Screen Performance" feature in IBM i.  There is a delay of up to 15 minutes after this fix is installed before it becomes active on the system.  If the updated capability property does not show up in the management console CEC properties as "True", this is caused by slowness in the refresh of the capability properties to the management console and is not a problem with the fix.  To resolve this issue with the capability not displaying correctly, rebuild the managed system on the management console and then wait up to one hour for the CEC property capability "5250 Application Capable" to be updated to "True".
  • On a system with a Linux partition, a problem was fixed for the Linux "lsslot" command so that it is able to find the F/C EC41 and EC42 PCIe 3D graphics adapter installed in the CEC, instead of showing the slot as "empty".  The Linux graphics adapter worked correctly even though it showed as "empty".
  • On systems with a PCIe 3D graphics adapter (F/C #EC41 or #EC42) in a partition, a problem was fixed for a partition hang or BA21xxxx error conditions during partition initialization.
  • A problem was fixed for certain workloads that caused the system to enter safe mode (mode for running at minimum processor frequencies) when the On-Chip Controllers (OCCs) did not get the Analog Power Subsystem Sweep (APSS) frequency control data within the OCC time out period.  The time out for an OCC update has been increased so the OCC can tolerate periods of high bus use that slow down the APSS communication.
  • On a system with redundant service processors, a problem was fixed for a bad pointer reference in the mailbox function during data synchronization between the two service processors.  The dereference of the bad pointer caused a core dump, reset/reload, and failover to the backup service processor.
SC820_051_047 / FW820.03

01/27/15
Impact: Serviceability         Severity:  SPE

System firmware changes that affect all systems

  • A problem was fixed in concurrent firmware update to prevent the secondary service processor from going to a failed state.
  • A problem was fixed for the power supply fans to monitor both rotors instead of one to prevent a failure in one rotor from shutting down the power supply.
  • A problem was fixed for firmware updates to reduce the number of informational B181A85E SRCs for an expected SQL lock condition during a database transaction.  Previously, several thousand B181A85E SRC entries were created for the error log, slowing performance of the service processor and flooding the error log.
  • A problem was fixed for reset/reload failures caused by excessive synchronization of thermal management data with the redundant service processor.
  • A problem was fixed for failovers to the secondary service processor failing with SRC B1818601 caused by a bad database object reference.

System firmware changes that affect certain systems

  • For a system with memory mirroring activated and a memory block size of 16 Megabytes, a problem was fixed for a system dump that caused Hypervisor Real Mode Offset (HRMO) data structure corruption in the physical memory map.  This problem could cause concurrent firmware update failures or subsequent system dumps to be corrupted.
SC820_048_047 / FW820.02

12/01/14
Impact:  New      Severity:  New

New Features and Functions
  • GA Level

4.0 How to Determine The Currently Installed Firmware Level

You can view the server's current firmware level on the Advanced System Management Interface (ASMI) Welcome pane. It appears in the top right corner. Example: SC810_123.


5.0 Downloading the Firmware Package

Follow the instructions on Fix Central. You must read and agree to the license agreement to obtain the firmware packages.

Note: If your HMC is not internet-connected, you will need to download the new firmware level to a USB flash memory device or FTP server.


6.0 Installing the Firmware

The method used to install new firmware will depend on the release level of firmware which is currently installed on your server. The release level can be determined by the prefix of the new firmware's filename.

Example: SCxxx_yyy_zzz

Where xxx = release level
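
To illustrate the naming pattern above, here is a minimal sketch that splits a firmware level name into its fields and compares the release level of an installed level with that of a new one. The names FirmwareLevel, parse_level, and same_release are hypothetical. Because this section defines only xxx, the yyy and zzz fields are kept as plain numbers without interpretation. The third field is treated as optional so that a short form such as the ASMI example SC810_123 also parses.

```python
import re
from typing import NamedTuple, Optional

class FirmwareLevel(NamedTuple):
    prefix: str           # two-letter prefix, for example "SC"
    release: int          # xxx = release level
    yyy: int              # second field, not interpreted here
    zzz: Optional[int]    # third field, absent in the short form

# Pattern for names such as SC820_075_047 or SC810_123
_LEVEL_RE = re.compile(r"^([A-Z]{2})(\d{3})_(\d{3})(?:_(\d{3}))?$")

def parse_level(name):
    """Split a firmware level name into its fields, or raise ValueError."""
    match = _LEVEL_RE.match(name.strip())
    if match is None:
        raise ValueError("unrecognized firmware level name: %r" % name)
    prefix, xxx, yyy, zzz = match.groups()
    return FirmwareLevel(prefix, int(xxx), int(yyy),
                         int(zzz) if zzz is not None else None)

def same_release(installed, new):
    """Return True when both level names carry the same release level."""
    return parse_level(installed).release == parse_level(new).release
```

For example, same_release("SC820_051_047", "SC820_075_047") compares release level 820 with 820, while an installed SC810 level would differ from a new SC820 level. Which installation method applies in each case is described in the instructions referenced below.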

Instructions for installing firmware updates and upgrades can be found at http://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8ha1/updupdates.htm

IBM i Systems:

For information concerning IBM i Systems, go to the following URL to access Fix Central: 
http://www-933.ibm.com/support/fixcentral/

Choose "Select product", under Product Group specify "System i", under Product specify "IBM i", then Continue and specify the desired firmware PTF accordingly.

7.0 Firmware History

The complete Firmware Fix History for this Release Level can be reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html