Power8 System Firmware

Applies to:   9119-MHE and 9119-MME.

This document provides information about the installation of Licensed Machine or Licensed Internal Code, which is sometimes referred to generically as microcode or firmware.


1.0 Systems Affected

This package provides firmware for Power System E880 (9119-MHE) and Power System E870 (9119-MME) servers only.

The firmware level in this package is:  SC830_097 / FW830.30

1.1 Minimum HMC Code Level

This section is intended to describe the "Minimum HMC Code Level" required by the System Firmware to complete the firmware installation process. When installing the System Firmware, the HMC level must be equal to or higher than the "Minimum HMC Code Level" before starting the system firmware update.  If the HMC managing the server targeted for the System Firmware update is running a code level lower than the "Minimum HMC Code Level" the firmware update will not proceed.

The Minimum HMC Code level for this firmware is:  HMC V8 R8.3.0 (PTF MH01513) with Mandatory efix (PTF MH01514).

Although the Minimum HMC Code level for this firmware is listed above,  HMC V8 R8.3.0 Service Pack 2 (PTF MH01584) with ifix (PTF MH01638) or higher is recommended.
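The level comparison described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the HMC level is written in the "V8 R8.3.0" form used in this document, the function names are hypothetical, and it does not check for the required PTFs (MH01513 and MH01514), which must be verified separately.

```python
import re
from typing import Tuple

# Minimum HMC code level stated in this document (PTFs are not modeled here).
MINIMUM_HMC_LEVEL = "V8 R8.3.0"


def parse_hmc_level(level: str) -> Tuple[int, ...]:
    """Turn a string such as 'V8 R8.3.0' into a comparable tuple (8, 8, 3, 0)."""
    match = re.fullmatch(r"V(\d+)\s+R(\d+)\.(\d+)\.(\d+)", level.strip())
    if match is None:
        raise ValueError(f"unrecognized HMC level: {level!r}")
    return tuple(int(part) for part in match.groups())


def meets_minimum_hmc_level(level: str, minimum: str = MINIMUM_HMC_LEVEL) -> bool:
    """Return True when the HMC level is equal to or higher than the minimum."""
    return parse_hmc_level(level) >= parse_hmc_level(minimum)
```

A result of False corresponds to the case in which the firmware update will not proceed.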

For information concerning HMC releases and the latest PTFs,  go to the following URL to access Fix Central:
http://www-933.ibm.com/support/fixcentral/

For specific fix level information on key components of IBM Power Systems running the AIX, IBM i and Linux operating systems, we suggest using the Fix Level Recommendation Tool (FLRT):
http://www14.software.ibm.com/webapp/set2/flrt/home

NOTES:
                - You must be logged in as hscroot for the firmware installation to complete correctly.
                - Systems Director Management Console (SDMC) does not support this System Firmware level.

1.2 AIX iFix Required

For IBM Power System servers with the PCIe 2-port Async EIA-232 Adapter installed on AIX partitions, an AIX fix resolving the async port interrupt handling (APAR IV77596) must be installed before updating to the SC830_068 (FW830.10) or later level of firmware.  The ports on the adapter (feature code EN27/EN28, CCIN 57D4) may become unusable with the installation of that firmware level due to an issue with how interrupts are handled.  Many JAS_RTS entries are written to the error log due to this issue.

Prior to this APAR shipping in a future Service Pack, AIX intends to publish ifixes for the latest Service Packs on all active Technology Levels on our ftp server, in ftp://aix.software.ibm.com/aix/ifixes/iv77596/ on or before Oct 13, 2015.  If you need an ifix other than the ones on this server, contact IBM support to request one for your specific situation.

The procedure is intended to be performed by the customer.  If you have questions or concerns about the procedure, contact IBM Support:
US Support: 1.800.IBM.SERV
WW Support (select your country):  http://www.ibm.com/planetwide/

2.0 Important Information

Downgrading firmware from any given release level to an earlier release level is not recommended.

If you feel that it is necessary to downgrade the firmware on your system to an earlier release level, please contact your next level of support.

2.1 IPv6 Support and Limitations

IPv6 (Internet Protocol version 6) is supported in the System Management Services (SMS) in this level of system firmware. There are several limitations that should be considered.

When configuring a network interface card (NIC) for remote IPL, only the most recently configured protocol (IPv4 or IPv6) is retained. For example, if the network interface card was previously configured with IPv4 information and is now being configured with IPv6 information, the IPv4 configuration information is discarded.

A single network interface card may only be chosen once for the boot device list. In other words, the interface cannot be configured for the IPv6 protocol and for the IPv4 protocol at the same time.

2.2 Concurrent Firmware Updates

Concurrent system firmware update is supported on HMC Managed Systems only.

2.3 Memory Considerations for Firmware Upgrades

Firmware Release Level upgrades and Service Pack updates may consume additional system memory.
Server firmware requires memory to support the logical partitions on the server. The amount of memory required by the server firmware varies according to several factors.
Factors influencing server firmware memory requirements include the following:
Generally, you can estimate the amount of memory required by server firmware to be approximately 8% of the system installed memory. The actual amount required will generally be less than 8%. However, there are some server models that require an absolute minimum amount of memory for server firmware, regardless of the previously mentioned considerations.
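The 8% rule of thumb above can be expressed as a small calculation. This Python sketch is illustrative only: the function name is hypothetical, the result is an upper-end planning estimate rather than a measured value, and it does not model the absolute minimum that some server models require.

```python
def estimate_firmware_memory_gb(installed_memory_gb: float,
                                fraction: float = 0.08) -> float:
    """Estimate server firmware memory as a fraction (default 8%) of installed memory."""
    if installed_memory_gb < 0:
        raise ValueError("installed memory cannot be negative")
    return installed_memory_gb * fraction


if __name__ == "__main__":
    # Example: a system with 1024 GB of installed memory.
    print(f"{estimate_firmware_memory_gb(1024):.2f} GB")
```

Because the actual amount required is generally less than 8%, the estimate is best treated as a conservative figure when sizing partitions.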

Additional information can be found at:
http://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8hat/p8hat_lparmemory.htm


3.0 Firmware Information

Use the following examples as a reference to determine whether your installation will be concurrent or disruptive.

For systems that are not managed by an HMC, the installation of system firmware is always disruptive.

Note: The concurrent levels of system firmware may, on occasion, contain fixes that are known as Deferred and/or Partition-Deferred. Deferred fixes can be installed concurrently, but will not be activated until the next IPL. Partition-Deferred fixes can be installed concurrently, but will not be activated until a partition reactivate is performed. Deferred and/or Partition-Deferred fixes, if any, will be identified in the "Firmware Update Descriptions" table of this document. For these types of fixes (Deferred and/or Partition-Deferred) within a service pack, only the fixes in the service pack which cannot be concurrently activated are deferred.

Note: The file names and service pack levels used in the following examples are for clarification only, and are not necessarily levels that have been, or will be released.

System firmware file naming convention:

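01SCxxx_yyy_zzz

where xxx is the release level, yyy is the service pack level, and zzz is the last disruptive service pack level.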

NOTE: Values of service pack and last disruptive service pack level (yyy and zzz) are only unique within a release level (xxx). For example, 01SC830_040_040 and 01SC840_040_045 are different service packs.

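An installation is disruptive if:

The release levels (xxx) are different.

            Example: Currently installed release is 01SC830_040_040, new release is 01SC840_050_050.

The service pack level (yyy) and the last disruptive service pack level (zzz) of the service pack to be installed are the same.

            Example: SC830_040_040 is disruptive, no matter what level of SC830 is currently installed on the system.

The service pack level (yyy) currently installed on the system is lower than the last disruptive service pack level (zzz) of the service pack to be installed.

            Example: Currently installed service pack is SC830_040_040 and new service pack is SC830_050_045.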

An installation is concurrent if:

The release level (xxx) is the same, and
The service pack level (yyy) currently installed on the system is the same or higher than the last disruptive service pack level (zzz) of the service pack to be installed.

Example: Currently installed service pack is SC830_040_040, new service pack is SC830_071_040.
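The rules and examples above can be sketched as a small classifier. This Python sketch is illustrative only: the function and type names are hypothetical, it covers only the SC naming pattern shown in this section, and it applies only to HMC-managed systems (installations on systems not managed by an HMC are always disruptive).

```python
import re
from typing import NamedTuple


class FirmwareLevel(NamedTuple):
    release: int          # xxx
    service_pack: int     # yyy
    last_disruptive: int  # zzz


# Accepts names such as 01SC830_040_040, SC830_040_040, or 01SC830_097_048.rpm.
_NAME = re.compile(r"^(?:01)?SC(\d{3})_(\d{3})_(\d{3})(?:\.rpm)?$")


def parse_level(name: str) -> FirmwareLevel:
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware name: {name!r}")
    return FirmwareLevel(*(int(group) for group in match.groups()))


def is_disruptive(installed: str, new: str) -> bool:
    """Return True if installing `new` over `installed` is disruptive."""
    current = parse_level(installed)
    target = parse_level(new)
    if current.release != target.release:
        return True  # different release level (xxx)
    if target.service_pack == target.last_disruptive:
        return True  # the new service pack is itself a disruptive level
    # Concurrent only when the installed yyy is at or above the new zzz.
    return current.service_pack < target.last_disruptive
```

For the examples in this section, is_disruptive("SC830_040_040", "SC830_050_045") returns True and is_disruptive("SC830_040_040", "SC830_071_040") returns False.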

3.1 Firmware Information and Description

 
Filename               Size        Checksum
01SC830_097_048.rpm    77972144    23669

Note: The Checksum can be found by running the AIX sum command against the rpm file (only the first 5 digits are listed).
For example: sum 01SC830_097_048.rpm
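Where the AIX sum command is not available, the check can be approximated in Python. This sketch is illustrative only and rests on an assumption: that the published value comes from the 16-bit byte-by-byte rotating checksum (the BSD-style algorithm). If the sum command on your system uses a different algorithm, the values will not match. The function names are hypothetical.

```python
def bsd_sum_update(checksum: int, data: bytes) -> int:
    """Fold `data` into a 16-bit BSD-style rotating checksum."""
    for byte in data:
        # Rotate the 16-bit value right by one bit, then add the next byte.
        checksum = (checksum >> 1) | ((checksum & 1) << 15)
        checksum = (checksum + byte) & 0xFFFF
    return checksum


def bsd_sum_file(path: str, chunk_size: int = 1 << 20) -> int:
    """Compute the checksum of a file, reading it in chunks."""
    checksum = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            checksum = bsd_sum_update(checksum, chunk)
    return checksum


def matches_published(path: str, published: str) -> bool:
    """Compare the zero-padded 5-digit checksum with the published value."""
    return f"{bsd_sum_file(path):05d}" == published
```

With the table above, the call would be matches_published("01SC830_097_048.rpm", "23669"). The file size (77972144 bytes) is a useful additional check.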

SC830
For Impact, Severity, and other firmware definitions, refer to the 'Glossary of firmware terms' at the following URL:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs

The complete Firmware Fix History for this Release Level can be reviewed at the following url:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html

SC830_097_048 / FW830.30

08/24/16
Impact: Availability    Severity: SPE

New features and functions

  • The certificate store on the service processor has been upgraded to include the changes contained in version 2.6 of the CA certificate list published by the Mozilla Foundation at the mozilla.org website as part of the Network Security Services (NSS) version 3.21.
  • Support was added to the Advanced System Management Interface (ASMI) for the Intelligent Platform Management Interface (IPMI) to be able to change the IPMI password.  On the "Login Profile/Change Password" menu, a user ID of "IPMI" can be selected.  Changing the password for IPMI changes the password for the default IPMI user ID.  IPMI is not a user ID for logging into ASMI.  The IPMI function on the service processor can be accessed using the "ipmitool" tool from a client system that has a network connection to the service processor.
  • Support was added to protect the service processor from booting on a level of firmware that is below the minimum MIF level.  If this is detected, a SRC B18130A0 is logged.  A disruptive firmware update would then need to be done to the minimum firmware level or higher.  This new support has no effect on the system being updated with the service pack but has been put in place to provide an enhanced firmware level for the IBM field stock service processors.
  • Support was added for the Stevens6+ option of the internal tray loading DVD-ROM drive with F/C #EU13.  This is an 8X/24X(max) Slimline SATA DVD-ROM Drive.  The Stevens6+ option is a FRU hardware replacement for the Stevens3+.  MTM 7226-1U3 (Oliver)  FC 5757/5762/5763 attaches to IBM Power Systems and lists Stevens6+ as optional for Stevens3+.  If the Stevens6+  DVD drive is installed on the system without the required firmware support, the boot of an AIX partition will fail when the DVD is used as the load source.  Also, an IBM i partition cannot consistently boot from the DVD drive using D-mode IPL.  A SRC C2004130 may be logged for the load source not found error.

System firmware changes that affect all systems

  • DEFERRED:  A performance improvement was made by disabling the Hot/Cold Affinity (HCA) hardware feature which gathers memory usage statistics for consumption by partition operating system memory management algorithms.  The statistics gathering can, in rare cases, cause performance to degrade.  The workloads that may experience issues are memory-intensive workloads that have little locality of reference and thus cannot take advantage of hardware memory cache.  As a consequence, the problem occurs very infrequently or not at all except for very specific workloads in a HPC environment.  This performance fix requires an IPL of the system to activate it after it is applied.
  • A problem was fixed for the service processor going to the reset state instead of the termination state when the anchor card is missing or broken.  At the termination state, the Advanced System Management Interface (ASMI) can be used to collect failure data and debug the problem with the anchor card.
  • A problem was fixed for error log entries created by Hostboot not getting written to the error log in some situations.  This can cause hardware detected as failed by Hostboot to not get reported or have a call-home generated.  This problem will occur whenever Hostboot commits a recovered or informational error as its last error log in the current IPL.  In the next IPL,  one or more error logs from Hostboot will be lost.
  • A problem was fixed for the Hardware Management Console (HMC) "chpwrmgmt" command not providing a meaningful error message when used to try to enable an invalid power saver mode of "dynamic_favor_power" on the 9119-MME or 9119-MHE models.  This power saver mode is not available on these models but the error message issued was "HSCL1400 An error has occurred during the operation to the managed system. Try the task again."  The following is the corrected error message:  "HSCL1402 This operation failed due to the following reasons: HSCL02F3 The managed system does not support the specified power saver mode."
  • A problem was fixed for the health monitoring of the NVRAM and DRAM in the service processor, which had been disabled.  The monitoring has been re-established, and early warnings of service processor memory failure are logged with one of the following Predictive Error SRCs:  B151F107, B151F109, B151F10A, or B151F10D.
  • A  problem was fixed for an incorrect date in partitions created with a Simplified Remote Restart-Capable (SRR) attribute where the date is created as Epoch 01/01/1970 (MM/DD/YYYY).  Without the fix, the user must change the partition time of day when starting the partition for the first time to make it correct.  This problem only occurs with SRR partitions.
  • A problem was fixed for hypervisor task failures in adjunct partitions with a SRC B7000602 reported in the error log.  These failures occur during adjunct partition reboots for concurrent firmware updates but are extremely rare and require a re-IPL of the system to recover from the task failure.  The adjunct partitions may be associated with the VIOS or I/O virtualization for the physical adapters such as done for SR-IOV.
  • A problem was fixed for a shortened "Grace Period" for "Out of Compliance" users of a Power Enterprise Pool (PEP).   The "Grace Period" is short by one hour, so the user has one less hour to resolve compliance issues before the HMC disallows any more borrowing of PEP resources.  For example, if the "Grace Period" should have been 48 hours as shown in the "Out of Compliance" message, it really is 47 hours in the hypervisor firmware.  The borrowing of PEP resources is not a common usage scenario.  It is most often found in Live Partition Mobility (LPM) migrations where PEP resources are borrowed from the source server and loaned to the target server.
  • A problem was fixed for an AIX or Linux partition failing with a SRC B2008105 LP 00005 on a re-IPL after a dump (firmware assisted or error generated dump) following a Live Partition Mobility (LPM) migration operation.  The problem does not occur if the migrated partition completes a normal IPL after the migration.
  • A problem was fixed for intermittent long delays in the NX co-processor for asynchronous requests such as NX 842 compressions.  This problem was observed for AIX DB2 when it was doing hardware-accelerated compressions of data but could occur on any asynchronous request to the NX co-processor.
  • A problem was fixed for transmit time-outs on a Virtual Function (VF) during stressful network traffic, on systems using PCIe adapters in Single Root I/O Virtualization (SR-IOV) shared-mode.  This fix updates adapter firmware to 10.2.252.1918, for the following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and EL3C.
    The SR-IOV adapter firmware level update for the shared-mode adapters happens under user control to prevent unexpected temporary outages on the adapters.  A system reboot will update all SR-IOV shared-mode adapters with the new firmware level.  In addition, when an adapter is first set to SR-IOV shared mode, the adapter firmware is updated to the latest level available with the system firmware (and it is also updated automatically during maintenance operations, such as when the adapter is stopped or replaced).  And lastly, selective manual updates of the SR-IOV adapters can be performed using the Hardware Management Console (HMC).  To selectively update the adapter firmware, follow the steps given at the IBM Knowledge Center for using HMC to make the updates:   https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
    Note:  Adapters that are capable of running in SR-IOV mode, but are currently running in dedicated mode and assigned to a partition, can only be updated concurrently by the OS that owns the adapter.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during SSL certificate management. The Common Vulnerabilities and Exposures issue number is CVE-2016-0797.
  • A problem was fixed for missing dumps for  service processor failures during firmware updates.
  • A problem was fixed for a service processor failure during a system power off that causes a reset of the service processor.  The service processor is in the correct state for a normal system power on after the error.  The frequency for this error should be low as it is caused by a very rare race condition in the power off process.
  • A problem was fixed for a processor hang where the error recovery was not guarding the failing processor.  The failure causes a SRC B111E540 to be logged with Signature Description of " ex(n0p3c1) (COREFIR[55]) NEST_HANG_DETECT: External Hang detected".  With the fix, the failing processor FRU is called out and guarded so that the error does not re-occur when the system is re-IPLed.
  • A problem was fixed for a sequence of two or more Live Partition Mobility migrations that caused a partition to crash with a SRC BA330000 logged (Memory allocation error in partition firmware).  The sequence of LPM migrations that can trigger the partition crash is as follows:
    The original source partition level can be any FW760.xx, FW763.xx, FW770.xx, FW773.xx, FW780.xx, or FW783.xx P7 level or any FW810.xx, FW820.xx, FW830.xx, or FW840.xx P8 level.  It is migrated first to a system running one of the following levels:
    1) FW730.70 or later 730 firmware or
    2) FW740.60 or later 740 firmware
    And then a second migration is needed to a system running one of the following levels:
    1) FW760.00 - FW760.20 or
    2) FW770.00 - FW770.10
    The twice-migrated system partition is now susceptible to the BA330000 partition crash during normal operations until the partition is rebooted.  If an additional LPM migration is done to any firmware level, the thrice-migrated partition is also susceptible to the partition crash until it is rebooted.
    With the fix applied, the susceptible partitions may still log multiple BA330000 errors but there will be no partition crash.  A reboot of the partition will stop the logging of the BA330000 SRC.
  • A problem was fixed for the Advanced System Management Interface "Network Services/Network Configuration" "Reset Network Configuration" button that was not resetting the static routes to the default factory setting.  The manufacturing default is to have no static routes defined so the fix clears any static routes that had been added.  A circumvention to the problem is to use the ASMI "Network Services/Network Configuration/Static Route Configuration" "Delete" button before resetting the network configuration.
  • A problem was fixed for a partial callout for a failed SPIVID (Serial Peripheral Interface Voltage Identification) interface on the power supply VRM (Voltage Regulator Module).  The SPIVID interface allows the processor to control its external voltage supply level, but if it fails, only the processor FRU (SCM) is called out and not the VRM.
    The system IPL will complete with a CEC drawer deconfigured.  The error log will only contain the processor but not the defective processor VRM.  Hostboot does not detect a SPIVID error, but fails on a SCOM operation to the processor chip.  The errors show up with SRC BCxx090F logged by Hostboot and word 7 containing  one of three error values for a SPIVID_SLAVE_PART callout:
    1) RC_SBE_SET_VID_TIMEOUT = 0x005ec1b2
    2) RC_SBE_SPIVID_STATUS_ERROR = 0x00902aac
    3) RC_SBE_SPIVID_WRITE_RETURN_STATUS_ERROR = 0x0045d3cd with HWP Error description : "Procedure: proc_sbe_setup_evid SPIVID Device did not return good status the Boot Voltage Write operation" and HWSV RC of BA24.
    Without the fix, replace both the identified SCM and the associated VRM.
  • A problem was fixed for the HMC Exchange FRU procedure for DVD drive with MTM 7226-1U3 and feature codes 5757/5762/5763 where it did not verify the DVD drive was plugged in at the end of the exchange procedure.  Without the fix,  the user must manually verify that the DVD drive is plugged in.
  • A problem was fixed for the Advanced System Management Interface (ASMI) incorrectly showing the Anchor card as guarded whenever any redundant VPD chip is guarded.
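The Live Partition Mobility sequence described in the BA330000 item above can be sketched as a check on a partition's migration history. This Python sketch is illustrative only: the function names are hypothetical, firmware levels are assumed to be written in the "FWxxx.yy" form used in this document, and the check does not track whether the partition has been rebooted since the migrations (a reboot clears the exposure).

```python
import re
from typing import Tuple

# Release levels named in this document as possible original source levels.
SOURCE_RELEASES = {760, 763, 770, 773, 780, 783, 810, 820, 830, 840}


def parse_fw(level: str) -> Tuple[int, int]:
    """Turn 'FW760.20' into (760, 20)."""
    match = re.fullmatch(r"FW(\d{3})\.(\d{2})", level.strip())
    if match is None:
        raise ValueError(f"unrecognized firmware level: {level!r}")
    return int(match.group(1)), int(match.group(2))


def is_first_hop(level: str) -> bool:
    """FW730.70 or later 730 firmware, or FW740.60 or later 740 firmware."""
    release, service_pack = parse_fw(level)
    return (release == 730 and service_pack >= 70) or \
           (release == 740 and service_pack >= 60)


def is_second_hop(level: str) -> bool:
    """FW760.00 - FW760.20, or FW770.00 - FW770.10."""
    release, service_pack = parse_fw(level)
    return (release == 760 and service_pack <= 20) or \
           (release == 770 and service_pack <= 10)


def is_susceptible(source: str, first_target: str, second_target: str) -> bool:
    """Return True if the two-migration sequence matches the exposed pattern."""
    return (parse_fw(source)[0] in SOURCE_RELEASES
            and is_first_hop(first_target)
            and is_second_hop(second_target))
```

For example, is_susceptible("FW830.30", "FW730.80", "FW760.10") returns True, while a first migration to FW730.60 does not match the pattern.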

System firmware changes that affect certain systems

  • A problem was fixed for the service processor recovery from intermittent MAX31760 fan controller faults logged with SRC B1504804.  The fan controller faults caused an out of memory condition on the service processor, forcing it to reset and failover to the backup service processor with SRCs B181720D, B181E6E9,  and B182951C logged.  With the fix, the fan controller faults are handled without memory loss and the only SRC logged is B1504804 for each fan controller fault.
  • On systems with a PowerVM Active Memory Sharing (AMS) partition with AIX  Level 7.2.0.0 or later with Firmware Assisted Dump enabled, a problem was fixed for a Restart Dump operation failing into KDB mode.  If "q" is entered to exit from KDB mode, the partition fails to start.  The AIX partition must be powered off and back on to recover.  The problem can be circumvented by disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
  • For a system partition with more than 64 cores, a problem was fixed for Live Partition Mobility (LPM)  migration operations failing with HSCL365C.  The partition migration is stopped because the platform detects a firmware error anytime the partition has more than 64 cores.

SC830_093_048 / FW830.22

06/28/16
Impact: Availability    Severity: SPE

Critical firmware update for FW830.21 (SC830_092) level systems

System IPLed with FW830.21:  A critical firmware update is required for all 9119-MME and 9119-MHE systems that have been IPLed with FW830.21 (SC830_092). The FW830.21 level can cause a failed IPL or a potential unplanned outage. If the server is already in production, the customer should plan an outage at a convenient time to apply FW830.22 (SC830_093) or higher and IPL.

System had FW830.21 concurrently applied:  If firmware level FW830.21 was concurrently installed (i.e., the system was NOT IPLed after installing the level), customers are not impacted by this issue provided they apply FW830.22 (SC830_093) or higher prior to the next planned system reboot. NOTE: FW830.22 can be applied concurrently.

System IPLed with any other version of Firmware:  If the current firmware level of the system is not FW830.21, the system is not exposed to this issue. Customers can install this level or later at the next scheduled update window.

To verify the firmware level installed on the server, select “Updates” from the left side of the HMC and place a check mark on the server of interest. Then select “View system information” from the bottom view, select “None - Display current values”. The Platform IPL Level will indicate the last level the system was booted on.
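The three cases above can be summarized as a decision on two values: the Platform IPL Level and the currently installed level. This Python sketch is illustrative only: the names are hypothetical, and it assumes both levels are available as strings that begin with the "SC830_092" form used in this document.

```python
from enum import Enum

AFFECTED_LEVEL = "SC830_092"  # FW830.21


class Exposure(Enum):
    CRITICAL = "IPLed on FW830.21: plan an outage, apply FW830.22 or higher, and IPL"
    APPLY_BEFORE_REBOOT = "FW830.21 applied concurrently: apply FW830.22 or higher before the next reboot"
    NOT_EXPOSED = "not exposed: install this level or later at the next update window"


def classify_fw830_21_exposure(platform_ipl_level: str,
                               installed_level: str) -> Exposure:
    """Map the Platform IPL Level and installed level to the cases in this section."""
    if platform_ipl_level.startswith(AFFECTED_LEVEL):
        return Exposure.CRITICAL
    if installed_level.startswith(AFFECTED_LEVEL):
        return Exposure.APPLY_BEFORE_REBOOT
    return Exposure.NOT_EXPOSED
```

The Platform IPL Level is checked first because it indicates the last level the system was booted on, which is what determines whether the failure can occur.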

System firmware changes that affect all systems

  • A problem was fixed for an intermittent failure in Hostboot during the system IPL resulting in SRCs BC70090F and BC8A1701 logged with a hardware procedure return code of "RC_PROC_BUILD_SMP_ADU_STATUS_MISMATCH".  The system terminates with a Terminate Immediate (TI) condition.  The system must be re-IPLed to recover.  The failure is very infrequent and was caused by a race condition introduced as part of the clock card failure data collection procedure, which has now been corrected.

SC830_092_048 / FW830.21

06/01/16
Impact: Availability    Severity: SPE

System firmware changes that affect all systems

  • Support was added for additional First Failure Data Capture (FFDC) data for processor clock failover errors by creating daily clock status reports with SRC B150CCDA informational error logs.  This clock status SRC log is written into the Hardware Management Console (HMC) iqyylog.log as a platform error log (PEL) event.  The PEL event contains a dump of the clock registers.  If a processor clock failover with SRC B158CC62 occurs on the service processor, the iqyylog.log file on the HMC should be collected to help debug the clock problem using the B150CCDA data.
  • A problem was fixed for a missing error log when a clock card fails over to the backup clock card.  This problem causes loss of redundancy on the clock cards without a callout notification that there is a problem with the FRU.  If the fix is applied to a system that had a failed clock, that condition will not be known until the system is IPLed again, when an error log and callout of the clock card will occur if it is in a persisted failed state.
  • On systems using PowerVM firmware with dedicated processor partitions,  a problem was fixed for the dedicated processor partition becoming intermittently unresponsive. The problem can be circumvented by changing the partition to use shared processors.  This is a follow-on to the fix provided in 830.20 for a different issue for delays in dedicated processor partitions that were caused by low I/O utilization.
  • A problem was fixed for a secondary clock card (CCIN 6B49) failure on the system control unit (SCU) being called out as a local clock card (CCIN 6B2D) failure on the node with SRC B158E504.  For this failure to occur, the primary clock card on the SCU must have been previously failed and guarded.

SC830_086_048 / FW830.20

04/01/16
Impact: Availability    Severity: SPE

New features and functions

  • Support was added to the Advanced System Management Interface (ASMI) to be able to add an IPv4 static route definition for each ethernet interface on the service processor.  Using a static route definition,  a Hardware Management Console (HMC) configured on a private subnet that is different from the service processor subnet is now able to connect to the service processor and manage the CEC.  A static route persists until it is deleted or until the service processor settings are restored to manufacturing defaults.  The static route is managed with the ASMI panel "Network Services/Network Configuration/Static Route Configuration" IPv4 radio button.  The "Add" button is used to add a static route (only one is allowed for each ethernet interface) and the "Delete" button is used to delete the static route.
  • Support was added to the Advanced System Management Interface (ASMI) to display the environmental info section of error logs in the "System Service Aids-> Error->Event logs" panel.  The following is an example of the information displayed:
    |------------------------------------------------------
    |                              Environmental Info      
    |------------------------------------------------------
    | Section Version          : 1                         
    | Sub-section type         : 0                        
    | Created by               : powr                                   
    | Genesis Record Time-Stamp: 03/12/2015 15:31:21
    | Genesis Corr-Resistance  : 4.687847
    | Genesis Ambient-Temp(C)  : 28.000000
    | Genesis Corrosion-Rate   : 0           
    |                                                       
    | Corrosion Rate Status    : 1             
    | Presence of UsrDataSec   : 1
    | Num Corrosion Readings   : 1        
    |                                                      
    | Daily Corr-Resistance    : 4.804206          
    | Daily Ambient-Tempr(C)   : 35.312500      
    | Daily Corrosion-Rate     : 12C                  
    |------------------------------------------------------
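If the environmental info section is copied out of ASMI as text, its fields can be collected into a dictionary for comparison over time. This Python sketch is illustrative only: the function name is hypothetical, and it assumes the "| name : value" layout shown in the example above.

```python
from typing import Dict


def parse_environmental_info(text: str) -> Dict[str, str]:
    """Collect 'name : value' pairs from an environmental info section."""
    fields: Dict[str, str] = {}
    for line in text.splitlines():
        body = line.strip().lstrip("|").strip()
        # Skip blank lines, dashed rulers, and the title line (no colon).
        if not body or set(body) == {"-"} or ":" not in body:
            continue
        # Split on the first colon only, so time stamps keep their colons.
        name, _, value = body.partition(":")
        fields[name.strip()] = value.strip()
    return fields
```

All values are kept as strings because some fields, such as the daily corrosion rate in the example, are not plain numbers.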

System firmware changes that affect all systems

  • A problem was fixed for a power fault on a single node with SRC 11002610 that terminates the multi-node system.  The problem can be circumvented by unplugging the failing node and the system will IPL.  With the fix, the failing node is guarded on the power fault and the rest of the system is able to IPL.
  • A problem was fixed for Advanced System Management Interface (ASMI) TTY to allow "admin" passwords to be greater than eight characters in length to be consistent with prior generations of the product.  The ASMI web interface works correctly for user "admin" passwords with no truncation in the length of the passwords.
  • A problem was fixed for the recovery of a failing PCI clock so that a failover to the backup PCI clock occurs without a node failing and being deconfigured.  Without the fix, the PCI clock does not behave as a redundant FRU and faults on it will cause the CEC to terminate.  A re-IPL of the CEC recovers it from the PCI clock error with the bad clock guarded so that the other PCI clock is used.
  • A problem was fixed for an intermittent IPL failure with SRC B181E6C7 for a deadlock condition when testing the clocks during the IPL.  The problem state can be recovered by doing another IPL.  The problem is triggered by an error in the IPL clock test causing an interrupt handler to switch to the redundant clock and deadlock.  With the fix, the clock fault is handled and the bad clock is guarded, with the IPL completing on the redundant clock.
  • A problem was fixed for a system IPL hang at C100C1B0 with SRC 1100D001 when the power supplies have failed to supply the necessary 12-volt output for the system.   The 1100D001 SRC was calling out the planar when it should have called out the power supplies.  With the fix, the system will terminate as needed and call out the power supply for replacement.  One mode of power supply failure that could trigger the hang is sync-FET failures that disrupt the 12-volt output.
  • A problem was fixed for recovery from PNOR flash memory corruption that causes the IPL to fail with SRC D143900C.  This is very rare and only has happened in IBM internal labs.  Without the fix, the service processor cannot correct the corruption in the PNOR.  If a system has the problem SRC and cannot IPL,  then that system must be disruptively firmware updated to apply the fix to be able to IPL again.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) not getting all error logs reported when its error log queue is full.  In the case where the error log queue is full with 16 entries, only one entry is returned to the hypervisor for reporting.  This error log truncation only occurs during periods of high error activity in the expansion drawer.
  • A problem was fixed for recovering from a misplug of the service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the plug locations are reversed from what would be a proper connection.  Without the fix, the bad FSI connections cause the service processors to go to the service processor stop state.  With the fix applied, the error logs call out the bad cables so they can be repaired and the service processor remains in a working state.
  • A problem was fixed for hardware system dump collection after a hardware checkstop that was missing scan ring data.  This is a very infrequent problem caused by an error with timing in the multi-threaded dump collection process.  Until this fix is applied, the debug of some hardware dump problems may require doing multiple dump collections to get all the data.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error that occurred when trying to display detail on a deconfigured Anchor Card VPD.  If the error log for the selected deconfiguration record had been deleted, it caused ASMI to core dump.  With the fix,  if the error log for the deconfiguration record is missing, the error log details such as failing SRC for the deconfiguration record are returned as blank.
  • A problem was fixed for an On-Chip Controller (OCC) error with SRC B1702AC4 that was logged as unrecoverable without hardware callouts.  This occurred when the slave OCC failed to receive any Analog Power Subsystem Sweep (APSS) data over a long time interval.  With the fix, if the OCC fails in the same manner, the error is predictive with hardware callouts in the error log.
  • A problem was fixed in the Advanced System Management Interface (ASMI) for a FRU exchange of a DVD where the DVD was not being powered off as needed for the exchange.  The missing power off of the FRU could cause a data read or write error if the DVD is in use when the DVD is removed.  With the fix, the ASMI deactivate DVD button turns off the DVD green power LED during the exchange procedure, so it is known when it is safe to continue with the exchange procedure steps and remove the DVD.
  • A problem was fixed so that error logs are now generated for thermal errors detected by the service processor.  Without the fix, thermal errors such as a temperature over the threshold will not get reported in the error log but higher fan speeds will be present as an indicator of the thermal problem.  Until the fix is applied, the error log and call home mechanism cannot be relied on to monitor for system thermal problems.
  • A problem was fixed for processor core checkstops that cause an LPAR outage but do not create hardware errors and service events.  The processor core is deconfigured correctly for the error.  This can happen if the hypervisor forces processor checkstops in response to excessive processor recovery.
  • A problem was fixed for the callout of a VPD collection fault and system termination with SRC 11008402 to include the 1.2vcs VRM FRU.  The power good fault for the 1.2 volts would be a primary cause of this error.  Without the fix, the VRM is missing from the callout list, which only has the VPDPART isolation procedure.
  • A problem was fixed for excessive logging of the SRC 11002610 on a power good (pgood) fault when detected by the Digital Power Subsystem Sweep (DPSS).  Multiple pgood interrupts are signaled by the DPSS in the interval between the first pgood failure and the node power down.  A threshold was added to limit the number of error logs for the condition.
  • A problem was fixed for redundant logging of the SRC B1504804 for a fan failure, once every five seconds.  With the fix, the failure is logged only at the initial time of failure in the IPL.
  • A problem was fixed to speed up recovery for VPD collection time-out errors for PCIe resources in an I/O drawer logged with SRC 10009133 during concurrent firmware updates.  With the fix, the hypervisor is notified as soon as the VPD collection has finished so the PCIe resources can report as available.  Without the fix, there is a delay as long as two hours for the recovery to complete.
  • A problem was fixed for a false unrecoverable error (UE) logged for B1822713 when an invalid cooling zone is found during the adjustment of the system fan speeds.  This error can be ignored as it does not represent a problem with the fans.
  • A problem was fixed for a processor clock failover error with SRC B158CC62 calling out all processors instead of isolating to the suspect processor.  The callout priority correctly has a clock and a procedure callout as the highest priority, and these should be performed first to resolve the problem before moving on to the processors.
  • A problem was fixed for loss of back-level protection during firmware updates if an anchor card has been replaced.  The Power system manufacturing process sets the minimum code level a system is allowed to have for proper operation.  If an anchor card is replaced, it is possible that the replacement anchor card is one that has the Minimum MIF Level (MinMifLevel) given as "blank", and this removes the system back-level protection.  With the fix, blanks or nulls on the anchor card for this field are handled correctly to preserve the back-level protection.  Systems that have already lost the back-level protection due to anchor card replacement remain vulnerable to an accidental downgrade of code level by operator error, so code updates to a lower level for these systems should only be performed under guidance from IBM Support.  The following command can be run from the Advanced System Management Interface (ASMI) to determine if the system has lost the back-level protection with the presence of "blanks" or ASCII 0x20 values for MinMifLevel:
    "registry -l cupd/MinMifLevel" with output:
    "cupd/MinMifLevel:
    2020202020202020 2020202020202020 [ ]
    2020202020202020 2020202020202020 [ ]"
  • A problem was fixed for a system checkstop caused by an L2 cache least-recently used (LRU) error that should have been a recoverable error for the processor and the cache.  The cache error should not have caused an L2 HW CTL error checkstop.
  • A problem was fixed that corrupted the Update Access Key (UAK) date, leaving it with a date of "1900".  The user should correct the UAK date, if needed, to allow the firmware update to proceed, by using the original UAK for the system.  On the Management Console, enter the original update access key via the "Enter COD Code" panel.  Or on the Advanced System Management Interface (ASMI), enter the original update access key via the "On Demand Utilities/COD Activation" panel.
  • A problem was fixed for PCIe switch recovery to prevent a partition switch failure during the IPL with error logs for SRC B7006A22 and B7006971 reported.  This problem can occur when doing recovery for an informational error on the switch.  If this problem occurs, the partition must be restarted to recover the affected I/O adapters.
  • A problem was fixed to correct the error messages for early failures in the Live Partition Mobility (LPM) migration of a partition.  The management console might report an unrelated error such as  "HSCLA27E The operation to lock the physical device location for target adapter" when the actual error might be not enough available memory on the target CEC to run the migration.  With the fix, the correct error code is returned so there is enough information to correct the error and retry the migration.
  • A problem was fixed for a hypervisor task hang during a FRU exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the entire drawer to power off and power on again.  The activation phase for the power on may never complete if a very rare sequence of events occurs during the power on step.  The FRUs to exchange that would cause the expansion drawer to power off  and power on are the following:  midplane, I/O module, I/O module VRM, chassis management card (CMC), cable card, and active optical cable.
  • A problem was fixed for PCIe adapter hangs and network traffic error recovery during Live Partition Mobility (LPM) and SR-IOV vNIC (virtual ethernet adapter)  operations.  An error in the PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all subsequent network traffic through the PHB.  The PHB  error recovery was enhanced to flush the PHB L3 cache to allow network traffic to resume.
  • A problem was fixed for a network boot/install failure using bootp in a network with switches using the Spanning Tree Protocol (STP).  A network boot/install using lpar_netboot on the management console was enhanced to allow the number of retries to be increased.  If the user is not using lpar_netboot, the number of bootp retries can be increased using the SMS menus.  If the SMS menus are not an option, the STP in the switch can be set up to allow packets to pass through while the switch is learning the network configuration.
  • A problem was fixed for a hypervisor adjunct partition that failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter configuration.  Without the fix, the system must be re-IPLed to correct the adjunct error.  This error is infrequent and can only occur if an adapter port configuration is being changed at the same time that error recovery is occurring for the adapter.
  • A problem was fixed for recovering from FSI interrupt overruns (too many FSI interrupts at one time that cause the service processor to go interrupt-bound and get stuck in a loop) that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the FSI interrupt generation is reset if a threshold is exceeded, allowing the service processor to continue normal processing.  The failure trigger is a rare hardware fault condition that does not persist in the service processor.
  • A problem was fixed for priority callouts for system clock card errors with SRC B158CC62.  These errors had high priority callouts for the system clock card and medium callouts for FRUs in the clock path.  With the fix, all callouts are set to medium priority as the clock card is not the most probable FRU to have failed but is just a candidate among the many FRUs along the clock path.
  • A problem was fixed for a degraded PCI link causing a processor core to be guarded if a non-cacheable unit (NCU) store time-out occurred with SRC B113E540 and PRD signature "(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB".  With the fix, the processor core is not guarded for the NCU error.  If this problem occurs and a core is deconfigured, clear the guard record and re-IPL to regain the processor core.  The solution for degraded PCI links is different from the fix for this problem, but a re-IPL of the CEC or a reset of the PCI adapters could help to recover the PCI links from their degraded mode.
  • A problem was fixed for an L2 cache error on the service processor that caused the service processor to reset or go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.  With the fix, the L2 cache error is handled as a corrected single-bit error with no error to the service processor, so it can continue normal processing.  The L2 cache data error that causes this failure is infrequent, and the service processor must exceed its limit of three resets in fifteen minutes before it fails, so the service processor failure rate for this problem is low.
  • A problem was fixed for an incorrect reduction in FRU callouts for Processor Run-time Diagnostic (PRD) errors after a reference oscillator clock (OSCC) error has been logged.  Hardware resources are not called out and guarded as expected.  Some of the missing PRD data can be found in the secondary SRC of B181BAF5 logged by hardware services.  The callouts that PRD would have made are in the user data of that error log.
  • A problem was fixed for error recovery from failed Live Partition Mobility (LPM) migrations.  The recovery error is caused by a partition reset that leaves the partition in an unclean state with the following consequences:  1) A retry on the migration for the failed source partition may not be allowed; and 2) With enough failed migration recovery errors, it is possible that any new migration attempts for any partition will be denied.  This error condition can be cleared by a re-IPL of the system.  The partition recovery error after a failed migration is much more likely to occur for partitions managed by NovaLink, but it can still occur for Hardware Management Console (HMC) managed partitions.
  • A problem was fixed for a Qualys network scan for security vulnerabilities causing a core dump in the Intelligent Platform Management Interface (IPMI)  process on the service processor with SRC B181EF88.  The error occurs anytime the Qualys scan is run because it sends an invalid IPMI session id that should have been handled and discarded without a core dump.
  • A security problem was fixed in the lighttpd server on the service processor, where a remote attacker, while attempting authentication, could insert strings into the lighttpd server log file.  Under normal operations on the service processor, this does not impact anything because the log is disabled by default.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
  • A security problem was fixed in OpenSSL for a possible service processor reset on a null pointer de-reference during RSA PSS signature verification.  The Common Vulnerabilities and Exposures issue number is CVE-2015-3194.
  • A problem was fixed to guard a failed processor core to allow the system to IPL.  The processor core chiplet FRU was failing to be called out and guarded on a RC_PMPROC_CHKSLW_ADDRESS_MISMATCH error and this prevented the system from being able to IPL.
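
The MinMifLevel check described in the back-level protection item above can be automated once the registry output has been captured as text.  The following is a minimal sketch, not an IBM-supplied tool: the function name is hypothetical, and it assumes the output has the same layout as the sample shown above (a label line followed by rows of hexadecimal bytes with a bracketed ASCII rendering).

```python
import re

def min_mif_level_is_blank(registry_output: str) -> bool:
    """Return True if every data byte in the captured dump is ASCII 0x20.

    Hypothetical helper: assumes the layout of the sample output in the
    release note (label line, then rows of hex bytes and a bracketed
    ASCII rendering).
    """
    data = bytearray()
    for line in registry_output.splitlines():
        # Ignore the bracketed ASCII rendering at the end of each row.
        hex_part = line.split("[", 1)[0]
        for token in hex_part.split():
            # Keep only tokens made entirely of hex byte pairs; this skips
            # the "cupd/MinMifLevel:" label line.
            if re.fullmatch(r"(?:[0-9A-Fa-f]{2})+", token):
                data.extend(bytes.fromhex(token))
    # No data at all is treated as "unknown", not as blank.
    return len(data) > 0 and all(byte == 0x20 for byte in data)

SAMPLE = """cupd/MinMifLevel:
2020202020202020 2020202020202020 [ ]
2020202020202020 2020202020202020 [ ]"""

if __name__ == "__main__":
    if min_mif_level_is_blank(SAMPLE):
        print("MinMifLevel is blank: back-level protection has been lost")
    else:
        print("MinMifLevel is set: back-level protection is present")
```

If the function returns True for a system, the guidance in the item above applies: downgrade that system only under direction from IBM Support.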

System firmware changes that affect certain systems

  • On multi-node systems with a power fault, a problem was fixed for On-Chip Controller (OCC) errors caused by the power fault being reported as predictive errors for SRC B1602ACB.  These have been corrected to be informational error logs.  If running without the fix, the predictive and unrecoverable errors logged for the OCC on loss of power to the node can be ignored.
  • On a multi-node system,  a problem was fixed for a power fault with SRC 11002610 having incorrect FRU callouts.  The wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node system.  Instead of calling out the processor FRU, the enclosure FRU is called out.  The first FRU callout is correct.
  • On PowerVM systems with dedicated processor partitions with low I/O utilization, a problem was fixed where the dedicated processor partition may become intermittently unresponsive.  The problem can be circumvented by changing the partition to use shared processors.
  • On systems where memory relocation (as done by using Live Partition Mobility (LPM)) and a partition reboot are occurring simultaneously, a problem for a system termination was fixed.  The potential for the problem existed between the active migration and the partition reboot.
  • On a system running an IBM i partition, a problem was fixed for a machine check incorrectly issued to an IBM i partition running 7.2 or later with 4K sector disks.  This problem only pertains to the IBM Power System S814 (8286-41A), S824 (8286-42A), E870 (9119-MME), and E880 (9119-MHE) models.
  • A problem was fixed that limited Virtual Functions (VFs) to a maximum of 50 on a single PCIe3 10GbE  adapter (feature codes #EN15, #EN16, #EN17, and #EN18; and CCINs 2CE3 and 2CE4) when 64 should have been allowed.  This problem only occurs for two of the SR-IOV capable slot locations in the Power Systems:  slot C4 in the PCIe3 I/O expansion drawer (#EMX0) and slot C7 in the Power System E850 (8408-E8E).
  • A problem was fixed for an extraneous PCIe switch SRC B7006A22 being called out when there is a valid PCIe  expansion drawer cable problem with SRC B7006A88 reported.  The callout for SRC B7006A22 should be ignored as the PCIe switch hardware is working for this case.
  • On a system with an AIX partition and a Linux partition, a problem was fixed for dynamically moving an adapter that uses DMA from the Linux partition to the AIX partition that caused AIX to fail by going into KDB mode (0c20 crash).  The management console showed the following message for the partition operation:  "Dynamic move of I/O resources failed.  The I/O slot dynamic partitioning operation failed.".  The error was caused by Linux using 64K mappings for the DMA window and AIX using 4K mappings for the DMA window, causing incorrect calculations in AIX when it received the adapter.  Until the fix is applied, the adapters that use DMA should only be moved from Linux to AIX when the partitions are powered off.  This problem does not pertain to Power System S812L (8247-21L), S822L (8247-22L), and S824L (8247-42L) models.
  • A problem was fixed for a Live Partition Mobility migration failure of a time reference partition (TRP) to a FW830 system when setting partition hibernate capable "false".  This happens any time the TRP partition is attempted to be migrated.  To circumvent the problem, set the partition's Time Reference Property to disabled and retry the migration.
  • On systems with a partition using Active Memory Sharing (AMS), a problem was fixed for a Live Partition Mobility (LPM) migration of the AMS partition that can hang the hypervisor on the target CEC.  When an AMS partition migrates to the target CEC, a hang condition can occur after processors are resumed on the target CEC, but before the migration operation completes.  The hang will prevent the migration from completing, and will likely require a CEC reboot to recover the hung processors.  For this problem to occur, there needs to be memory page-based activity (e.g. AMS dedup or Pool paging) that occurs exactly at the same time that the Dirty Page Manager's PSR data for that page is being sent to the target CEC.
  • On systems with an invalid P-side or T-side in the firmware, a problem was fixed in the partition firmware Run-Time Abstraction Services (RTAS) so that system Vital Product Data (VPD) is returned at least from the valid side instead of returning no VPD data.  This allows AIX host commands such as lsmcode, lsvpd, and lsattr that rely on the VPD data to work to some extent even if there is one bad code side.  Without the fix, all the VPD data is blocked from the OS until the invalid code side is recovered by either rejecting the firmware update or attempting to update the system firmware again.
  • On systems using PCIe adapters in SR-IOV mode, a problem was fixed for occasional B200F011 and B2009008 SRCs that can occur during an IPL, moving an adapter into SR-IOV mode, or with SR-IOV link up/down activity.
  • On systems using PCIe adapters in SR-IOV mode,  the following problems were addressed with a Broadcom Limited (formerly known as Avago Technologies and Emulex) adapter firmware update to 10.2.252.1905:  1) Eliminating virtual function (VF) transmit errors during VF resets and 2) Preventing  loss of legacy flow control when an adapter port is connected to a priority flow control (PFC) capable switch.
  • On systems with AIX or Linux encapsulated state partitions, a problem was fixed for a Live Partition Mobility migration failure for the encapsulated state partitions.  The migration fails on the target CEC when the associated paging space needed to support the encapsulated state is not available.  Removing the "Encapsulated State" attribute from the partition would allow the migration to succeed.  However, removing this attribute can only be accomplished if the partition is in the powered off state.  Encapsulated State partitions are needed for the remote restart feature.  An encapsulated state partition is a partition in which the configuration information and the persistent data are stored external to the server on persistent storage.  A partition that supports remote restart can be restarted remotely.  For more information on the remote restart feature, refer to this IBM Knowledge Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
  • Support was added to eliminate the yearly Utility COD renewal on systems using Utility COD.  The Utility COD usage is already monitored to make sure systems are running within the prescribed threshold limit of unreported usage, so a yearly customer renewal is not needed to manage the Utility COD processor usage.
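
The Linux-to-AIX adapter move item above attributes the failure to the two operating systems using different DMA window mapping sizes (64K and 4K).  The sketch below only illustrates why such a mismatch skews size calculations.  It is not the hypervisor, AIX, or Linux logic; the DmaWindow class, its fields, and the entry count are hypothetical.

```python
from dataclasses import dataclass

PAGE_4K = 4 * 1024
PAGE_64K = 64 * 1024

@dataclass(frozen=True)
class DmaWindow:
    """Hypothetical description of a DMA window as a table of mappings."""
    entries: int     # number of mapping entries in the window
    page_size: int   # bytes covered by each entry

    def span_bytes(self) -> int:
        """Address range the window really covers."""
        return self.entries * self.page_size

def span_as_seen_by(window: DmaWindow, assumed_page_size: int) -> int:
    """Range computed by a receiver that assumes its own page size."""
    return window.entries * assumed_page_size

if __name__ == "__main__":
    # Window set up by the sender with 64K mappings (illustrative numbers).
    window = DmaWindow(entries=1024, page_size=PAGE_64K)
    actual = window.span_bytes()
    assumed = span_as_seen_by(window, PAGE_4K)
    print(f"actual span:  {actual // (1024 * 1024)} MiB")
    print(f"assumed span: {assumed // (1024 * 1024)} MiB")
    print(f"mismatch factor: {actual // assumed}x")
```

With these illustrative numbers the receiver underestimates the window by a factor of 16, the ratio between the two mapping sizes.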
SC830_075_048 / FW830.11

11/11/15
Impact: Availability    Severity: HIPER

System firmware changes that affect all systems

  • HIPER/Pervasive:  A problem was fixed for recovering from embedded MultiMediaCard (eMMC) flash NAND errors that caused the service processor to go to a failed state with SRC B1817212 on systems with a single service processor.  On systems with redundant service processors, the failed service processor would get guarded with a B151E6D0 or B152E6D0 SRC depending on which service processor fails.
  • HIPER/Pervasive: A problem associated with workloads using transactional memory on PowerVM was discovered and is fixed in this service pack. The effect of the problem is non-deterministic but may include undetected corruption of data.
  • DEFERRED:  A problem was fixed for memory on-die termination (ODT) settings to improve the signal integrity of the memory channel.
  • A problem was fixed for recovery from unaligned addresses for MSI interrupts from PCIe adapters.  The recovery prevents an adapter timeout caused by resource exhaustion.  With the fix, the resources for each bad interrupt are returned, allowing the PCIe adapter to continue to run for the normal traffic.
  • A problem was fixed for an Operations Panel SRC of B1504804 with no FRU callout.  A callout of the failed hardware has been added.
  • A problem was fixed to prevent recoverable power faults of short duration from causing the system to lose power supply redundancy.  Without the fix, the faulted state persisted for the recovered power fault, causing a problem with a system power off if other power supplies were lost at a later time.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) link failure with SRC B7006A8B.  The settings for the continuous time linear equalizers (CTLE) were adjusted to improve the incoming signal strength and the stability of the links.  The expansion drawer must be power cycled or the CEC can be re-IPLed for the fix to activate.
  • A problem was fixed for recovery from a processor local bus (PLB) hang on the service processor.  The errant PLB hang recovery would be seen in concurrent firmware updates that, on rare occasions, fail to do a side switch to activate to the new level of firmware.  On the management console, the error message would be "HSCF010180E Operation failed ... E302F873 is the error code."  Other than the failed code level activation, the firmware update is successful.  If this problem occurs, the system can be set to the new firmware level by doing a power off from the management console and then doing a power on with side switch selected in the advanced properties.

System firmware changes that affect certain systems

  • A problem was fixed for the System Feature Code for the E870 (9119-MME) being displayed as "EPBB" by IBM i "DSPSYSVAL QPRCFEAT"  when it should be "EPBA".  This created a problem for certain IBM i software packages whose license was tied to the System Feature Code.  This fix has a concurrent activation.  For FW830.10, a similar, non-concurrent fix for the feature codes was made but the System Feature Code, as seen in IBM i  partitions, did not update immediately.
SC830_068_048 / FW830.10

09/10/15
Impact: Availability    Severity: HIPER

New features and functions

  • The firmware code update process was enhanced with a feature to block a firmware "downgrade" to a level that is below the system's manufactured code level.

System firmware changes that affect all systems

  • HIPER/Pervasive:DEFERRED:  A problem was fixed for a TCP/IP performance degradation on PCIe ethernet adapters with Remote Direct Memory Access (RDMA) over Converged Ethernet (RoCE).  By adjusting the system memory caching, a significant improvement was made to the data throughput speed to restore performance to expected levels.  This fix requires a system re-IPL to take effect.  This problem affects the E850 (8408-E8E), E870 (9119-MME), and E880 (9119-MHE) systems.
  • HIPER/Pervasive:  A problem was fixed for an ethernet adapter hanging on the service processor.  This hang prevents TCP/IP network traffic from the management console and the Advanced System Management Interface (ASMI) browsers.  It makes it appear as if the service processor is unresponsive and can be confused with a service processor in the stopped state.  An A/C power cycle would recover a hung ethernet adapter.
  • HIPER/Pervasive:  A problem was fixed for missing the interrupts for processor local bus (PLB) time-outs.  This problem could hang the service processor or cause it to panic with a reset/reload of the service processor.  There is a possibility the reset of the service processor could take it to a stopped state where the service processor would be unresponsive.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed for a system reset to clear the boot registers to prevent the reset from being mishandled as a chip reset.  If a "system reset" is misinterpreted as a "chip reset", the boot of the service processor can go inadvertently to a stopped state and be unresponsive.  Pin-hole resets from the operations panel could also fail to the service processor stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed so a corrupted file system partition table can be recovered and not have the service processor lose the ability to do P and T-side switches.  In error recovery situations, the loss of the side-switch option could present itself as an unresponsive service processor if it was needed to prevent a failure to the service processor stopped state.
  • HIPER/Pervasive:  A problem was fixed for a runaway interrupt request (IRQ) condition that caused the service processor to go to a stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • HIPER/Pervasive:  A problem was fixed for a dump partition full condition that caused the service processor to go to a stopped state.  In the service processor stopped state, any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.
  • DEFERRED:  A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) link failure with SRC B7006A8B.  Data packet send retries were increased and link recovery was enabled to improve the stability of the links.  The CEC must be re-IPLed for the fix to activate.
  • A problem was fixed for an SRC 11002613 logged during a concurrent repair of a power supply.  This SRC was erroneously logged and did not represent a real problem.
  • A problem was fixed for an intermittent SRC B1504804 logged on a re-IPL of the CEC that did not result in an IPL failure.
  • A problem was fixed for the capture of the registers for the Hostboot Self-Boot Engine (SBE) for SBE failures.  These registers had been missing from failure data for SBE failures, making these problems more difficult to debug.
  • A problem was fixed to remove an unnecessary delay in the system IPL to reduce the time needed to IPL by 30 seconds.
  • A problem was fixed for an unneeded error log with SRC B181DB04 that occurred in a failed IPL for a normal condition of lost PNOR flash access after a reIPL process had started and taken over the access.
  • A problem was fixed for an Advanced System Management Interface (ASMI) error message of "Error in function 'connect', error code 111" when a browser attempted to connect before the service processor was ready.  The browser connection through the web server is now held off until the ASMI process is ready after a reset of the service processor or an AC power cycle of the system.
  • A problem was fixed for an incorrect call home for SRC B1818A0F.  There was no real problem so this call home should have been ignored.
  • A problem was fixed for a dump reIPL that failed with SRC B1818601 and B181460B after processor checkstops had terminated the system.
  • A problem was fixed for an infrequent service processor database corruption during concurrent firmware update that caused the system to terminate.
  • A problem was fixed for a failed PCI oscillator that was not guarded, causing repeated errors with SRC B15050A6 and B158E504 logged on each IPL of the system.
  • A problem was fixed for a local clock card (LCC)  failure with SRC 11001515 that was missing a part number and location code.  This information has been added for LCC faults so the FRU to replace is properly identified.
  • A problem was fixed for a defective PCI oscillator in the local clock card (LCC) with SRC BC58090F that caused an IPL failure for the node instead of failing over to the redundant LCC.
  • A problem was fixed for a service processor dump with error logs  B181E911 and B181D172 during an IPL.  The error logs were for the detection of defunct processes but otherwise the IPL was successful.
  • A problem was fixed for Digital Power Subsystem Sweep (DPSS) firmware updates that caused an error log with SRC B1819906 but otherwise was successful.
  • A problem was fixed for missing Keyword (KW) and Resource ID (RID) for SRC B181A40F.
  • A problem was fixed for an I2C bus lock error during a CEC power off that caused a ten minute delay for the power off and error log SRCs B1561314 and B1814803 with error number (errno) 3E.
  • A problem was fixed for the System Feature Code for the E870 (9119-MME) being displayed as "EPBB" by IBM i "DSPSYSVAL QPRCFEAT" when it should be "EPBA".  This created a problem for certain IBM i software packages whose license was tied to the System Feature Code.  The System Feature Code, as seen in IBM i partitions, does not update immediately with concurrent activation of the fix pack, but it will eventually change to the correct "EPBA" value within 24 hours.  If it is necessary to see the new System Feature Code value immediately, a re-IPL of the system is needed.
  • A problem was fixed for concurrent firmware updates to a system that needed to be re-IPLed after getting a B113E504 SRC during activation of the new firmware level on the hypervisor.  The code update activate failed if the Sleep Winkle (SLW) images were significantly different between the firmware levels.  The SLW contains the state of the processor and cache so it can be restored after sleep or power saving operations.
  • A problem was fixed for System Power Control Network (SPCN) failover for an I/O module A/C power fault on the PCIe3 I/O expansion drawer (#EMX0).  A sideband failure on one I/O module was blocking SPCN commands for the entire drawer instead of SPCN failing over to a working I/O module.  The broken SPCN communications path prevented concurrent maintenance operations on the expansion drawer.
  • A problem was fixed for a possible lack of recovery for an A/C power loss condition on the PCIe3  I/O expansion drawer (#EMX0).   If there was an outstanding problem on the expansion drawer and an A/C loss occurred while the earlier error was still unprocessed, the auto-recovery for the A/C power loss would not have happened.
  • A problem was fixed for a missing FRU call out for error SRC B7006A87  when unable to read the drawer module logical flash VPD for the PCIe3 I/O expansion drawer (#EMX0).
  • For a partition that has been migrated with Live Partition Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a Main Storage Dump (MSD) IPL failing with SRC B2006008.  The MSD IPL can happen after a system failure and is used to collect failure data.  If the partition is rebooted anytime after the migration, the problem cannot happen.  The potential for the problem existed between the active migration and a partition reboot.
  • A problem was fixed for partial loss of Entitlement for On/Off Memory Capacity On Demand (also called Elastic COD).  Users with large amounts of Entitlement on the system of greater than "65535 GB * Days" could have had a truncation of the Entitlement value on a re-IPL of the system.  To recover lost Entitlement, the customer can request another On/Off Enablement Code from IBM support to "re-fill" their entitlement.
  • A problem was fixed for a management console command line failure with a return code 0x40000147 (invalid lock state) when trying to delete SR-IOV shared mode configurations.  This could have occurred if the adapter slot had been re-purposed without involvement of the management console and was owned and operational at the time of the requested delete.  With the fix, the current ownership of the slot is honored and only the SR-IOV shared mode configuration data is deleted on the force delete.
  • A problem was fixed for an incorrect restriction on the amount of "Unreturned" resources allowed for a Power Enterprise Pool (PEP).  PEP allows resources (processors and memory) to be moved logically from one server to another.  Part of this is 'borrowing' resources from one server to move to another.  This may result in "Unreturned" resources on the source server.  The management console controls how many total "Unreturned" PEP resources can exist.  For this problem, the user had some "Unreturned" PEP memory and asked to borrow more, but this request was incorrectly refused by the hypervisor.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0) error with SRCs  B7006A82 and B7004137 for a missing FRU location code.  The FRU location code for the Active Optical Cable (AOC)  was added to identify the failing drawer side.
  • A problem was fixed for a PCIe3 I/O expansion drawer (#EMX0)  failing to IPL when the IPL includes a FPGA update for the drawer.  The FPGA update is actually good but perceived as a failure when the FPGA resets as part of the update.  For the problem, a re-IPL of the system would have fixed the drawer.
  • A problem was fixed for Live Partition Mobility (LPM) to prevent a memory access error during LPM operations with unpredictable effects.  When data is moved by LPM, the underlying firmware code requires that the buffers be 4K aligned.  The fixes now force the buffers to be 4K aligned and, if there is still an alignment issue, the LPM operation will fail without impacting the system.
  • A problem was fixed for an On-Chip Controller (OCC) failure after a system dump with SRCs B18B2616 and BC822024 reported.  This resulted in the system running with reduced performance in safe mode, where processor clock frequencies are lowered to minimum levels to avoid hardware errors since the OCC is not available to monitor the system.   A re-IPL of the system would have resolved the problem.
  • A  performance problem was fixed for systems entering processor hang recovery prematurely with SRC B111E504 and PBCENTFIR(9) "PB_CENT_HANG_RECOV".  The ability of the L3 cache to prefetch memory was extended to speed the memory accesses and prevent a processor hang condition for applications running with lower memory affinity.
  • A problem was fixed for a processor error causing a Hostboot terminate instead of a deconfiguration of the bad hardware and continuation of the IPL.  The state of the processors was synchronized between the service processor and the Hostboot process to correct the error.
  • A problem was fixed for a USB Save and Restore of machine configuration to not lose the system name.
  • A problem was fixed for Advanced System Management Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity" being missing with the system IPLed and partitions running.  The help text is now available for the system in the powered on state as well as in the powered off state.
  • A problem was fixed for an intermittent power supply error SRC 1100D008 with a flood of VPD SRC B1504804 with errno 3Es logged on a re-IPL of the CEC.  The errors did not result in an IPL failure.
  • A problem was fixed for an LED intermittently not lighting for an enclosure with a fault.
  • A problem was fixed for an intermittent PSI link error with SRC B15CDA27 after a firmware update or reset/reload of the service processor.
  • A problem was fixed for PCIe3 adapters failing when requesting more than 32 Message Signaled Interrupts (MSI-X).  The adapter may fail to ping or cause OS tasks that are using the adapter to hang.  This problem was found specifically on the 10 Gb Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and #5769 and on the 56 Gb InfiniBand (IB) Fourteen Data Rate (FDR) adapter with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.  However, other PCIe adapters may also be affected.
  • A problem was fixed for IBM copyright statements being displayed on the System Management Services (SMS) menu after a repair or replacement of system hardware.
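
The 4K buffer alignment requirement described in the Live Partition Mobility (LPM) fix in the list above can be illustrated with a minimal sketch.  This is not the firmware's code; the function names and the use of plain integers as addresses are assumptions made for illustration only.  It shows the two behaviors the fix describes: forcing a buffer address onto a 4K boundary, and failing the operation cleanly if alignment still does not hold.

```python
PAGE_SIZE = 4096  # the 4K boundary named in the fix description


def is_4k_aligned(address: int) -> bool:
    """Return True when the address falls exactly on a 4K boundary."""
    return address % PAGE_SIZE == 0


def align_up_4k(address: int) -> int:
    """Round a non-negative address up to the next 4K boundary (unchanged if already aligned)."""
    return (address + PAGE_SIZE - 1) & ~(PAGE_SIZE - 1)


def prepare_buffer(address: int) -> int:
    """Hypothetical helper: force alignment, then fail only this operation if it still does not hold."""
    aligned = align_up_4k(address)
    if not is_4k_aligned(aligned):
        # Mirrors the described behavior: the operation fails without impacting the system.
        raise ValueError("buffer is not 4K aligned; failing the operation")
    return aligned
```

For example, prepare_buffer(0x1001) returns 0x2000, while an address that is already aligned, such as 0x1000, is returned unchanged.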

System firmware changes that affect certain systems

  • HIPER/Pervasive:  For partitions with a graphics console and USB keyboard, a problem was fixed for an OS boot hang at the CA00E100 progress SRC.  For the problem, the hang can be avoided by issuing the boot command from the Open Firmware (OF) prompt.
  • HIPER/Pervasive:  On systems using PowerVM with shared processor partitions that are configured as capped or in a shared processor pool, there was a problem found that delayed the dispatching of the virtual processors which caused performance to be degraded in some situations.  Partitions with dedicated processors are not affected.   The problem is rare and can be mitigated, until the service pack is applied, by creating a new shared processor AIX or Linux partition and booting it to the SMS prompt; there is no need to install an operating system on this partition.  Refer to help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863 for additional details.
  • DEFERRED:  A problem was fixed for Non-Volatile Memory express (NVMe) adapters, plugged into PCIe3 switches, mis-training to generation 1 instead of generation 3.   NVMe adapters attached directly to the PCIe3 slots trained correctly to the generation 3 specification. This fix requires a re-IPL of the system to correct the training of any mis-trained adapters.
  • On multiple-node systems, a problem was fixed for a missing location code, part, and serial number for a faulty symmetric multiprocessing (SMP) cable in the call home B1504922 error log.
  • On multiple-node systems, a problem was fixed for a two-hour IPL hang in Hostboot caused by multiple B18ABAAB errors from more than one node.  The Hostboot process failed to go into its reconfiguration loop to do error recovery and continue the IPL.
  • On a system with redundant service processors, a problem was fixed for an IPL failure for a bad service processor cable on the primary service processor with SRCs B1504904 and B18ABAAB logged.  The system should have performed an error failover to the backup service processor and continued the IPL to get the partitions running.
  • On a system with redundant service processors where redundancy is disabled, a problem was fixed for an unrecoverable (UE) SRC B181DA19 being logged on a re-IPL after a checkstop error.  The error log did not interfere with the re-IPL, which was successful.
  • On multiple-node systems, a problem was fixed for extraneous error logs after a 12V power fault.  After termination, there were additional 110026Bx error log entries that should have been ignored.
  • On a system with redundant service processors, a problem was fixed for the isolation procedures for an Anchor card error and system VPD collection failure with termination SRC B181A40F.  FSPSP04 and FSPSP06 are no longer called out as part of reporting the VPD collection failure.  FSPSP30 has been updated with isolation steps for this problem, is now called out, and should be used for the problem isolation.  Retain tip H213935 also provides the FRU isolation steps.  Procedure FSPSP30 replaces the service processor first.  If that does not work, the procedure has the Anchor card replaced.
  • On multiple-node systems, a problem was fixed to isolate a power fault during IPL to the specific node and guard the node, and allow the rest of the system to IPL.  Previously, the power fault would not be localized to the problem node and it caused the IPL of all the nodes of the system to fail.
  • On a system with redundant service processors, a problem was fixed for failovers to the backup service processor that caused an On-Chip Controller (OCC) abort.  This placed the CEC in a "safe" mode where it ran at reduced processor clock frequencies to prevent exceeding the power limits while not under OCC control.
  • On a system with an IBM i partition using Active Memory Sharing (AMS), a problem was fixed for internal memory management errors caused by deleting an IBM i partition that had been powered off in the middle of a Main Storage Dump (MSD).  Until the fix is installed, if an MSD is interrupted for an IBM i partition that has AMS, the partition should be powered on and powered off normally before a delete of the partition is done to prevent errors with unpredictable effects.  This problem does not affect the S822 (8284-22A), S812L (8247-21L), S822L (8247-22L), S824L (8247-42L), and E850 (8408-E8E) models.
  • On a system with redundant service processors, a problem was fixed for a failover to the backup service processor during a power off of the CEC that caused a hypervisor time-out with SRC B182953C.  This error was caused by a delay in synchronizing the state of the hypervisor to the backup service processor but it did not prevent the power off from completing successfully.
  • On a system with redundant service processors, a problem was fixed for a firmware update causing an error log server dump with SRC B1818601.  The error log server restarted automatically to recover from the error and the firmware update was successful.

SC830_048_048 / FW830.00

06/08/15
Impact:  New      Severity:  New

New Features and Functions

NOTE:
  • POWER8 (and later) servers include an “update access key” that is checked when system firmware updates are applied to the system.  The initial update access keys include an expiration date which is tied to the product warranty.  System firmware updates will not be processed if the calendar date has passed the update access key’s expiration date, until the key is replaced.  As these update access keys expire, they need to be replaced using either the Hardware Management Console (HMC) or the Advanced System Management Interface (ASMI) on the service processor.  Update access keys can be obtained via the key management website: http://www.ibm.com/servers/eserver/ess/index.wss.
  • Support for Little Endian (LE) Linux in PowerVM.  With PowerVM LE guest support, all three Linux on Power distribution partners (SUSE, Canonical, and Red Hat) with LE operating systems can run on the same IBM Power Systems.
  • Support for allowing the PowerVM hypervisor to continue to run after the service processor has become unresponsive with a SRC B1817212.  Any active partitions will continue to run but they will not be able to be managed by the management console.  The partitions can be allowed to run until the next scheduled service window at which time the service processor can be recovered with an AC power cycle or a pin-hole reset from the operator panel.  This error condition would only be seen on a system that had been running with a single service processor (no redundancy for the service processor).
  • Support for three and four node configurations of the E880 (9119-MHE) system.
  • Support for an increase of the maximum number of PCIe 3 I/O expansion drawers (#EMX0) that can be attached to an E870 /E880 node from two to four.
  • Support for Single Root I/O Virtualization (SR-IOV) that enables the hypervisor to share an SR-IOV-capable PCI-Express adapter across multiple partitions.  Twelve Ethernet adapters are supported with the SR-IOV NIC capability when placed in the P8 system (SR-IOV is supported in both native mode and through VIOS):
    - PCIe3 4-port 10GbE SR Adapter (F/C EN15 and CCIN 2CE3)
    - PCIe3 4-port 10GbE SR Adapter (F/C EN16 and CCIN 2CE3).  Fits E870/E880 system node PCIe slot.
    - PCIe3 4-port 10GbE SFP+ Copper Adapter (F/C EN17 and CCIN 2CE4)
    - PCIe3 4-port 10GbE SFP+ Copper Adapter (F/C EN18 and CCIN 2CE4).  Fits E870/E880 system node PCIe slot.
    - PCIe2 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EN0H and CCIN 2B93)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EN0J and CCIN 2B93)
    - PCIe2 LP Linux 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+ Adapter (F/C EL38 and CCIN 2B93)
    - PCIe2 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter (F/C EN0M and CCIN 2CC0)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter (F/C EN0N and CCIN 2CC0)
    - PCIe2 4-port (10Gb FCoE & 1GbE) SFP+ Copper and RJ45 Adapter (F/C EN0K and CCIN 2CC1)
    - PCIe2 LP 4-port (10Gb FCoE & 1GbE) SFP+ Copper and RJ45 Adapter (F/C EN0L and CCIN 2CC1)
    - PCIe2 LP Linux 4-port (10Gb FCoE & 1Gb Ethernet) SFP+ Copper and RJ45 (F/C EL3C and CCIN 2CC1)
    These adapters each have four ports, and all four ports are enabled with SR-IOV function. The entire adapter (all four ports) is configured for SR-IOV or none of the ports is.
    System firmware updates the adapter firmware level on these adapters to 10.2.252.16 when a supported adapter is placed into SR-IOV mode.
    Support for SR-IOV adapter sharing is now available for adapters in the PCIe3 I/O Expansion Drawer with F/C #EMX0.
    SR-IOV NIC on the Power P8 systems is supported by:
        - AIX 6.1 TL9 SP4 and APAR IV63331, or later
        - AIX 7.1 TL3 SP4 and APAR IV63332, or later
        - IBM i 7.1 TR8, or later (Supported on S824/S814)
        - IBM i 7.2  or later  (Supported on S824/S814)
        - IBM i 7.1 TR9, or later (Supported on E870/E880)
        - IBM i 7.2 TR1, or later  (Supported on E870/E880)
        - Red Hat Enterprise Linux 6.5 or later (Supported on E870/E880/S812L/S822/S822L/S814/S824/S824L except for adapters with F/Cs EN15/EN16/EN17/EN18)
        - Red Hat Enterprise Linux 6.6, or later (Supported on E850 and minimum level needed for adapters with F/Cs EN15/EN16/EN17/EN18)
        - Red Hat Enterprise Linux 7.1, or later
        - SUSE Linux Enterprise Server 11 SP1 or later  (Supported on S812L/S822/S822L/S814/S824/S824L)
        - SUSE Linux Enterprise Server 11 SP3 or later  (Supported on E870/E880)
        - SUSE Linux Enterprise Server 12, or later  (Supported on E850)
        - Ubuntu 15.04 or later (Supported on E850/S812L/S822/S822L/S814/S824/S824L) 
        - VIOS 2.2.3.4 with interim fix IV63331, or later
  • Support for an upgrade from 8-core processors to 12-core processors for the E880 (9119-MHE) system.
  • Support for adjusting voltage regulators input voltage dynamically based on regulator slave failures to achieve the optimal voltage for system operation for normal and degraded conditions.
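
The update access key check described in the NOTE at the top of this list amounts to a date comparison.  The following is a minimal sketch of that rule only; the function name and the way the dates are supplied are assumptions for illustration, and it does not represent how the HMC or the service processor actually reads or validates a key.

```python
from datetime import date


def update_allowed(today: date, key_expiration: date) -> bool:
    """Hypothetical check: updates are processed only while the calendar
    date has not passed the update access key's expiration date."""
    return today <= key_expiration
```

With an expiration date of 2016-01-01, an update attempted on 2015-06-08 would pass this check, and one attempted on 2016-01-02 would not until the key is replaced.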

System firmware changes that affect all systems

  • A problem was fixed to eliminate unneeded guard data from call home messages for the cases where there is no hardware error in the system.
  • On systems with redundant service processors, a problem was fixed in the run-time error failover to the backup service processor so it does not terminate on FRU support interface (FSI) errors.  In the case of FSI errors on the new primary service processor, the primary will do a reset/reload instead of a terminate.
  • A problem was fixed to call home guarded FRUs on each IPL.  Only the initial failure of the hardware was being reported to the error log.
  • Support was added to the Advanced System Management Interface (ASMI) USB menu to allow a system dump to be collected to USB with the power on to the system.  This allows the dump to be collected with the system memory state intact.
  • A problem was fixed for the service processor error log handling that caused SRC B150BAC5 errors when converting an error log entry from an object into a flattened array of bytes.
  • A problem was fixed that prevented a second management console from being added to the CEC.  In some cases, network outages caused defunct management console connection entries to remain in the service processor connection table, making connection slots unavailable for new management consoles.  A reset of the service processor could be used to remove the defunct entries.
  • A problem was fixed to eliminate a false error log and call home for a SRC 1100154F fan fault caused by an unplugged power cable.
  • A problem was fixed for a highly intermittent IPL failure with SRC B18187D9 caused by a defunct attention handler process.  For this problem, the IPL will continue to fail until the service processor is reset.
  • A problem was fixed for missing FRU information in SRC 11001515.  SRC 11001515 was logged indicating replacement of power supply hardware, but did not include the location code, the part number, the CCIN, or the serial number.
  • A problem was fixed for systems with a corrupted date of "1900" showing for the Update Access Key (UAK).  The firmware update is allowed to proceed on systems with a bad UAK date because the fix is in an emergency service pack.  After the fix is installed, the user should correct the UAK date, if needed, by using the original UAK for the system.  On the management console, enter the original update access key via the "Enter COD Code" panel or, on the Advanced System Management Interface (ASMI), via the "On Demand Utilities/COD Activation" panel.
  • A problem with concurrent PCIe adapter maintenance was fixed that caused On-Chip Controller (OCC) resets with SRCs logged of B18B2616 and BC822029, forcing the system into safe mode (processor voltage/frequency reduced to a "safe" level where thermal monitoring is not required).  Recovery from safe mode requires a system re-IPL.
  • A problem was fixed for I/O adapters so that BA400002 errors were changed to informational for memory boundary adjustments made to the size of DMA map-in requests.  These DMA size adjustments were marked as UE previously for a condition that is normal.

System firmware changes that affect certain systems

  • On systems in PowerVM mode, a problem was fixed for unresponsive PCIe adapters after a partition power off or a partition reboot.
  • On systems using Virtual Shared Processor Pools (VSPP), a problem was fixed for an inaccurate pool idle count over a small sampling period.
  • On systems with partitions using shared processors, a problem was fixed that could result in latency or timeout issues with I/O devices.
  • On systems using PowerVM, a problem was fixed for a hypervisor deadlock that results in the system being in an "Incomplete state" as seen on the management console.  This deadlock is the result of two hypervisor tasks using the same locking mechanism for handling requests between the partitions and the management console.  Except for the loss of the management console control of the system, the system is operating normally when the "Incomplete state" occurs.
  • On systems with memory mirroring enabled, a problem was fixed for PowerVM over-estimating its memory needs.  The fix allows more memory to be used by the partitions.
  • On systems using PowerVM, a problem was fixed for the handling of the error of multiple cache hits in the instruction effective-to-real address translation cache (IERAT).  A multi-hit IERAT error was causing system termination with SRC B700F105.  The multi-hit IERAT is now recognized by the hypervisor and reported to the OS where it is handled.
  • On systems using PowerVM, a problem was fixed to allow booting off an iSCSI device.  For the failure, the partition firmware error logs had SRC BA012010 "Opening the TCP node failed." and SRC BA010013 "The information in the error log entry for this SRC provides network trace data."  The open firmware standard output trace showed SRC BA012014  "The TCP re-transmission count of 8 was exceeded. This indicates a large number of lost packets between this client and the boot or installation server" followed by SRC BA012010.
  • On systems using PowerVM, support was added for USB 2.0 HUBs so that a keyboard plugged into the USB 2.0 HUB will work correctly at the SMS menus.  Previously, a keyboard plugged into a USB 2.0 HUB was not a recognized device.

4.0 How to Determine The Currently Installed Firmware Level

You can view the server's current firmware level on the Advanced System Management Interface (ASMI) Welcome pane. It appears in the top right corner. Example: SC830_123.


5.0 Downloading the Firmware Package

Follow the instructions on Fix Central. You must read and agree to the license agreement to obtain the firmware packages.

Note: If your HMC is not connected to the internet, you will need to download the new firmware level to a USB flash memory device or FTP server.


6.0 Installing the Firmware

The method used to install new firmware will depend on the release level of firmware which is currently installed on your server. The release level can be determined by the prefix of the new firmware's filename.

Example: SCxxx_yyy_zzz

Where xxx = release level
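
As a rough illustration of the naming pattern above, the release level can be extracted from a level name with a short script.  This is a sketch under stated assumptions: the function and field names are hypothetical, the pattern accepts only bare level names such as SC830_048_048 or the shorter SC830_123 form shown by ASMI, and it does not handle any additional prefix or file extension that a downloaded file may carry.  Only xxx is interpreted; the yyy and zzz fields are returned as-is.

```python
import re

# Two uppercase letters, a three-digit release level, then one or two
# three-digit fields separated by underscores (the last field is optional).
LEVEL_PATTERN = re.compile(r"^([A-Z]{2})(\d{3})_(\d{3})(?:_(\d{3}))?$")


def parse_firmware_level(name: str) -> dict:
    """Split a level name of the form SCxxx_yyy_zzz into its fields."""
    match = LEVEL_PATTERN.match(name)
    if match is None:
        raise ValueError(f"unrecognized firmware level name: {name!r}")
    prefix, release, yyy, zzz = match.groups()
    return {"prefix": prefix, "release": release, "yyy": yyy, "zzz": zzz}
```

For example, parse_firmware_level("SC830_048_048")["release"] yields "830", which can then be compared with the release level currently installed on the server.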

Instructions for installing firmware updates and upgrades can be found at http://www-01.ibm.com/support/knowledgecenter/9119-MHE/p8ha1/updupdates.htm

IBM i Systems:

For information concerning IBM i Systems, go to the following URL to access Fix Central: 
http://www-933.ibm.com/support/fixcentral/

Choose "Select product", under Product Group specify "System i", under Product specify "IBM i", then Continue and specify the desired firmware PTF accordingly.

7.0 Firmware History

The complete Firmware Fix History for this Release Level can be reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html