VL910_122_089 / FW910.20
12/12/18 |
Impact: Data
Severity: HIPER
New features and functions
- Support was
enabled for eRepair spare lane deployment for fabric and memory buses.
System firmware changes that affect all systems
- HIPER/Non-Pervasive:DEFERRED:
A problem was fixed for a potential problem with I/O that could
result in undetected data corruption.
- DEFERRED: A
problem was fixed for DASD VRM reduced stability margins leading to a
possible system shutdown due to temperature component aging over a long
period of time. The DASD VRM is not updated with the fix until
after the system IPLs from a powered off state. It is recommended
that this fix be activated as soon as possible but fix activation
should not be delayed for more than three months.
- DEFERRED: A
problem was fixed for PCIe and SAS adapters in slots attached to a PLX
(PCIe switch) failing to initialize and not being found by the
Operating System. The problem should not occur on the first IPL
after an AC power cycle, but subsequent IPLs may experience the problem.
- DEFERRED: A
problem was fixed for the PCIe3 I/O expansion drawer (#EMX0) links to
improve stability. Intermittent training failures on the
links occurred during the IPL with SRC B7006A8B logged. With the
fix, the link settings were changed to lower the peak link signal
amplification to bring the signal level into the middle of the
operating range, thus improving the high margin to reduce link training
failures. The system must be re-IPLed for the fix to activate.
Without the fix, the system can be powered off and then re-IPLed to
restore the PCIe links.
- DEFERRED:
A problem was fixed for concurrent maintenance operations for PCIe
expansion drawer cable cards and PCI adapters that could cause
loss of system hardware information in the hypervisor with these side
effects: 1) partition secure boots could fail with SRC BA540100
logged; 2) Live Partition Mobility (LPM) migrations could be blocked;
3) SR-IOV adapters could be blocked from going into shared mode; 4)
Power Management services could be lost; and 5) warm re-IPLs of the
system can fail. The system can be recovered by powering off and
then IPLing again.
- DEFERRED: A
problem was fixed for predictive error logs occurring on the IPL
following a DIMM error recovery. These logs, related to failed
memory scrubbing, have the following "Signature Description":
"mba(n0p15c1) () ERROR: command complete analysis failed". These
error logs do not indicate a hardware problem and may be ignored.
- A problem was fixed for link speed for PCIe Generation 4
adapters showing as "unknown" in the Advanced System Management
Interface (ASMI) PCIe Hardware Topology menu.
- A problem was fixed for differential memory interface (DMI)
lane sparing to prevent shutting down a good lane on the TX side of the
bus when a lane has been spared on the RX side of the bus. If
the XBUS or DMI bus runs out of spare lanes, it can checkstop the
system, so the fix helps use these resources more efficiently.
- A problem was fixed for IPL failures with SRC
BC50090F when replacing Xbus FRUs. The problem occurs if VPD has
a stale bad lane record and that record does not exist on both ends of
the bus.
- A problem was fixed for a firmware update concurrent remove
and activate that fails in the hypervisor during the activate with SRC
B7000AFF. To recover the system, do a re-IPL and it will be at
the correct firmware level that is expected for the remove operation.
- A problem was fixed for a flood of BC130311 SRCs that could
occur when changing Energy Scale Power settings, if the Power
Management is in a reset loop because of errors.
- A problem was fixed for SR-IOV adapter workloads being
suspended with SRC B400FF01 logged while an internal reset of SR-IOV
virtual function in the hypervisor occurs. This problem is
infrequent and caused by heavy workloads for the adapter or vNIC
failovers. The workloads resume after the virtual function reset
without user intervention.
- A problem was fixed for SR-IOV VFs, where a VF configured
with a PVID priority may be presented to the OS with an incorrect
priority value.
- A problem was fixed for the creation of a vNIC adapter that
may show the MAC address twice and cause confusion. For the AIX
OS, the duplicate MAC address shows on the entstat output. No
recovery is needed for this error except to ignore the extra MAC
address in the ethernet adapter status.
- A problem was fixed to reduce the time to reach a "failed"
status on an SR-IOV adapter for certain persistent errors.
Without the fix, the adapter spends an extended period of time in the "not
ready" state, eventually reaching the "failed" state. With
the fix, the adapter is able to go to the "failed" state in less than
30 seconds for the persistent fault.
- A problem was fixed for a SR-IOV Virtual Function (VF)
configured with a PVID that fails to function correctly after a
VF reset. It will allow the receiving of untagged frames but not
be able to transmit the untagged frames.
- A problem was fixed for a SMS ping failure for a SR-IOV
adapter Virtual Function (VF) with a non-zero Port VLAN ID
(PVID). This failure may occur after the partition with the
adapter has been booted to AIX, and then rebooted back to SMS.
Without the fix, residue information from the AIX boot is retained for
the VF that should have been cleared.
- A problem was fixed for SRCs B400FF01 and B200F011
experienced for false SR-IOV adapter errors during Live Partition
Mobility (LPM) migrations of a logical partition with vNIC
clients. The SR-IOV adapter does recover from the errors but
there is a delay in the adapter communications while the adapter
recovers. These errors can be ignored when evaluating the outcome
of a LPM migration.
- A problem was fixed for partition SMS menus to display
certain network adapters that were unviewable and not usable as boot
and install devices after a microcode update. The problem network
adapter is still present and usable at the OS. The adapters with
this problem have the following feature codes: EN0A, EN0B, EN0H,
EN0J, EN0K, EN0L, EN15, EL5B, EL38, EL3C, EL56, and EL57.
- A problem was fixed for a Logical LAN (l-lan) device
failing to boot when there is a UDP packet checksum error. With
the fix, there is a new option when configuring a l-lan port in SMS to
enable or disable the UDP checksum validation. If the adapter is
already providing the checksum validation, then the l-lan port needs to
have its validation disabled.
- A problem was fixed for Hostboot error log IDs (EID)
getting reused from one IPL to the next, resulting in error logs
getting suppressed (missing) for new problems on the subsequent
IPLs if they have a re-used EID that was already present in the service
processor error logs.
- A problem was fixed for error log truncation with SRC
B1818A12 logged for the error. This problem occurs only rarely
when creating a combined error log entry that exceeds the error log
entry maximum size. With the fix, these types of combinations are
not done if too large, and two error logs are written instead.
- A problem was fixed for coherent accelerator processor
proxy (CAPP) unit errors being called out as CEC hardware
Subsystem instead of PROCESSOR_UNIT.
- A problem was fixed for a Self Boot Engine (SBE)
recoverable error at runtime causing the system to go into Safe Mode.
- A problem was fixed for an IPL that ends with the HMC in
the "Incomplete" state with SRCs B182951C and A7001151 logged.
Partitions may start and can continue to run without the HMC services
available. In order to recover the HMC session, a re-IPL of
the system is needed (however, partition workloads could continue
running uninterrupted until the system is intentionally re-IPLed at a
scheduled time). This problem occurs very rarely.
- A problem was fixed for a system failure with SRC B700F103
that can occur if a shared-mode SR-IOV adapter is moved from a
high-performance slot to a lower performance slot. This
problem can be avoided by disabling shared mode on the SR-IOV adapter;
moving the adapter; and then re-enabling shared mode.
- A problem was fixed for a rare Live Partition Mobility
migration hang with the partition left in VPM (Virtual Page Mode) which
causes performance concerns. This error is triggered by a
migration failover operation occurring during the migration state of
"Suspended" and there has to be insufficient VASI buffers available to
clear all partition state data waiting to be sent to the migration
target. Migration failovers are rare and the migration state of
"Suspended" is a migration state lasting only a few seconds for most
partitions, so this problem should not be frequent. On the HMC,
there will be an inability to complete either a migration stop or a
recovery operation. The HMC will show the partition as migrating
and any attempt to change that will fail. The system must be
re-IPLed to recover from the problem.
- A problem was fixed for Linux or AIX partitions crashing
during a firmware assisted dump or when using Linux kexec to restart
with a new kernel. This problem was more frequent for the Linux
OS with kdump failing with "Kernel panic - not syncing: Attempted to
kill init" in some cases.
- A problem was fixed for a SR-IOV adapter vNIC configuration
error that did not provide a proper SRC to help resolve the issue of
the boot device not pinging in SMS due to maximum transmission unit
(MTU) size mismatch in the configuration. The use of a vNIC
backing device does not allow configuring VFs for jumbo frames when the
Partition Firmware configuration for the adapter (as specified on the
HMC) does not support jumbo frames. When this happens, the vNIC
adapter will fail to ping in SMS and thus cannot be used as a boot
device. With the fix, the vNIC driver configuration code is
now checking the vNIC login (open) return code so it can issue an SRC
when the open fails for a MTU issue (such as jumbo frame
mismatch) or for some other reason. A jumbo frame is an Ethernet
frame with a payload greater than the standard MTU of 1,500 bytes and
can be as large as 9,000 bytes.
- A problem was fixed for the USB port having the wrong
location code assigned. The "P1-T4-L1 USB DVD R/RW or RAM Drive"
location code should be "P1-T3-L1". The USB DVD still works
correctly but reported location codes such as in error logs will
have the wrong location code shown.
This problem only pertains to IBM Power System models S914(9009-41A),
S924(9009-42A), and H924 for SAP HANA (9223-42H).
- A problem was fixed for SR-IOV adapter dumps hanging with
low-level EEH events causing failures on VFs of other non-target SR-IOV
adapters.
- A problem was fixed for preventing loss of function on an
SR-IOV adapter with an 8MB adapter firmware image if it is placed into
SR-IOV shared mode. The 8MB image is not supported at the
FW910.20 firmware level. With the fix, the adapter with the 8MB
image is rejected with an error without an attempt to load the older
4MB image on the adapter which could damage it. This problem
affects the following SR-IOV adapters: #EC2R/#EC2S with CCIN
58FA; #EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for adapters in slots attached to a PLX
(PCIe switch) failing with SRCs B7006970 and BA188002 when a
second and subsequent errors on the PLX failed to initiate PLX
recovery. For this infrequent problem to occur, it requires a
second error on the PLX after recovery from the first error.
- A problem was fixed for an intermittent IPL failure with
SRCs B150BA40 and B181BA24 logged. The system can be
recovered by IPLing again. The failure is caused by a memory
buffer misalignment, so it represents a transient fault that should
occur only rarely.
- A problem was fixed for system termination for a re-IPL
with power on with SRC B181E540 logged. The system can be
recovered by powering off and then IPLing. This problem occurs
infrequently and can be avoided by powering off the system between IPLs.
System firmware changes that affect certain systems
- On a system with a Cloud Management Console and a HMC Cloud
Connector, a problem was fixed for memory leaks in the Redfish server
causing Out of Memory (OOM) resets of the service processor.
- On a system with an IBM i partition, a problem was fixed
for a DLPAR force-remove of a physical IO adapter from an IBM i
partition and a simultaneous power off of the partition causing the
partition to hang during the power off. To recover the partition
from the error, the system must be re-IPLed. This problem is rare
because there is only a 2-second timing window for the DLPAR and power
off to interfere with each other.
- For systems with a shared memory partition, a problem
was fixed for Live Partition Mobility (LPM) migration hang after a
Mover Service Partition (MSP) failover in the early part of the
migration. To recover from the hang, a migration stop command
must be given on the HMC. Then the migration can be retried.
- For a shared memory partition, a problem was fixed
for Live Partition Mobility (LPM) migration failure to an indeterminate
state. This can occur if the Mover Service Partition (MSP)
has a failover that occurs when the migrating partition is in the state
of "Suspended." To recover from this problem, the partition must
be shut down and restarted.
- On a system with an AMS partition, a problem was fixed for
a Live Partition Mobility (LPM) migration failure when migrating from
P9 to a pre-FW860 P8 or P7 system. This failure can occur if the
P9 partition is in dedicated memory mode, and the Physical Page Table
(PPT) ratio is explicitly set on the HMC (rather than keeping the
default value) and the partition is then transitioned to Active Memory
Sharing (AMS) mode prior to the migration to the older system.
This problem can be avoided by using dedicated memory in the partition
being migrated back to the older system.
- On a system with an active IBM i partition, a problem was
fixed for a SPCN firmware download to the PCIe3 I/O expansion drawer
(feature #EMX0) Chassis Management Card (CMC) that could possibly get
stuck in a pending state. This failure is very unlikely as it
would require a concurrent replacement of the CMC card that is loaded
with a SPCN level that is older than 2015 (01MEX151012a). The
failure with the SPCN download can be corrected by a re-IPL of the
system.
|
VL910_107_089 / FW910.10
09/05/18 |
Impact: Availability
Severity: SPE
System firmware changes that may require customer actions
prior to the firmware update
- DEFERRED: On
a system with a partition with dedicated processors that are set to
allow processor sharing with "Allow when partition is active" or "Allow
always", a problem was fixed for a concurrent firmware update
from FW910.01 that may cause the system to hang. This fix is
deferred, so it is not active until after the next IPL of the system,
so precautions must be taken to protect the system. Perform the
following steps to determine if your system has a partition with
dedicated processors that are set to share. If these partitions
exist, change them to not share processors while active; or shut down
the affected partitions; or do a disruptive update to put on this
service pack.
1) From the HMC command line, run: lssyscfg -r sys -F name
2) For each system whose firmware you intend to update, issue the following
HMC command:
lshwres -m <System Name> --level lpar -r proc -F lpar_name,curr_sharing_mode,pend_sharing_mode
replacing <System Name> with the name as displayed by the first
command.
3) Scan the output for "share_idle_procs_active" or
"share_idle_procs_always". This identifies the affected
partitions.
4) You need to take one of the three options below to install this
firmware level:
a) If affected partitions are found, change the lpar to "never allow" or
"allow when partition is inactive" on the lpar settings, and restore
the original value after the code update. These
changes are concurrent when performed on the lpar settings and not in
the profile.
b) Or, shut down partitions identified in step 3. Proceed
with concurrent code update. Then restart the partitions.
c) Or, apply the firmware update disruptively (power off system
and install) to prevent a possible system hang.
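Step 3 above can be scripted once the output of the lshwres command from step 2 has been captured. The following is a minimal Python sketch, not an IBM-provided tool. It assumes the output is one comma-separated line per partition, in the field order given to -F. The function name and the sample partition names are hypothetical.

```python
import csv
import io

# Sharing modes that step 3 names as identifying affected partitions.
AFFECTED_MODES = {"share_idle_procs_active", "share_idle_procs_always"}


def find_affected_partitions(lshwres_output):
    """Return partition names whose current or pending sharing mode is affected.

    Assumes each line has the form:
    lpar_name,curr_sharing_mode,pend_sharing_mode
    """
    affected = []
    for row in csv.reader(io.StringIO(lshwres_output)):
        if len(row) < 3:
            continue  # skip blank or malformed lines
        name, curr_mode, pend_mode = (field.strip() for field in row[:3])
        if curr_mode in AFFECTED_MODES or pend_mode in AFFECTED_MODES:
            affected.append(name)
    return affected


# Hypothetical captured output; the partition names are invented.
sample = """\
prod_db,share_idle_procs_active,share_idle_procs_active
test_lpar,keep_idle_procs,keep_idle_procs
batch01,share_idle_procs,share_idle_procs_always
web01,uncap,uncap
"""

print(find_affected_partitions(sample))
```

Any partition the function reports needs one of the three options in step 4 before the concurrent update. The pending mode is checked as well as the current mode, since step 3 says to scan the whole output.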
New features and functions
- A change was
made to improve IPL performance for a system with a new DIMM installed
or for a system doing its first IPL. The performance is gained by
decreasing the amount of time used in memory diagnostics, reducing IPL
time by as much as 15 minutes, depending on the amount of memory
installed.
- Support was added for 24x7 data collection from the On-Chip
Controller sensors.
- Support was added to correctly isolate TOD faults with
appropriate callouts and failover to the backup topology, if
needed. Support was also added to reconfigure a backup topology to
maintain TOD redundancy.
- Support was disabled for eRepair spare lane deployment for
fabric and memory buses. By not using the FRU spare hardware for
an eRepair, the affected FRUs may have to be replaced sooner.
Prior to this change, the spare lane deployment caused extra error
messages during runtime diagnostics. When the problems with spare
lane deployment are corrected, this eRepair feature will be enabled
again in a future service pack.
System firmware changes that affect all systems
- A security problem was fixed in the DHCP client on the
service processor for an out-of-bound memory access flaw that could be
used by a malicious DHCP server to crash the DHCP client process.
The Common Vulnerabilities and Exposures issue number is CVE-2018-5732.
- DEFERRED: A
problem was fixed for PCIe link stability errors during the IPL for the
PCIe3 I/O Expansion Drawer (Feature code #EMX0) with Active Optical
Cables (AOCs). One or more of the following SRCs may be logged at
completion of IPL: B7006A72, B7006A8B, B7006971, and 10007900.
The fix improves PCIe link stability for this feature.
- DEFERRED: A
problem was fixed for an erroneous SRC
11007610 being logged when hot-swapping CEC fans. This SRC may be
logged if there is more than a two-minute delay between removing
the old fan and installing the new fan. The error log may be
ignored.
- DEFERRED: A
problem was fixed for a hot plug
of a new 1400W power supply that fails to turn on. The
problem is intermittent, occurring more frequently for the cases where
the hot plug insertion action was too slow and maybe at a slight angle
(insertion not perfectly straight). Without the fix, after
a hot plug has been attempted, ensure the power supply LEDs are
on. If the LEDs are not on, retry the plug of the power
supply using a faster motion while keeping the angle of insertion
straight.
- DEFERRED: A
problem was fixed for a host reset of the
Self Boot Engine (SBE). Without the fix, the reset of the SBE
will hang during error recovery and that will force the system into
Safe Mode. Also, a post dump IPL of the system after a
system Terminate Immediate will not work with a hung SBE, so a re-IPL
of the system will be needed to recover it.
- A problem was fixed for an enclosure LED not being lit when
there is a fault on a FRU internal to the enclosure that does not have an LED
of its own. With the fix, the enclosure LED is lit if any FRUs
within the enclosure have a fault.
- A problem was fixed for DIMMs that have VPP shorted to
ground not being called out in the SRC 11002610 logged for the power
fault. The frequency of this problem should be rare.
- A problem was fixed for the Advanced System Management
Interface (ASMI) option for resetting the system to factory
configuration for not returning the Speculative Execution setting to
the default value. The reset to factory configuration does not
change the current value for Speculative Execution. To restore
the default, ASMI must be used manually to set the value. This
problem only pertains to the IBM Power System H922 for SAP HANA
(9223-22H) and the IBM Power System H924 for SAP HANA (9223-42H).
- A problem was fixed for the system early power warning
(EPOW) to be issued when only three of the four power supplies are
operational (instead of waiting for all four power supplies to go down).
- A problem was fixed for a failing VPP voltage regulator
possibly damaging DIMMs with too high a voltage level. With the
fix, the voltage to the DIMMs is shut down if there is a problem with
the voltage regulator, to protect the DIMMs.
- A problem was fixed for an unplanned power down of the
system with SRC UE 11002600 logged when an unsupported device was
plugged into the service processor USB 2.0 ports on either of the
slots P1-C1-T1 or P1-C1-T2. This happened when a USB 3.0
DVD drive was plugged into the USB 2.0 slot and caused an overcurrent
condition. The USB 3.0 device was incorrectly not downward
compatible with the USB 2.0 slot. With the fix, such incompatible
devices will cause an informational log but will not cause a power off
of the system.
- A problem was fixed for the On-Chip Controller being able
to sense the current draw for the 12V PCIE adapters that are plugged
into channel 0 (CH0) of the APSS. CH0 was not enabled meaning
anything plugged into those connectors would not be included in the
total server power calculation which could impact power capping.
The system could run at higher power than expected without CH0 being
monitored.
- A problem was fixed for the TPM card LED so that it is
activated correctly.
- A problem was fixed for VRMs drawing current over the
specification. This occurred whenever heavy work loads went above
372 amps with WOF enabled. At 372 amps, a rollover to value "0"
for the current erroneously occurred and this allowed the frequency of
the processors in the system to exceed the normally expected values.
- A problem was fixed for Dynamic Memory Deallocation (DMD)
failing for memory configurations of 3 or 6 Memory Controller (MC)
channels per group. An error message of "Invalid MCS per group
value" is logged with SRC BC23E504 for the problem. If DMD was
working correctly for the installed memory but then began failing at a
later time, it may have been triggered by a guard of a DIMM which
resulted in a memory configuration that is susceptible to the problem
with DMD.
- A problem was fixed for a system with CPU part number
2CY058 and CCIN 5C25 to achieve a slightly more optimum frequency
for one specific EnergyScale Mode, Dynamic Performance Mode.
- A problem was fixed for a missing memory throttle
initialization that in a rare case could lead to an emergency shutdown
of the system. The missing initialization could cause the DIMMs
to oversubscribe to the power supplies in the rare failure mode where
the On-Chip Controller (OCC) fails to start and the Safe Mode default
memory throttle values are too high to stop the memory from overusing
the power from the power supplies. This could cause a power fault
and an emergency shutdown of the system.
- A problem was fixed for a memory translation error that
causes a request for a page of memory to be de-allocated to be
ignored in Dynamic Memory Deallocation (DMD). This misses the
opportunity to proactively relocate a partition to good memory and
running on bad memory may eventually cause a crash of the partition.
- A problem was fixed for an extraneous error log with
SRC BC50050A that has no deconfigured FRU. There was a recovered
error for a single bit in memory that requires no user action.
The BC50050A error log should be ignored.
- A problem was fixed for Hostboot error logs reusing
EID numbers for each IPL. This may cause a predictive error log
to go missing for a bad FRU that is guarded during the IPL. If
this happens, the FRU should be replaced based on the presence of the
guard record.
- A problem was fixed for a rare non-correctable memory
error in the service processor Self Boot Engine (SBE) causing a
Terminate Immediate (TI) for the system instead of recovering from the
error. With the fix, the SBE is working such that all SBE errors
are recoverable and do not affect the system work loads. This SBE
memory provides support for On-Chip Controller (OCC) tasks to the
service processor SBE but it is not related to the system memory used
for the hypervisor and host partition tasks.
- A problem was fixed for extraneous Predictive Error
logs of SRC B181DA96 and SRC BC8A1A39 being logged if the Self Boot
Engine (SBE) halts and restarts when the system host OS is
running. These error logs can be ignored as the SBE
recovers without user intervention.
- A problem was fixed for error logging for rare Low
Pin Count (LPC) link errors between the Host processor and the Self
Boot Engine (SBE). The LPC was changed to timeout instead of
hanging on a LPC error, providing helpful debug data for the LPC error
instead of system checkstop and Hostboot crash.
- A problem was fixed for the reset of the Self Boot
Engine (SBE) at run time to resolve SBE errors without impacting
the hypervisor or the running partitions.
- A problem was fixed for the ODL link in Open CAPI in
the case where ODL Link 1 (ODL1) is used and ODL Link 0 (ODL0) is not
used. As a circumvention, the errors are resolved if ODL 0 is
used instead, or in conjunction with the ODL1.
- A problem was fixed for the wrong DIMM being called out on
an over-temperature error with a SRC B1xx2A30 error log.
- A problem was fixed for adding a non-cable PCIe card
into a slot that was previously occupied by a PCIe3 Optical or Copper
Cable Adapter for the PCIe3 Expansion Drawer.
The new PCIe card could fail with an I2C error with SRC BC100706
logged.
- A problem was fixed for call home data for On-Chip
Controller (OCC) error log sensor data being off in alignment by one
sensor. By visually shifting the data, the valid data values can
still be determined from the call home logs.
- A problem was fixed for slow hardware dumps that include
failed processor cores that have no clock signal. The dump
process was waiting for core responses and had to wait for a time-out
for each chip operation, causing dumps to take several hours.
With the fix, the core is checked for a proper clock, and if one does
not exist, the chip operations to that core are skipped to speed up the
hardware dump process significantly.
- A problem was fixed for ipmitool not being able to set the
system power limit when the power limit is not activated with the
standard option. With the fix, the ipmitool user can
activate the power limit with "dcmi power activate" and then set the power
limit with "dcmi power set_limit xxxx", where "xxxx" is the new
power limit in Watts.
- A problem was fixed for the OBUS to make it OpenCAPI
capable by increasing its frequency from 1563 MHz to 1611 MHz.
- A problem was fixed for a Workload Optimized Frequency
(WOF) reset limit failure not providing an Unrecoverable Error (UE) and
a callout for the problem processor. When the WOF reset limit is
reached and failed, WOF is disabled and the system is not running at
optimized frequencies.
- A problem was fixed for the callout of SRC BA188002 so it
does not display three extra trailing garbage characters in the location
code for the FRU. The string is correct up to the line ending
white space, so the three extra characters after that should be
ignored. This problem is intermittent and does not occur for all
BA188002 error logs.
- A problem was fixed for the callout of scan ring failures
with SRC BC8A285E and SRC BC8A2857 logged but with no callout for the
bad FRU.
- A problem was fixed for the On-Chip Controller (OCC)
possibly timing out and going to Safe Mode when a system is changed
from the default maximum performance mode (Workload Optimized Frequency
(WOF) enabled) to nominal mode (WOF disabled) and then back to maximum
performance (WOF enabled again). Normal performance can be
recovered with a re-IPL of the system.
- A problem was fixed for the periodic guard reminder causing
a reset/reload of the service processor when it found a symbolic FRU
with no CCIN value in the list of guarded FRUs for the
system. Each time the periodic guard reminder runs,
every 30 days by default, this problem can cause recoverable errors on
the service processor but with no interruption to the workloads on the
running partitions.
- A problem was fixed for a wrong SubSystem being logged in
the SRC B7009xxxx for Secure Boot Errors. "I/O Subsystem" is
displayed instead of the correct SubSystem value of "System Hypervisor
Firmware".
- A problem was fixed for the lost recovery of a failed Self
Boot Engine (SBE). This may happen if the SBE recovery occurs
during a reset of the service processor. Not only is the recovery
lost, but the error log data for the SBE failure may also not be
written to the error log. If the SBE is failed and not recovered,
this can cause the post-dump IPL after a system Terminate
Immediate (TI) error to not be able to complete. To recover,
power off the system and IPL again.
- A problem was fixed for a missing SRC at the time runtime
diagnostics are lost and the Hostboot runtime services (HBRT) are put
into the failed state.
A B400F104 SRC is logged each time the HBRT hypervisor adjunct
crashes. On the fourth crash in one hour, HBRT is failed with no
further retries but no SRC is logged. Although a unique SRC is
not logged to indicate loss of runtime diagnostic capability, the
B400F104 SRC does include the HBRT adjunct partition ID for Service to
identify the adjunct.
- A problem was fixed for a Novalink enabled partition not
being able to release master from the HMC that results in error
HSCLB95B. To resolve the issue, run a rebuild managed server
operation on the HMC and then retry the release. This occurs when
attempting to release master from HMC after the first boot up of a
Novalink enabled partition if Master Mode was enforced prior to the
boot.
- A problem was fixed for a UE memory error causing an
entire LMB of memory to deallocate and guard instead of just one page
of memory.
- A problem was fixed for all variants (this was partially
fixed in an earlier release) for the SR-IOV firmware adapter updates
using the HMC GUI or CLI to only reboot one SR-IOV adapter at a
time. If multiple adapters are updated at the same time, the HMC
error message HSCF0241E may occur: "HSCF0241E Could not read
firmware information from SR-IOV device ...". This fix prevents
the system network from being disrupted by the SR-IOV adapter updates
when redundant configurations are being used for the network. The
problem can be circumvented by using the HMC GUI to update the SR-IOV
firmware one adapter at a time using the following steps: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
- A problem was fixed for a rare hypervisor hang caused by a
dispatching deadlock for two threads of a process. The system
hangs with SRC B17BE434 and SRC B182951C logged. This
failure requires high interrupt activity on a program thread that is
not currently dispatched.
- A problem was fixed for a Virtual Network Interface
Controller (vNIC) client adapter to prevent a failover when disabling
the adapter from the HMC. A failover to a new backing device
could cause the client adapter to erroneously appear to be active again
when it is actually disabled. This causes confusion and failures
on the OS for the device driver. This problem can only occur when
there is more than a single backing device for the vNIC adapter and if
commands are issued from the HMC to disable the adapter and enable
the adapter.
- A possible performance problem was fixed for workloads that
have a large memory footprint.
- A problem was fixed for error recovery in the timebase
facility to prevent an error in the system time. This is an
infrequent secondary error when the timebase facility has failed
and needs recovery.
- A problem was fixed for the HMC GUI and CLI interfaces
incorrectly showing SR-IOV updates as being available for certain
SR-IOV adapters when no updates are
available. This affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC. The "Update
Available" indication in the HMC can be ignored if updates have already
been applied.
- A problem was fixed for the recovery of certain SR-IOV
adapters that fail with SRC B400FF05. This is
triggered by infrequent EEH errors in the adapter. In the
recovery process, the Virtual Function (VF) for the adapter
is rebuilt into the wrong state, preventing the adapter from
working. An HMC initiated disruptive resource dump of the adapter
can recover it. This problem affects the following PCIe
adapters: #EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN
58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV Virtual Functions (VFs)
halting transmission with an SRC B400FF01 logged when many logical
partitions with VFs are shut down at the same time the adapter is in
highly active usage by a workload.  The recovery process reboots
the failed SR-IOV adapter, so no user intervention is needed to restore
the VF.
- A problem was fixed for VLAN-tagged frames
being transmitted over SR-IOV adapter VFs when the packets should
instead have been discarded for some VF configuration settings on
certain SR-IOV adapters. This affects the following PCIe
adapters:
#EC2R/#EC2S with CCIN 58FA; #EC2T/#EC2U with CCIN 58FB; and
#EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for SR-IOV adapter hangs with a
possible SRC B400FF01 logged. This may cause a temporary network
outage while the SR-IOV adapter VF reboots to recover from the adapter
hang. This problem has been observed on systems with high
network traffic and with many VFs defined.
This fix updates adapter firmware to 1x.22.4021 for the
following Feature Codes: EC2R, EC2S, EC2T, EC2U, EC3L and EC3M.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for a large number (approximately
16,000) of DLPAR adds and removes of SR-IOV VFs
causing a subsequent DLPAR add of the VF to fail with the
newly added VF not usable.  The large number of
allocations and deallocations caused a leak of a critical SR-IOV
adapter resource. The adapter and VFs may be recovered by an
SR-IOV adapter reset.
- A problem was fixed for a system boot hanging when
recoverable attentions occur on the non-master processor. With
the fix, the attentions on the non-master processor are deferred until
Symmetric multiprocessing (SMP) mode has been established (the point at
which the system is ready for multiple processors to run). This
allows the boot to complete but still have the non-master processor
errors recovered as needed.
- A problem was fixed for certain hypervisor error logs being
slow to report to the OS.  The error logs affected are those
created by the hypervisor immediately after the hypervisor is started,
when there are more than 128 error logs from the hypervisor to be
reported.  The error logs at the end of the queue take a long time
to be processed, and may make it appear as if error logs are not being
reported to the OS.
- A problem was fixed for a Self Boot Engine (SBE) reset
causing the On-Chip Controller (OCC) to force the system into
Safe Mode with a flood of SRC B150DAA0 and SRC B150DA8A written to the
error log as Information Events.
- A problem was fixed for the Redfish "Manager" request
returning duplicate object URIs for the same HMC. This can occur
if the HMC was removed from the managed system and then later added
back in. The Redfish objects for the earlier instances of the
same HMC were never deleted on the remove.
- A problem was fixed for a possible failure to the service
processor stop state when performing a platform dump. This
problem is specific to dumps being collected for HWPROC
checkstops, which are not common.
- A problem was fixed for SMS menus to limit reporting on the
NPIV and vSCSI configuration to the first 511 LUNs. Without the
fix, LUN 512 through the last configured LUN report with invalid
data. Configurations in excess of 511 LUNs are very rare, and it
is recommended for performance reasons (to be able to search for the boot
LUN more quickly) that the number of LUNs on a single target be
limited to less than 512.
- The following two errors in the SR-IOV adapter firmware
were fixed: 1) The adapter resets and there is a B400FF01
reference code logged.  This error happens in rare cases when there
are multiple partitions actively running traffic through the adapter.
System firmware resets the adapter and recovers the system with no
user intervention required; 2) SR-IOV VFs with defined VLANs and an
assigned PVID are not able to ping each other.
This fix updates adapter firmware to 11.2.211.26 for the following
Feature Codes: EN15, EN17, EN0H, EN0J, EN0M, EN0N, EN0K, EN0L, EL38,
EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/en/POWER9/p9efd/p9efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for Field Core Override (FCO) cores
being allocated from a deconfigured processor, causing an IPL failure
with unusable cores. This problem only occurs during the Hostboot
reconfiguration loop in the presence of other processor failures.
- A problem was fixed for a failure in DDR4 RCD (Register
Clock Driver) memory initialization that causes half of the DIMM memory
to be unusable after an IPL. This is an intermittent problem
where the memory can sometimes be recovered by doing another IPL.
The error is not a hardware problem with the DIMM but it is an error in
the initialization sequence needed to get the DIMM ready for normal
operations.  This supersedes an earlier fix delivered in FW910.01
that intermittently failed to correct the problem.
- A problem was fixed for IBM Product Engineering and Support
personnel not being able to easily determine planar jumper settings in
a machine in order to determine the best mitigation strategies for
various field problems that may occur. With the fix, an
Information Error log is provided on every IPL to provide the planar
jumper settings.
- A problem was fixed for the periodic guard reminder
function to not re-post error logs of failed FRUs on each IPL.
Instead, a reminder SRC is created to call home the list of FRUs that
have failed and require service.  This returns the system to the
original behavior of only posting one error log for each FRU that has
failed.
- For an HMC-managed system, a problem was fixed for a rare,
intermittent NetsCMS core dump that could occur whenever the system is
doing a deferred shutdown power off. There is no impact to normal
operations as the power off completes, but there are extra error logs
with SRC B181EF88 and a service processor dump.
- A problem was fixed for a Hostboot hang due to a deadlock
that can occur if there is an SBE dump in progress that fails.  A
failure in the SBE dump can trigger a second SBE dump that deadlocks.
- A problem was fixed for dump performance by decreasing the
amount of time needed to perform dumps by 50%.
- A problem was fixed for an IPL hang that can occur for
certain rare processor errors, where the system is in a loop trying to
isolate the fault.
- A problem was fixed for an enclosure fault LED being stuck
on after a repair of a fan. This problem only occurs after the
second concurrent repair of a fan.
- A problem was fixed for SR-IOV adapters not showing up in
the device tree for a partition that autoboots or starts within a few
seconds of the hypervisor going ready.  This problem can be
circumvented by delaying the boot of the partition for at least a
minute after the hypervisor has reached the standby state.  If the
problem is encountered, the SR-IOV adapter can be recovered by
rebooting the partition, or by using DLPAR to remove and add the SR-IOV
adapter to the partition.
- A problem was fixed for a system crash with SRC B700F103
when there are many consecutive configuration changes in the LPARs to
delete old vNICs and create new vNICs, which exposed an infrequent
problem with lock ownership on a virtual I/O slot.  There is a
one-to-one mapping or connection between the vNIC adapter in the client
LPAR and the backing logical port in the VIOS, and the lock management
needs to ensure that the LPAR accessing the port has ownership of
it.  In this case, the LPAR was trying to make usable a device it
did not own.  The system should recover on the post-dump IPL.
- A problem was fixed for a possible DLPAR add failure of a
PCIe adapter if the adapter is in the planar slot C7 or slot C6 on any
PCIe Expansion drawer fanout module. The problem is more common
if there are other devices or Virtual Functions (VFs) in the same LPAR
that use four interrupts, as this is a problem with the processing
order of the PCIe LSI interrupts.
- A problem was fixed for resource dumps that use the
selector "iomfnm" and options "rioinfo" or "dumpbainfo". This
combination of options for resource dumps always fails without the fix.
- A problem was fixed for missing FFDC data for SR-IOV
Virtual Function (VF) failures and for not allowing the full
architected five minute limit for a recovery attempt for the VF, which
should expand the number of cases where the VF can be recovered.
- A problem was fixed for missing error recovery for memory
errors in non-mirrored memory when reading the SR-IOV adapter firmware,
which could prevent the SR-IOV VF from booting.
- A problem was fixed for a possible system crash if an error
occurs at runtime that requires a FRU guard action. With the fix,
the guard action is restricted to the IPL where it is supported.
- A problem was fixed for an extremely rare IPL hang on a
false communications error to the power supply. Recovery is to
retry the IPL.
- A problem was fixed for the dump content type for HBMEM
(Hostboot memory) to be recognized instead of displaying "Dump Content
Type: not found".
- A problem was fixed for a system crash when an SR-IOV
adapter is changed from dedicated to shared mode with SRC B700FFF and
SRC B150DA73 logged. This failure requires that hypervisor
memory relocation be in progress on the system. This affects the
following PCIe adapters: #EC2R/#EC2S with CCIN 58FA;
#EC2T/#EC2U with CCIN 58FB; and #EC3L/#EC3M with CCIN 2CEC.
- A problem was fixed for a Live Partition Mobility (LPM)
migration of a partition with shared processors, where an unusable
shared processor can result in failure of the target partition or
target system.  This problem can be avoided by making sure all
shared processors are functional in the source partition before
starting the migration. The target partition or system can be
rebooted to recover it.
- A problem was fixed for hypervisor memory relocation and
Dynamic DMA Window (DDW) memory allocation used by I/O adapter slots
for some adapters where the DDW memory tables may not be fully
initialized between uses. Infrequently, this can cause an
internal failure in the hypervisor when moving the DDW memory for the
adapters. Examples of programs using memory relocation are
Live Partition Mobility (LPM) and the Dynamic Platform Optimizer (DPO).
- A problem was fixed for a partition or system termination
that may occur when shutting down or deleting a partition on a system
with a very large number of partitions (more than 400) or on a system
with fewer partitions but with a very large number of virtual adapters
configured.
- A problem was fixed for booting a large number of
LPARs with Virtual Trusted Platform Module (vTPM) capability, where some
partitions may post an SRC BA54504D time-out for taking too long to
start.  With the fix, the time allowed to boot a vTPM LPAR is
increased. If a time-out occurs, the partition can be booted
again to recover. The problem can be avoided by auto-starting
fewer vTPM LPARs, or booting them a couple at a time to prevent
flooding the vTPM device server with requests that will slow the boot
time while the LPARs wait on the vTPM device server responses.
- A problem was fixed for a possible system crash.
- A problem was fixed for a UE B1812D62 logged when a PCI
card is removed between system IPLs. This error log can be
ignored.
- A problem was fixed for a USB code update failure if the USB
stick is plugged in during an AC power cycle.  After the power cycle
completes, the code update will fail to start from the USB
device. As a circumvention, the USB device can be plugged in
after the service processor is in its ready state.
- A problem was fixed for a possible slower migration during
the Live Partition Mobility (LPM) resume stage. For a
migrating partition that does not have a high demand page rate, there
is minimal impact on performance. There is no need for customer
recovery as the migration completes successfully.
- A problem was fixed for firmware assisted dumps (fadump)
and Linux kernel crash dumps (kdump) where dump data is missing.
This can happen if the dumps are set up with chunks greater than 1
Gb in size. This problem can be avoided by setting up
fadump or kdump with multiple 1 Gb chunks.
- A problem was fixed for the I2C bus error logged with SRC
BC500705 and SRC BC8A0401 where the I2C bus was locked up. This
is an infrequent error. In rare cases, the TPM device may hold down the
I2C clock line longer than allowed, causing an error recovery that
times out and prevents the reset from working on all the I2C engine's
ports. A power off and power on of the system should clear the
bus error and allow the system to IPL.
- A problem was fixed for an intra-node, inter-processor
communication lane failure marked in the VPD, causing a secure boot
blacklist violation on the IPL and a processor to be deconfigured with
an SRC BC8A2813 logged.
- A problem was fixed to capture details of failed FRUs into
the dump data by delaying the deconfiguration of the FRUs for checkstop
and TI attentions.
- A problem was fixed for failed processor cores not being
guarded on a memory preserving IPL (re-IPL with CEC powered on).
- A problem was fixed for debug data missing in dumps for
cores which are off-line.
- A problem was fixed for L3 cache calling out an LRU Parity
error too quickly for hardware that is good. Without the fix,
ignore the L3FIR[28] LRU Parity errors unless they are very persistent
with 30 or more occurrences per day.
- A problem was fixed for not having a FRU callout when the
TPM card is missing and causes an IPL failure.
- A problem was fixed for the Advanced System Management
Interface (ASMI) displaying the IPv6 network prefix in decimal instead
of hex character values. The service processor command line
"ifconfig" can be used to see the IPv6 network prefix value in hex as a
circumvention to the problem.
- A problem was fixed for an On-Chip Controller (OCC) cache
fault causing a loss of the OCC for the system without the system
dropping into Safe mode.
- A problem was fixed for system dump failing to collect the
pu.perv SCOMs for chiplets c16 and above which correspond to EQ and EC
chiplets.
Also fixed was the missing SCOM data for the interrupt unit related
"c_err_rpt" registers.
- A problem was fixed for the PCIe topology reports having
slots missing in the "I/O Slot Locations" column in the row for the bus
representing a PCIe switch. This only occurs when the C49
or C50 slots are bifurcated (a slot having two channels).
Bifurcation is done if an NVME module is in the slot or if the slot is
empty (for certain levels of backplanes).
- A problem was fixed for Live Partition Mobility (LPM)
failing along with other hypervisor tasks, but the partitions continue
to run. This is an extremely rare failure where a re-IPL is
needed to restore HMC or Novalink connections to the partitions, or to
do any system configuration changes.
- A problem was fixed for a system termination during a
concurrent exchange of an SR-IOV adapter that had VFs assigned to
it.  For this problem, the OS failed to release the VFs but the
error was not returned to the HMC. With the fix, the FRU exchange
gracefully aborts without impacting the system for the case where the
VFs on the SR-IOV adapter remain active.
- A possible performance problem was fixed for partitions
with shared processors that had latency in the handling of the
escalation interrupts used to switch the processor between tasks.
The effect of this is that, while the processor is kept busy, some
tasks might hold the processor longer than they should because the
interrupt is delayed, while others run slower than normal.
- A problem was fixed for a system termination that may occur
with B111E504 logged when starting a partition on a system with a very
large number of partitions (more than 400) or on a system with fewer
partitions but with a very large number of virtual adapters configured.
- A problem was fixed for a system termination that may occur
with a B150DA73 logged when a memory UE is encountered in a partition
when the hypervisor touches the memory. With the fix, the touch
of memory by the hypervisor is a UE tolerant touch and the system is
able to continue running.
- A problem was fixed for fabric errors such as cable pulls
causing checkstops.  With the fix, the PBAFIR are changed to
recoverable attentions, allowing the OCC to be reset to recover from
such faults.
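The fadump/kdump circumvention described in the list above (setting up the dump with multiple chunks that do not exceed the 1 Gb size) can be sketched as a small planning helper.  This is only an illustration: the function name `split_into_chunks` is hypothetical, reading the limit as 2**30 bytes is an assumption, and the code does not configure fadump or kdump itself.

```python
# Hypothetical planning helper; not part of fadump, kdump, or the firmware.
CHUNK_LIMIT = 2**30  # assumed interpretation of the chunk size limit, in bytes


def split_into_chunks(total_bytes, limit=CHUNK_LIMIT):
    """Split a dump region of total_bytes into chunk sizes no larger than limit."""
    if total_bytes < 0 or limit <= 0:
        raise ValueError("total_bytes must be >= 0 and limit must be > 0")
    chunks = []
    remaining = total_bytes
    while remaining > 0:
        size = min(limit, remaining)  # never exceed the per-chunk limit
        chunks.append(size)
        remaining -= size
    return chunks


if __name__ == "__main__":
    # A region slightly larger than three full chunks yields four chunks.
    print(split_into_chunks(3 * CHUNK_LIMIT + 5))
```

Every chunk returned stays at or below the limit, which is the condition the circumvention relies on.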
System firmware changes that affect certain systems
- A problem was fixed to remove a SAS battery LED from ASMI
that does not exist. This problem only pertains to the
S914 (9009-41A), S924 (9009-42A) and H924 for SAP HANA (9223-42H) models.
- On a system with an AIX partition, a problem was
fixed for a partition time jump that could occur after doing an AIX
Live Update. This problem could occur if the AIX Live Update
happens after a Live Partition Mobility (LPM) migration to the
partition. AIX applications using the timebase facility could
observe a large jump forwards or backwards in the time reported by the
timebase facility. A circumvention to this problem is to
reboot the partition after the LPM operation prior to doing the AIX
Live Update. An AIX fix is also required to resolve this
problem. The issue will no longer occur when this firmware update
is applied on the system that is the target of the LPM operation and
the AIX partition performing the AIX Live Update has the appropriate
AIX updates installed prior to doing the AIX Live Update.
- On a Linux or IBM i partition which has just
completed a Live Partition Mobility (LPM) migration, a problem was
fixed for a VIO adapter hang when it stops processing interrupts.
For this problem to occur, prior to the migration the adapter must have
had an interrupt outstanding where the interrupt source was disabled.
- On systems with an IBM i partition, support was added
for multipliers for IBM i MATMATR fields that are limited to four
characters.  When retrieving Server metrics via IBM i MATMATR calls
on a system that contains more than 9999 GB, for example, MATMATR has
an architected "multiplier" field such that 10,000 GB can be
represented by 5,000 GB * Multiplier of 2, so '5000' and '2' are
returned in the quantity and multiplier fields, respectively, to handle
these extended values.  The IBM i OS also requires a PTF to support the
MATMATR field multipliers.
- On a system with an IBM i partition with more than 64 shared
processors assigned to it, a problem was fixed for a system
termination or other unexpected behavior that may occur during a
partition dump. Without the fix, the problem can be avoided by
limiting the IBM i partition to 64 or fewer shared processors.
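The MATMATR quantity/multiplier arithmetic described in the list above can be illustrated with a short sketch.  The function names and the rule for choosing the multiplier (the smallest integer that makes the quantity fit in four characters, with the quantity rounded down) are assumptions made for illustration; the release note only gives the 10,000 GB example and does not specify how the firmware selects the values.

```python
# Illustrative sketch only; not the firmware or IBM i implementation.
MAX_QUANTITY = 9999  # largest value that fits in a four-character field


def encode_metric(value_gb):
    """Return (quantity, multiplier) with quantity <= MAX_QUANTITY.

    Assumption: use the smallest integer multiplier that makes the
    quantity fit, and round the quantity down.
    """
    if value_gb < 0:
        raise ValueError("value_gb must be non-negative")
    multiplier = max(1, -(-value_gb // MAX_QUANTITY))  # ceiling division
    quantity = value_gb // multiplier
    return quantity, multiplier


def decode_metric(quantity, multiplier):
    """Recover the (possibly rounded-down) value from the two fields."""
    return quantity * multiplier


if __name__ == "__main__":
    print(encode_metric(10000))      # the example from the note: (5000, 2)
    print(decode_metric(5000, 2))    # 10000
```

Values of 9999 GB or less keep a multiplier of 1, so consumers that ignore the multiplier field see unchanged results for those values.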
|