SC830
For Impact, Severity and other Firmware definitions, please
refer to the 'Glossary of firmware terms' URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The complete Firmware Fix History for this Release Level can be
reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html
SC830_106_048 / FW830.50
04/27/17
Impact: Availability
Severity: SPE
New features and functions
- Support for the Advanced System Management Interface (ASMI)
was changed to allow the special characters of "I", "O", and "Q" to be
entered for the serial number of the I/O Enclosure under the Configure
I/O Enclosure option. These characters have only been found in an
IBM serial number rarely, so typing in these characters will normally
be an incorrect action. However, the special character entry is
not blocked by ASMI anymore so it is able to support the exception
case. Without the enhancement, the typing of one of the special
characters causes message "Invalid serial number" to be displayed.
- Support was added for the Universally Unique
IDentifier (UUID) property for each partition. The UUID provides
each partition with an identifier that is persisted by the platform
across partition reboots, reconfigurations, OS reinstalls, partition
migration, and hibernation.
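The relaxed serial number handling in the first item of this list can be pictured with a minimal sketch. The function name, the alphanumeric-only rule, and the warning text are illustrative assumptions and are not taken from ASMI; only the treatment of "I", "O", and "Q" (accepted, but unusual) follows the description.

```python
# Hypothetical sketch: accept "I", "O", and "Q" but flag them as unusual.
RARE_CHARS = set("IOQ")  # rarely found in IBM serial numbers, per the item above


def check_serial(serial: str) -> tuple[bool, list[str]]:
    """Return (accepted, messages) for an I/O enclosure serial number."""
    serial = serial.strip().upper()
    # Assumption: a serial number is a non-empty alphanumeric string.
    if not serial or not serial.isalnum():
        return False, ["Invalid serial number"]
    found = sorted(RARE_CHARS & set(serial))
    messages = []
    if found:
        # The entry is no longer blocked; it is only worth a second look.
        messages.append("unusual characters present: " + ", ".join(found))
    return True, messages
```

A caller might show the returned messages as a prompt to re-check the typed value instead of rejecting it.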
System firmware changes that affect all systems
- A problem was fixed for System Vital Product Data (SVPD) FRUs being guarded but not
having a corresponding error log entry. This is a failure to
commit the error log entry that has occurred only rarely.
- A problem was fixed for a system going into safe mode with
SRC B1502616 logged as informational without a call home
notification. Notification is needed because the system is
running with reduced performance. If there are unrecoverable
error logs and any are marked with reduced performance and the system
has not been rebooted, then the system is probably running in safe mode
with reduced performance. With the fix, SRC B1502616 is logged as an
Unrecoverable Error (UE).
- A problem was fixed for the PCIe3 Optical Cable Adapter for
the PCIe3 Expansion Drawer failing with SRC B7006A84 error logged
during the IPL. The failed cable adapter can be recovered by
using a concurrent repair operation to power it off and on.
Alternatively, the system can be re-IPLed to recover the cable adapter.
The affected optical cable adapters have feature codes #EJ05, #EJ06,
and #EJ08 with CCINs 2B1C, 6B52, and 2CE2, respectively.
- A problem was fixed for PCIe Host Bridge (PHB) outages and
PCIe adapter failures in the PCIe I/O expansion drawer caused by error
thresholds being exceeded for the LEM bit [21] errors in the FIR
accumulator. These are typically minor and expected errors in the
PHB that occur during adapter updates and do not warrant a reset
of the PHB or the resulting PCIe adapter failures. Therefore, the
LEM[21] error threshold has been increased and the LEM fatal
error has been changed to a Predictive Error to avoid the outages for
this condition.
- A problem was fixed to improve link stability for the PCIe3 I/O
expansion drawer (#EMX0). The settings for the continuous time
linear equalizers (CTLE) were updated for all the PCIe adapters for the
PCIe links to the expansion drawer. The CEC must be re-IPLed for
the fix to activate.
- The following problems were fixed for SR-IOV adapters:
1) Insufficient resources reported for an SR-IOV logical port configured
with promiscuous mode enabled and a Port VLAN ID (PVID) when creating
a new interface on the SR-IOV adapters.
2) Spontaneous dumps and reboots of the adjunct partition for SR-IOV
adapters.
3) Adapter enters a firmware loop when a single-bit ECC error is
detected. System firmware detects this condition as an adapter
command time out. System firmware will reset and restart the
adapter to recover the adapter functionality. This condition will
be reported as a temporary adapter hardware failure.
4) vNIC interfaces not being deleted correctly, causing SRC
B400FF01 to be logged and Data Storage Interrupt (DSI) errors with
failure on boot of the LPAR.
This set of fixes updates adapter firmware to 10.2.252.1926, for the
following Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EN0M,
EN0N, EN0K, EN0L, EL38, EL3C, EL56, and EL57.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for Live Partition Mobility (LPM)
migrations from FW860.10 or FW860.11 to older levels of firmware.
Subsequent DLPAR of Virtual Adapters will fail with HMC error
message HSCL294C, which contains text similar to the following:
"0931-007 You have specified an invalid drc_name." This issue affects
partitions installed with AIX 7.2 TL 1 and later. Partitions installed
with VIOS, IBM i, or earlier levels of AIX are not affected.
- A problem was fixed for incorrect callouts of the Power
Management Controller (PMC) hardware with SRC B1112AC4 and SRC
B1112AB2 logged. These extra callouts occur when the On-Chip
Controller (OCC) has placed the system in the Safe mode state for a
prior failure that is the real problem that needs to be resolved.
- A problem was fixed for a failure in launching the Advanced
System Management Interface (ASMI) from the HMC local console for the
HMC levels of V8R8.3.0 SP2 and V8R8.4.0 SP1. There was a frozen
window displayed instead of the ASMI login panel. A
circumvention to the problem is to connect to ASMI from a remote
browser session.
- A problem was fixed for the Advanced System Management
Interface (ASMI) "System Service Aids => Error/Event Logs" panel not
showing the "Clear" and "Show" log options and also having a truncated
error log when there are a large number of error logs on the system.
- A problem was fixed for sporadic blinking amber LEDs for
the system fans with no SRCs logged. There was no problem with
the fans. The LED corruption occurred when two service processor
tasks attempted to update the LED state at the same time. The fan
LEDs can be recovered to a normal state concurrently using the
following link steps for a soft reset of the service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for hardware dumps only collecting data
for the master processor if a run-time service processor failover had
occurred prior to the dump. Therefore, there would be only master
chip and master core data in the event of a core unit checkstop.
To recover to a system state that is able to do a full collection of
debug data for all processors and cores after a run-time failover, a
re-IPL of the system is needed.
- A problem was fixed for the loss of Operations Panel
function 30 (displaying ethernet port HMC1 and HMC2 IP addresses)
after a concurrent repair of the Operations Panel.
Operations Panel function 30 can be restored concurrently using
the following link steps for a soft reset of the service
processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm
- A problem was fixed for the service processor boot
watch-dog timer expiring too soon during DRAM initialization in the
reset/reload, causing the service processor to go unresponsive.
On systems with a single service processor, the SRC B1817212 was
displayed on the control panel. For systems with redundant
service processors, the failing service processor was
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the reset/reloads of the service processor
will work correctly to restore the service processor to a normal
operating state.
- A problem was fixed for host-initiated resets of the
service processor causing the system to terminate. A prior fix
for this problem did not work correctly because some of the
host-initiated resets were being translated to unknown reset types that
caused the system to terminate. With this new correction for
failed host-initiated resets, the service processor will still be
unresponsive but the system and partitions will continue to run.
On systems with a single service processor, the SRC B1817212 will be
displayed on the control panel. For systems with redundant
service processors, the failing service processor will be
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. This problem is intermittent and
very infrequent as most of the host-initiated resets of the service
processor will work correctly to restore the service processor to a
normal operating state.
- A problem was fixed for incorrect error messages from the
Advanced System Management Interface (ASMI) functions when the system
is powered on but in the "Incomplete State". For this
condition, ASMI was assuming the system was powered off because it
could not communicate to the PowerVM hypervisor. With the fix,
the ASMI error messages will indicate that ASMI functions have failed
because of the bad hypervisor connection instead of falsely stating
that the system is powered off.
- A problem was fixed for a single node failure on a
multi-node system preventing an IPL. The error occurred if
Hostboot hung on a node and timed out without calling out problem
hardware. With the fix, a service processor failover is used to
IPL on an alternate path to recover from the error. And an error
log has been added for the IPL timeout for the node with SRC B111BAAB
and a callout for the master processor and PNOR.
- A problem was fixed for the System Attention LED failing to
light for an error failover for the redundant service processors with a
SRC B1812028 logged.
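The rule of thumb given in the SRC B1502616 item of this list (unrecoverable error logs, at least one marked with reduced performance, and no reboot since) can be written as a small check. This is only a sketch: the ErrorLogEntry record and its field names are hypothetical and do not correspond to any real error log format or API.

```python
from dataclasses import dataclass


@dataclass
class ErrorLogEntry:
    """Hypothetical, simplified view of one error log entry."""
    src: str                      # e.g. "B1502616"
    severity: str                 # assumed labels: "unrecoverable", "informational", ...
    reduced_performance: bool     # entry is marked with reduced performance
    logged_since_last_boot: bool  # True if the system has not been rebooted since


def probably_in_safe_mode(entries) -> bool:
    """Apply the heuristic from the release note to a list of entries."""
    return any(
        e.severity == "unrecoverable"
        and e.reduced_performance
        and e.logged_since_last_boot
        for e in entries
    )
```

The result is only an indication, as the note itself says "probably"; the service processor command line check described for Safe mode later in this level's notes is the direct way to confirm it.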
System firmware changes that affect certain systems
- On systems with PCIe adapters in Single Root I/O
Virtualization (SR-IOV) shared mode, a problem was fixed for the
hypervisor SR-IOV adjunct partition failing during the IPL with SRCs
B200F011 and B2009014 logged. The SR-IOV adjunct partition successfully
recovers after it reboots and the system is operational.
- On systems with maximum memory configurations (where every
DIMM slot is populated - size of DIMM does not matter), a problem
has been fixed for systems losing performance and going into Safe mode
(a power mode with reduced processor frequencies intended to protect
the system from over-heating and excessive power consumption) with
B1xx2AC3/B1xx2AC4 SRCs logged. This happened because of
On-Chip Controller (OCC) time out errors when collecting Analog Power
Subsystem Sweep (APSS) data, used by the OCC to tune the processor
frequency. This problem occurs more frequently on systems that
are running heavy workloads. Recovery from Safe mode back to
normal performance can be done with a re-IPL of the system, or
concurrently using the following link steps for a soft reset of the
service processor: https://www.ibm.com/support/knowledgecenter/POWER8/p8hby/p8hby_softreset.htm.
Checking that Safe mode is not active on the system requires a
dynamic celogin password from IBM Support to use the service
processor command line:
1) Log in to ASMI as celogin with the dynamic celogin password
generated by IBM Support
2) Select System Service Aids
3) Select Service Processor Command Line
4) Enter "tmgtclient --query_mode_and_function" from the command line
The first line of the output, "currSysPwrMode", should say "NOMINAL".
This means the system is in normal mode and that Safe mode is not
active.
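If the output of step 4 is saved as text, the final check can be scripted. The following is a minimal sketch, assuming only what is stated above: the first line of the output mentions "currSysPwrMode" and should contain "NOMINAL". The exact layout of the tmgtclient output is not documented here, so the function makes no further assumptions about it.

```python
def safe_mode_not_active(output: str) -> bool:
    """Return True if the first output line reports currSysPwrMode as NOMINAL.

    `output` is the saved text of "tmgtclient --query_mode_and_function".
    Only the first line is inspected, as described in the steps above.
    """
    lines = output.strip().splitlines()
    if not lines:
        return False  # no output captured, so nothing can be confirmed
    first = lines[0]
    return "currSysPwrMode" in first and "NOMINAL" in first
```

A False result does not by itself prove that Safe mode is active; it only means the expected "NOMINAL" indication was not found on the first line.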
SC830_101_048 / FW830.40
12/08/16
Impact: Availability
Severity: ATT
New features and functions
- Support for the Advanced System Management Interface (ASMI)
was changed to not create VPD deconfiguration records and call home
alerts for hardware FRUs that have one VPD chip of a redundant pair
broken or inaccessible. The backup VPD chip for the FRU allows
continued use of the hardware resource. The notification of the
need for service for the FRU VPD is not provided until both of the
redundant VPD chips have failed for a FRU.
- Support was added for systems to be able to automatically
convert permanently activated resources (processor and memory) to
Mobile CoD resources for use in a Power Enterprise Pool (PEP).
The ability to do a CoD resource license conversion requires a minimum
HMC level of V8R8.4.0 or later. More information on how to use
a PEP for a group of systems to share Mobile Capacity on Demand
(CoD) processor resources and memory resources can be found in the IBM
Knowledge Center at the following link: https://www.ibm.com/support/knowledgecenter/HW4M4/p8ha2/systempool_cod.htm
System firmware changes that affect all systems
- A problem was fixed for an infrequent IPL hang and termination that can occur if the
backup clock card is failing. The following SRCs may be logged
with this termination: B1813450, B181460B, B181BA07, B181E6C7 and
B181E6F1. If the IPL error occurs, the system can be re-IPLed to
recover from the problem.
- A problem was fixed for an infrequent service processor
failover hang that results in a reset of the backup service processor
that is trying to become the new primary. This error occurs more
often on a failover to a backup service processor that has been in that
role for a long period of time (many months). This error can
cause a concurrent firmware update to fail. To reduce the chance
of a firmware update failure because of a bad failover, an
Administrative Failover (AFO) can be requested from the HMC prior to
the start of the firmware update. When the AFO has completed, the
firmware update can be started as normally done.
- A problem was fixed for the loss of the setting for the
disable of a periodic notification for a call home error log after a
failover to the backup service processor on a redundant service
processor system. The call home for the presence of a failed
resource can get re-enabled (if manually disabled in ASMI on the
primary service processor) after a concurrent firmware update or any
scenario that causes the service processor to fail over and change
roles. With the fix, the periodic notification flag is
synchronized between the service processors when the flag value is
changed.
- A problem was fixed for On-Chip Controller (OCC) errors
that had excessive callouts for processor FRUs. Many of the OCC
errors are recoverable and do not require that the processor be called
out and guarded. With the fix, the processors will only be called
out for OCC errors if there are three or more OCC failures during a
time period of a week.
- A problem was fixed for an Operations Panel Function 04
(Lamp test) during an IPL causing the IPL to fail. With the fix,
the lamp test request is rejected during the IPL until the hypervisor
is available. The lamp test can be requested without problems
anytime after the system is powered on to hypervisor ready or an OS is
running in a partition.
- A problem was fixed for a 3.3V power fault on the primary
system clock card causing a failover to the backup clock without an
error log and a call out for the primary clock card. This clock
card is part of a redundant set in the System Control Unit with CCIN
6B49.
- A problem was fixed for a Phase-Locked Loop (PLL) unlock
error on the backup clock card by using spread spectrum to maintain the
phase-locked loop for the clock frequency. This technique was
already in use for the primary clock card. The PLL unlock error
is rare in the backup clock for the Power systems but it has been seen
more frequently for the same part in other IBM systems. This
clock card is part of a redundant set in the System Control Unit with
CCIN 6B49.
- A problem was fixed for infrequent VPD cache read failures
during an IPL causing an unnecessary guarding of DIMMs with SRC
B123A80F logged. With the fix, VPD cache read failures cause a
temporary deconfiguration of the associated DIMM but the DIMM is
recovered on the next IPL.
- A problem was fixed for extra resources being assigned in a
Power Enterprise Pool (PEP). This only occurs if all of
these things happen:
o Power server is in a PEP pool
o Power server has PEP resources assigned to it
o Power server is powered down
o User uses HMC to 'remove' resources from the powered-down
server
o Power server is then restarted. It should come up with no
PEP resources, but it starts up and shows it still is using PEP
resources it should not have.
To recover from this problem, the HMC 'remove' of the PEP resources
from the server can be performed again.
- A problem was fixed for a Live Partition Mobility
(LPM) error where the target partition migration fails with an
HSCLB98C error. The frequency of this error can be moderate with
source partitions that have a vNIC resource but extremely low if the
source partition does not have a vNIC resource. The failure
originates at the VIOS VF level, so recovery from this error may need a
re-IPL of the system to regain full use of the vNIC resources.
- A problem was fixed for a latency time of about 2 seconds
being added to a target Live Partition Mobility (LPM) migration system
when there is a latency time check failure. With the fix, in the
case of a latency time check failure, a much smaller default latency is
used instead of two seconds. This error would not be noticed if
the customer system is using an NTP time server to maintain the time.
- A problem was fixed for a system dump post-dump IPL that
resulted in adjunct partition errors of SRC BA54504D, B7005191, and
BA220020 when they could not be created due to false space
constraints. These adjunct partition failures will prevent normal
operations of the hypervisor such as creating new partitions, so a
power off and power on of the system is needed to recover it. If
the customer system is experiencing this error (only some systems will
be impacted), it is expected to occur for each system dump post-dump
IPL until the fix is applied.
- A problem was fixed for a shared processor pool partition
showing an incorrect zero "Available Pool Processor" (APP) value after
a concurrent firmware update. The zero APP value means that no
idle cycles are present in the shared processor pool but in this case
it stays zero even when idle cycles are available. This value can
be displayed using the AIX "lparstat" command. If this problem is
encountered, the partitions in the affected shared processor pool can
be dynamically moved to a different shared processor pool. Before
the dynamic move, the "uncapped" partitions should be changed to
"capped" to avoid a system hang. The old affected pool would continue
to have the APP error until the system is re-IPLed.
- A rare problem was fixed for a system hang that can
occur when dynamically moving "uncapped" partitions to a
different shared processor pool. To prevent a system hang, the
"uncapped" partitions should be changed to "capped" before doing the
move.
- A problem was fixed for a DLPAR add of the USB 3.0 adapter
(#EC45 and #EC46) to an AIX partition where the adapter could not be
configured with the AIX "cfgmgr" command that fails with EEH
errors and an outstanding illegal DMA transaction. The trigger
for the problem is the DLPAR add operation of the USB 3.0 adapter that
has a USB External Dock (#EU04) and RDX Removable Disk Drives attached,
or a USB 3.0 adapter that has a flash drive attached. The PCI
slot can be powered off and on to recover the USB 3.0 adapter.
- A problem was fixed for network issues, causing critical
situations for customers, when an SR-IOV logical port or vNIC is
configured with a non-zero Port VLAN ID (PVID). This fix updates
adapter firmware to 10.2.252.1922, for the following Feature Codes:
EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N, EN0K, EN0L, and
EL3C.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm
Note: Adapters that are capable of running in SR-IOV mode, but are
currently running in dedicated mode and assigned to a partition, can be
updated concurrently either by the OS that owns the adapter or the
managing HMC (if OS is AIX or VIOS and RMC is running).
- A problem was fixed for a failed IPL with SRC UE BC8A090F
that does not have a hardware callout or a guard of the failing
hardware. The system may be recovered by guarding out the
processor associated with the error and re-IPLing the system.
With the fix, the bad processor core is guarded and the system is able
to IPL.
- A problem was fixed for the On-Chip Controller (OCC)
incorrectly calling out processors with SRC B1112A16 for L4 Cache DIMM
failures with SRC B124E504. This false error logging can occur if
the DIMM slot that is failing is adjacent to two unoccupied DIMM slots.
- A problem was fixed for host-initiated resets of the
service processor that can cause the service processor to
terminate. In this state, the service processor will be
unresponsive but the system and partitions will continue to run.
On systems with a single service processor, the SRC B1817212 will be
displayed on the control panel. For systems with redundant
service processors, the failing service processor will be
deconfigured. To recover the failed service processor, the system
will need to be powered off with AC power removed during a regularly
scheduled system service action. The problem is intermittent and
very infrequent as most of the host-initiated resets of the service
processor will work correctly to restore the service processor to a
normal operating state.
- A problem was fixed for device time outs during an IPL
logged with a SRC B18138B4. This error is intermittent and no
action is needed for the error log. The service processor
hardware server has allotted more time for the device transactions to
allow the transactions to complete without a time-out error.
- A problem was fixed for cable card capable PCI slots
that fail during the IPL. Hypervisor I/O Bus Interface UE
B7006A84 is reported for each cable card capable PCI slot that
doesn't contain a PCIe3 Optical Cable Adapter for the PCIe Expansion
Drawer (feature code #EJ05). PCI slots containing a cable card
will not report an error but will not be functional. The problem
can be resolved by performing an AC cycle of the system. The
trigger for the failure is that the I2C devices used to detect the cable
cards are not coming out of the power on reset process in the correct
state due to a race condition.
- A problem was fixed with SR-IOV adapter error recovery
where the adapter is left in a failed state in nested error cases for
some adapter errors. The probability of this occurring is very
low since the problem trigger is multiple low-level adapter
failures. With the fix, the adapter is recovered and returned to
an operational state.
- A problem was fixed for setting the disable of a
periodic notification for a call home error log SRC B150F138 for Memory
Buffer resources (membuf) from the Advanced System Management Interface
(ASMI).
- A problem was fixed for a blank SRC in the LPA dump for
user-initiated non-disruptive adjunct dumps. The SRC is needed
for problem determination and dump analysis.
- A problem was fixed for a missing processor FRU callout for
SRC BC8A0307 for a node deconfiguration during the IPL. The
failing SCM is now provided on the callout when this error occurs
during the IPL. This callout allows the guard of the
failing processor to occur so that the IPL is successful.
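The OCC callout rule from this list (processors are called out only if there are three or more OCC failures during a time period of a week) can be illustrated with a short sketch. The names below are hypothetical, and the sliding seven-day window is an assumption; the release note does not say whether the firmware uses a sliding or a fixed window.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(days=7)  # "a time period of a week"
THRESHOLD = 3               # "three or more OCC failures"


def should_call_out(failure_times, now: datetime) -> bool:
    """Return True if enough OCC failures fall within the last week.

    `failure_times` is an iterable of datetime values for one processor.
    Failures older than WINDOW (or dated in the future) are ignored.
    """
    recent = [t for t in failure_times if timedelta(0) <= now - t <= WINDOW]
    return len(recent) >= THRESHOLD
```

With this rule, isolated recoverable OCC errors no longer lead to a processor being called out and guarded, while a repeating failure still does.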
System firmware changes that affect certain systems
- On systems using the PowerVM hypervisor firmware and
NovaLink, a problem was fixed for a NovaLink installation error where
the hypervisor was unable to get the maximum logical memory buffer
(LMB) size from the service processor. The maximum supported LMB
size should be 0xFFFFFFFF but in some cases it was initialized to a
value that was less than the amount of configured memory, causing the
service processor read failure with error code 0X00000134.
- On systems that have an attached HMC, a problem was
fixed for a Live Partition Mobility migration that resulted in the
source managed system going to the Hardware Management Console (HMC)
Incomplete state after the migration to the target system was
completed. This problem is very rare and has only been detected
once. The problem trigger is that the source partition does not halt
execution after the migration to the target system. The HMC
went to the Incomplete state for the source managed system when it
failed to delete the source partition because the partition would not
stop running. When this problem occurred, the customer network
was running very slowly and this may have contributed to the
failure. The recovery action is to re-IPL the source system but
that will need to be done without the assistance of the HMC. First,
shut down each partition that has an OS running on the source system
from the OS. Then from the Advanced System
Management Interface (ASMI), power off the managed system.
Alternatively, the system power button may also be used to do the power
off. If the HMC Incomplete state persists after the power off,
the managed system should be rebuilt from the HMC. For more
information on HMC recovery steps, refer to this IBM Knowledge Center
link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
- On systems that have an attached HMC, a problem was
fixed for a Live Partition Mobility migration that resulted in a system
hang when an EEH error occurred simultaneously with a request for a
page migration operation. The HMC shows an Incomplete
state for the managed system with reference code A181D000. The
recovery action is to re-IPL the source system but that will need to be
done without the assistance of the HMC. From the Advanced System
Management Interface (ASMI), power off the managed system.
Alternatively, the system power button may also be used to do the power
off. If the HMC Incomplete state persists after the power off,
the managed system should be rebuilt from the HMC. For more
information on HMC recovery steps, refer to this IBM Knowledge Center
link: https://www.ibm.com/support/knowledgecenter/en/POWER7/p7eav/aremanagedsystemstate_incomplete.htm
SC830_097_048 / FW830.30
08/24/16
Impact: Availability
Severity: SPE
New features and functions
- The certificate store on the service processor has been
upgraded to include the changes contained in version 2.6 of the CA
certificate list published by the Mozilla Foundation at the mozilla.org
website as part of the Network Security Services (NSS) version 3.21.
- Support was added to the Advanced System Management
Interface (ASMI) for the Intelligent Platform Management Interface (IPMI)
to be able to change the IPMI password. On the "Login
Profile/Change Password" menu, a user ID of "IPMI" can be
selected. Changing the password for IPMI changes the password for
the default IPMI user ID. IPMI is not a user ID for logging into
ASMI. The IPMI function on the service processor can be accessed
using the "ipmitool" tool from a client system that has a network
connection to the service processor.
- Support was added to protect the service processor from
booting on a level of firmware that is below the minimum MIF
level. If this is detected, a SRC B18130A0 is logged. A
disruptive firmware update would then need to be done to the minimum
firmware level or higher. This new support has no effect on the
system being updated with the service pack but has been put in place to
provide an enhanced firmware level for the IBM field stock service
processors.
- Support was added for the Stevens6+ option of the internal
tray loading DVD-ROM drive with F/C #EU13. This is an 8X/24X(max)
Slimline SATA DVD-ROM Drive. The Stevens6+ option is a FRU
hardware replacement for the Stevens3+. MTM 7226-1U3
(Oliver) FC 5757/5762/5763 attaches to IBM Power Systems and
lists Stevens6+ as optional for Stevens3+. If the Stevens6+
DVD drive is installed on the system without the required firmware
support, the boot of an AIX partition will fail when the DVD is used as
the load source. Also, an IBM i partition cannot consistently
boot from the DVD drive using D-mode IPL. A SRC C2004130 may be
logged for the load source not found error.
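For the IPMI item in this list, a client-side script might assemble an "ipmitool" command line as sketched below. The helper name, the "lanplus" interface choice, and the "chassis status" subcommand are assumptions for illustration; the release note does not state which interface or subcommands the service processor accepts. The -I, -H, -U, and -E options are standard ipmitool options; -E makes ipmitool read the password from the IPMI_PASSWORD environment variable so that it does not appear on the command line.

```python
def build_ipmitool_command(host, subcommand=("chassis", "status"),
                           interface="lanplus", user=None):
    """Build an argument list for ipmitool (hypothetical helper).

    host:       address of the service processor (assumed reachable over the network)
    subcommand: ipmitool subcommand words; the default is an assumption
    interface:  ipmitool interface name; "lanplus" is an assumption
    user:       optional user ID; omitted when the default IPMI user ID is used
    """
    argv = ["ipmitool", "-I", interface, "-H", host]
    if user:
        argv += ["-U", user]
    argv.append("-E")  # password is taken from the IPMI_PASSWORD environment variable
    argv += list(subcommand)
    return argv
```

The returned list can be passed to subprocess.run() on a client that has ipmitool installed and IPMI_PASSWORD set to the password configured in ASMI.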
System firmware changes that affect all systems
- DEFERRED: A performance improvement was made by disabling the
Hot/Cold Affinity (HCA) hardware feature which gathers memory usage
statistics for consumption by partition operating system memory
management algorithms. The statistics gathering can, in rare cases,
cause performance to degrade. The workloads that may experience issues
are memory-intensive workloads that have little locality of reference
and thus cannot take advantage of hardware memory cache. As a
consequence, the problem occurs very infrequently or not at all except
for very specific workloads in an HPC environment. This performance fix
requires an IPL of the system to activate it after it is applied.
- A problem was fixed for the service processor going to the
reset state instead of the termination state when the anchor card is
missing or broken. At the termination state, the Advanced System
Management Interface (ASMI) can be used to collect failure data and
debug the problem with the anchor card.
- A problem was fixed for error log entries created by
Hostboot not getting written to the error log in some situations.
This can cause hardware detected as failed by Hostboot to not get
reported or have a call-home generated. This problem will occur
whenever Hostboot commits a recovered or informational error as its
last error log in the current IPL. In the next IPL, one or
more error logs from Hostboot will be lost.
- A problem was fixed for the Hardware Management Console
(HMC) "chpwrmgmt" command not providing a meaningful error message when
used to try to enable an invalid power saver mode of
"dynamic_favor_power" on the 9119-MME or 9119-MHE models. This
power saver mode is not available on these models but the error message
issued was "HSCL1400 An error has occurred during the operation to the
managed system. Try the task again." The following is the
corrected error message: "HSCL1402 This operation failed due to
the following reasons: HSCL02F3 The managed system does not support the
specified power saver mode."
- A problem was fixed for the health monitoring of the NVRAM
and DRAM in the service processor that had been disabled. The
monitoring has been re-established and early warnings of service
processor memory failure are logged with one of the following Predictive
Error SRCs: B151F107, B151F109, B151F10A, or B151F10D.
- A problem was fixed for an incorrect date in
partitions created with a Simplified Remote Restart-Capable (SRR)
attribute where the date is created as Epoch 01/01/1970
(MM/DD/YYYY). Without the fix, the user must change the partition
time of day when starting the partition for the first time to make it
correct. This problem only occurs with SRR partitions.
- A problem was fixed for hypervisor task failures in adjunct
partitions with a SRC B7000602 reported in the error log. These
failures occur during adjunct partition reboots for concurrent firmware
updates but are extremely rare and require a re-IPL of the system to
recover from the task failure. The adjunct partitions may be
associated with the VIOS or I/O virtualization for the physical
adapters such as done for SR-IOV.
- A problem was fixed for a shortened "Grace Period" for "Out
of Compliance" users of a Power Enterprise Pool (PEP). The
"Grace Period" is short by one hour, so the user has one less hour to
resolve compliance issues before the HMC disallows any more borrowing
of PEP resources. For example, if the "Grace Period" should have
been 48 hours as shown in the "Out of Compliance" message, it really is
47 hours in the hypervisor firmware. The borrowing of PEP
resources is not a common usage scenario. It is most often found
in Live Partition Mobility (LPM) migrations where PEP resources are
borrowed from the source server and loaned to the target server.
- A problem was fixed for an AIX or Linux partition failing
with a SRC B2008105 LP 00005 on a re-IPL after a dump (firmware
assisted or error generated dump) following a Live Partition Mobility
(LPM) migration operation. The problem does not occur if the
migrated partition completes a normal IPL after the migration.
- A problem was fixed for intermittent long delays in the NX
co-processor for asynchronous requests such as NX 842
compressions. This problem was observed for AIX DB2 when it was
doing hardware-accelerated compressions of data but could occur on any
asynchronous request to the NX co-processor.
- A problem was fixed for transmit time-outs on a Virtual
Function (VF) during stressful network traffic, on systems using PCIe
adapters in Single Root I/O Virtualization (SR-IOV) shared-mode.
This fix updates adapter firmware to 10.2.252.1918, for the following
Feature Codes: EN15, EN16, EN17, EN18, EN0H, EN0J, EL38, EN0M, EN0N,
EN0K, EN0L, and EL3C.
The SR-IOV adapter firmware level update for the shared-mode adapters
happens under user control to prevent unexpected temporary outages on
the adapters. A system reboot will update all SR-IOV shared-mode
adapters with the new firmware level. In addition, when an
adapter is first set to SR-IOV shared mode, the adapter firmware is
updated to the latest level available with the system firmware (and it
is also updated automatically during maintenance operations, such as
when the adapter is stopped or replaced). And lastly, selective
manual updates of the SR-IOV adapters can be performed using the
Hardware Management Console (HMC). To selectively update the
adapter firmware, follow the steps given at the IBM Knowledge Center
for using HMC to make the updates: https://www.ibm.com/support/knowledgecenter/HW4M4/p8efd/p8efd_updating_sriov_firmware.htm.
Note: Adapters that are capable of running in SR-IOV mode, but
are currently running in dedicated mode and assigned to a partition,
can only be updated concurrently by the OS that owns the adapter.
- A security problem was fixed in OpenSSL for a possible
service processor reset on a null pointer de-reference during SSL
certificate management. The Common Vulnerabilities and Exposures issue
number is CVE-2016-0797.
- A problem was fixed for missing dumps for service
processor failures during firmware updates.
- A problem was fixed for a service processor failure during
a system power off that causes a reset of the service processor.
The service processor is in the correct state for a normal system power
on after the error. The frequency for this error should be low as
it is caused by a very rare race condition in the power off process.
- A problem was fixed for a processor hang where the error
recovery was not guarding the failing processor. The failure
causes a SRC B111E540 to be logged with Signature Description of
"ex(n0p3c1) (COREFIR[55]) NEST_HANG_DETECT: External Hang
detected". With the fix, the failing processor FRU is called out
and guarded so that the error does not re-occur when the system is
re-IPLed.
- A problem was fixed for a sequence of two or more Live
Partition Mobility migrations that caused a partition to crash with a
SRC BA330000 logged (Memory allocation error in partition
firmware). The sequence of LPM migrations that can trigger the
partition crash is as follows:
The original source partition level can be any FW760.xx, FW763.xx,
FW770.xx, FW773.xx, FW780.xx, or FW783.xx P7 level or any FW810.xx,
FW820.xx, FW830.xx, or FW840.xx P8 level. It is migrated first to
a system running one of the following levels:
1) FW730.70 or later 730 firmware or
2) FW740.60 or later 740 firmware
And then a second migration is needed to a system running one of the
following levels:
1) FW760.00 - FW760.20 or
2) FW770.00 - FW770.10
The twice-migrated system partition is now susceptible to the BA330000
partition crash during normal operations until the partition is
rebooted. If an additional LPM migration is done to any firmware
level, the thrice-migrated partition is also susceptible to the
partition crash until it is rebooted.
With the fix applied, the susceptible partitions may still log multiple
BA330000 errors but there will be no partition crash. A reboot of
the partition will stop the logging of the BA330000 SRC.
- A problem was fixed for the Advanced System Management
Interface "Network Services/Network Configuration" "Reset Network
Configuration" button that was not resetting the static routes to the
default factory setting. The manufacturing default is to have no
static routes defined so the fix clears any static routes that had been
added. A circumvention to the problem is to use the ASMI "Network
Services/Network Configuration/Static Route Configuration" "Delete"
button before resetting the network configuration.
- A problem was fixed for a partial callout for a failed
SPIVID (Serial Peripheral Interface Voltage Identification) interface
on the power supply VRM (Voltage Regulator Module). The SPIVID
interface allows the processor to control its external voltage
supply level, but if it fails, only the processor FRU (SCM) is called
out and not the VRM.
The system IPL will complete with a CEC drawer deconfigured. The
error log will only contain the processor but not the defective
processor VRM. Hostboot does not detect a SPIVID error, but fails
on a SCOM operation to the processor chip. The errors show up
with SRC BCxx090F logged by Hostboot and word 7 containing one of
three error values for a SPIVID_SLAVE_PART callout:
1) RC_SBE_SET_VID_TIMEOUT = 0x005ec1b2
2) RC_SBE_SPIVID_STATUS_ERROR = 0x00902aac
3) RC_SBE_SPIVID_WRITE_RETURN_STATUS_ERROR = 0x0045d3cd with HWP Error
description : "Procedure: proc_sbe_setup_evid SPIVID Device did not
return good status the Boot Voltage Write operation" and HWSV RC of
BA24.
Without the fix, replace both the identified SCM and the associated VRM.
- A problem was fixed for the HMC Exchange FRU procedure for
DVD drive with MTM 7226-1U3 and feature codes 5757/5762/5763 where it
did not verify the DVD drive was plugged in at the end of the exchange
procedure. Without the fix, the user must manually verify
that the DVD drive is plugged in.
- A problem was fixed for the Advanced System Management
Interface (ASMI) incorrectly showing the Anchor card as guarded
whenever any redundant VPD chip is guarded.
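For the SRC BA330000 item above, the susceptible Live Partition Mobility sequence can be written as a small predicate. This is only a sketch of the rule as the fix description states it: the function names and the idea of passing the migration history as a list of level strings are assumptions made for illustration, since the firmware does not expose such a history in this form.

```python
import re

# Release families named in the fix description for the original source system.
SOURCE_FAMILIES = {760, 763, 770, 773, 780, 783, 810, 820, 830, 840}


def parse_level(level):
    """Split a level string such as 'FW760.20' into (release, service_pack)."""
    match = re.fullmatch(r"FW(\d{3})\.(\d{2})", level)
    if match is None:
        raise ValueError(f"unrecognized firmware level: {level!r}")
    return int(match.group(1)), int(match.group(2))


def is_susceptible(levels):
    """Return True if the first three hops match the described sequence.

    `levels` (hypothetical input) lists the firmware level of each system
    the partition has run on, in order, with no partition reboot in between.
    Later hops are ignored because a further migration stays susceptible.
    """
    if len(levels) < 3:
        return False
    (src, _), (rel1, sp1), (rel2, sp2) = (parse_level(l) for l in levels[:3])
    # First migration: FW730.70 or later 730, or FW740.60 or later 740.
    first_hop = (rel1 == 730 and sp1 >= 70) or (rel1 == 740 and sp1 >= 60)
    # Second migration: FW760.00 - FW760.20, or FW770.00 - FW770.10.
    second_hop = (rel2 == 760 and sp2 <= 20) or (rel2 == 770 and sp2 <= 10)
    return src in SOURCE_FAMILIES and first_hop and second_hop
```

A partition that matches remains susceptible only until it is rebooted, so the list passed in should start again from the current system after any partition reboot.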
System firmware changes that affect certain systems
- A problem was fixed for the service processor recovery
from intermittent MAX31760 fan controller faults logged with SRC
B1504804. The fan controller faults caused an out of memory
condition on the service processor, forcing it to reset and failover to
the backup service processor with SRCs B181720D, B181E6E9, and
B182951C logged. With the fix, the fan controller faults are
handled without memory loss and the only SRC logged is B1504804 for
each fan controller fault.
- On systems with a PowerVM Active Memory Sharing (AMS)
partition with AIX Level 7.2.0.0 or later with Firmware Assisted
Dump enabled, a problem was fixed for a Restart Dump operation failing
into KDB mode. If "q" is entered to exit from KDB mode, the
partition fails to start. The AIX partition must be powered off
and back on to recover. The problem can be circumvented by
disabling Firmware Assisted Dump (default is enabled in AIX 7.2).
- For a system partition with more than 64 cores, a problem
was fixed for Live Partition Mobility (LPM) migration operations
failing with HSCL365C. The partition migration is stopped because
the platform detects a firmware error anytime the partition has more
than 64 cores.
|
SC830_093_048 / FW830.22
06/28/16 |
Impact:
Availability Severity: SPE
Critical firmware update for
FW830.21 (SC830_092) level systems
System IPLed with FW830.21:
A critical firmware update is required for all 9119-MME and 9119-MHE
systems that have been IPLed with FW830.21 (SC830_092). The FW830.21
level can cause a failed IPL or a potential unplanned outage. If the
server is already in production, then the customer should plan an outage at
a convenient time to apply FW830.22 (SC830_093) or higher and IPL.
System had FW830.21 concurrently
applied: If firmware level FW830.21 was concurrently
installed (i.e. system was NOT IPL'ed after installing the level)
customers are not impacted by this issue provided they apply FW830.22
(SC830_093) or higher prior to the next planned system reboot. NOTE:
FW830.22 can be applied concurrently.
System IPLed with any other
version of Firmware: If the current firmware level of the
system is not FW830.21, the system is not exposed to this issue.
Customers can install this level or later at the next scheduled update
window.
To verify the firmware level installed on the server, select “Updates”
from the left side of the HMC and place a check mark on the server of
interest. Then select “View system information” from the bottom view,
select “None - Display current values”. The Platform IPL Level will
indicate the last level the system was booted on.
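The three cases in the notice above reduce to a short decision procedure. The sketch below is illustrative only: the function name, the parameter names, and the wording of the returned strings are assumptions, and the two level values would have to be read from the HMC by hand as described above.

```python
AFFECTED_LEVEL = "FW830.21"


def fw830_21_action(platform_ipl_level, installed_level):
    """Map the two firmware levels to the guidance in the notice.

    platform_ipl_level: level the system was last booted (IPLed) on.
    installed_level: level currently installed, which can differ from the
    IPL level after a concurrent update (hypothetical parameter names).
    """
    if platform_ipl_level == AFFECTED_LEVEL:
        # Booted on the affected level: an IPL on a fixed level is required.
        return "plan an outage: apply FW830.22 or higher and IPL"
    if installed_level == AFFECTED_LEVEL:
        # Concurrently applied only: fix it before the next reboot.
        return "apply FW830.22 or higher concurrently before the next reboot"
    return "not exposed: update at the next scheduled window"
```

The IPL level is checked first because a system that was booted on FW830.21 needs an IPL on the fixed level even if a later level has since been installed concurrently.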
System firmware changes that affect all systems
- A problem was fixed
for an intermittent failure in Hostboot during the system IPL resulting
in SRCs BC70090F and BC8A1701 logged with a hardware procedure return
code of "RC_PROC_BUILD_SMP_ADU_STATUS_MISMATCH". The system
terminates with a Terminate Immediate (TI) condition. The system
must be re-IPLed to recover. The failure is very infrequent and
was caused by a race condition introduced as part of the clock card failure
data collection procedure, which has now been corrected.
|
SC830_092_048 / FW830.21
06/01/16 |
Impact:
Availability Severity: SPE
System firmware changes that affect all systems
- Support was added for additional First Failure Data Capture (FFDC)
data for processor clock failover errors by creating daily clock
status reports with
SRC B150CCDA informational error logs. This clock status SRC log
is written into the Hardware Management Console (HMC) iqyylog.log as a
platform error log (PEL) event. The PEL event contains a dump of
the clock registers. If a processor clock failover with SRC
B158CC62 occurs on the service processor, the iqyylog.log file on the
HMC should be collected to help debug the clock problem using the
B150CCDA data.
- A problem was fixed for a missing error log when a clock
card fails over to the backup clock card. This problem causes
loss of redundancy on the clock cards without a callout notification
that there is a problem with the FRU. If the fix is applied to a
system that had a failed clock, that condition will not be known until
the system is IPLed again, when an error log and callout of the clock
card will occur if it is in a persisted failed state.
- On systems using PowerVM firmware with dedicated processor
partitions, a problem was fixed for the dedicated processor
partition becoming intermittently unresponsive. The problem can be
circumvented by changing the partition to use shared processors.
This is a follow-on to the fix provided in 830.20 for a different issue
for delays in dedicated processor partitions that were caused by low
I/O utilization.
- A problem was fixed for a secondary clock card (CCIN 6B49)
failure on the system control unit (SCU) being called out as a local
clock card (CCIN 6B2D) failure on the node with SRC B158E504. For
this failure to occur, the primary clock card on the SCU must have been
previously failed and guarded.
|
SC830_086_048 / FW830.20
04/01/16 |
Impact:
Availability Severity: SPE
New features and functions
- Support was added to the Advanced System Management
Interface (ASMI) to be able to add an IPv4 static route definition for
each ethernet interface on the service processor. Using a static
route definition, a Hardware Management Console (HMC) configured
on a private subnet that is different from the service processor subnet
is now able to connect to the service processor and manage the
CEC. A static route persists until it is deleted or until the
service processor settings are restored to manufacturing
defaults. The static route is managed with the ASMI panel
"Network Services/Network Configuration/Static Route Configuration"
IPv4 radio button. The "Add" button is used to add a static route
(only one is allowed for each ethernet interface) and the "Delete"
button is used to delete the static route.
- Support was added to the Advanced System Management
Interface (ASMI) to display the environmental info section of error
logs in the "System Service Aids-> Error->Event logs"
panel. The following is an example of the information displayed:
|------------------------------------------------------
| Environmental Info
|------------------------------------------------------
| Section Version           : 1
| Sub-section type          : 0
| Created by                : powr
| Genesis Record Time-Stamp : 03/12/2015 15:31:21
| Genesis Corr-Resistance   : 4.687847
| Genesis Ambient-Temp(C)   : 28.000000
| Genesis Corrosion-Rate    : 0
|
| Corrosion Rate Status     : 1
| Presence of UsrDataSec    : 1
| Num Corrosion Readings    : 1
|
| Daily Corr-Resistance     : 4.804206
| Daily Ambient-Tempr(C)    : 35.312500
| Daily Corrosion-Rate      : 12C
|------------------------------------------------------
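If the environmental info section is copied out of the ASMI panel as text, its "name : value" rows can be turned into a dictionary for comparison between error logs. The following is a minimal sketch that assumes the rows look like the example above; the function name is hypothetical, and the sample repeats only a few rows from that example.

```python
def parse_env_info(text):
    """Parse 'name : value' rows of an Environmental Info section into a dict."""
    fields = {}
    for line in text.splitlines():
        line = line.lstrip("|").strip()
        # Skip blank rows, dashed rulers, and the section title.
        if not line or set(line) == {"-"} or ":" not in line:
            continue
        # Split on the first colon only: time stamps contain colons too.
        name, _, value = line.partition(":")
        fields[name.strip()] = value.strip()
    return fields


sample = """\
|------------------------------------------------------
| Environmental Info
|------------------------------------------------------
| Section Version           : 1
| Genesis Record Time-Stamp : 03/12/2015 15:31:21
| Genesis Ambient-Temp(C)   : 28.000000
| Daily Ambient-Tempr(C)    : 35.312500
|------------------------------------------------------
"""

info = parse_env_info(sample)
# Difference between the daily and the genesis ambient temperature readings.
rise = float(info["Daily Ambient-Tempr(C)"]) - float(info["Genesis Ambient-Temp(C)"])
```

Values are kept as strings because not every field is numeric (the daily corrosion rate in the example is "12C"), so callers convert only the fields they need.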
System firmware changes that affect all systems
- A problem was fixed
for a power fault on a single node with SRC 11002610 that terminates
the multi-node system. The problem can be circumvented by
unplugging the failing node and the system will IPL. With the
fix, the failing node is guarded on the power fault and the rest of the
system is able to IPL.
- A problem was fixed for Advanced System Management
Interface (ASMI) TTY to allow "admin" passwords to be greater than
eight characters in length to be consistent with prior generations of
the product. The ASMI web interface works correctly for user
"admin" passwords with no truncation in the length of the passwords.
- A problem was fixed for the recovery of a failing PCI clock
so that a failover to the backup PCI clock occurs without a node
failing and being deconfigured. Without the fix, the PCI clock
does not behave as a redundant FRU and faults on it will cause the CEC
to terminate. A re-IPL of the CEC recovers it from the PCI clock
error with the bad clock guarded so that the other PCI clock is used.
- A problem was fixed for an intermittent IPL failure with
SRC B181E6C7 for a deadlock condition when testing the clocks during
the IPL. The problem state can be recovered by doing another
IPL. The problem is triggered by an error in the IPL clock test
causing an interrupt handler to switch to the redundant clock and
deadlock. With the fix, the clock fault is handled and the bad
clock is guarded, with the IPL completing on the redundant clock.
- A problem was fixed for a system IPL hang at C100C1B0 with
SRC 1100D001 when the power supplies have failed to supply the
necessary 12-volt output for the system. The 1100D001 SRC
was calling out the planar when it should have called out the power
supplies. With the fix, the system will terminate as needed and
call out the power supply for replacement. One mode of power
supply failure that could trigger the hang is sync-FET failures that
disrupt the 12-volt output.
- A problem was fixed for recovery from PNOR flash memory
corruption that causes the IPL to fail with SRC D143900C. This is
very rare and only has happened in IBM internal labs. Without the
fix, the service processor cannot correct the corruption in the
PNOR. If a system has the problem SRC and cannot IPL, then
that system must be disruptively firmware updated to apply the fix to
be able to IPL again.
- A problem was fixed for a PCIe3 I/O expansion drawer
(#EMX0) not getting all error logs reported when its error log queue is
full. In the case where the error log queue is full with 16
entries, only one entry is returned to the hypervisor for
reporting. This error log truncation only occurs during periods
of high error activity in the expansion drawer.
- A problem was fixed for recovering from a misplug of the
service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the
plug locations are reversed from what would be a proper
connection. Without the fix, the bad FSI connections cause the
service processors to go to the service processor stop state.
With the fix applied, the error logs call out the bad cables so they
can be repaired and the service processor remains in a working state.
- A problem was fixed for hardware system dump collection
after a hardware checkstop that was missing scan ring data. This
is a very infrequent problem caused by an error with timing in the
multi-threaded dump collection process. Until this fix is
applied, the debug of some hardware dump problems may require doing
multiple dump collections to get all the data.
- A problem was fixed for an Advanced System Management
Interface (ASMI) error that occurred when trying to display detail on a
deconfigured Anchor Card VPD. If the error log for the selected
deconfiguration record had been deleted, it caused ASMI to core
dump. With the fix, if the error log for deconfiguration
record is missing, the error log details such as failing SRC for the
deconfiguration record are returned as blank.
- A problem was fixed for an On-Chip Controller error with
SRC B1702AC4 that was logged as unrecoverable without hardware
callouts. This occurred when the slave OCC failed to
receive any Analog Power Subsystem Sweep (APSS) data over a long time
interval. With the fix, if the OCC fails in the same manner, the
error is predictive with hardware callouts in the error log.
- A problem was fixed in the Advanced System Management
Interface (ASMI) for a FRU exchange of a DVD where the DVD was not
being powered off as needed for the exchange. The missing power
off of the FRU could cause a data read or write error if the DVD is in
use when the DVD is removed. With the fix, the ASMI deactivate
DVD button turns off the DVD green power LED during the exchange
procedure, so it is known when it is safe to continue with the exchange
procedure steps and remove the DVD.
- A problem was fixed so that error logs are now
generated for thermal errors detected by the service processor.
Without the fix, thermal errors such as a temperature over the
threshold will not get reported in the error log but higher fan speeds
will be present as an indicator of the thermal problem. Until the
fix is applied, the error log and call home mechanism cannot be relied
on to monitor for system thermal problems.
- A problem was fixed for processor core checkstops that
cause an LPAR outage but do not create hardware errors and service
events. The processor core is deconfigured correctly for the
error. This can happen if the hypervisor forces processor
checkstops in response to excessive processor recovery.
- A problem was fixed for the callout of a VPD collection
fault and system termination with SRC 11008402 to include the 1.2vcs
VRM FRU. The power good fault for the 1.2 volts would be a
primary cause of this error. Without the fix, the VRM is missing
in the callout list and only has the VPDPART isolation procedure.
- A problem was fixed for excessive logging of the SRC
11002610 on a power good (pgood) fault when detected by the Digital
Power Subsystem Sweep (DPSS). Multiple pgood interrupts are
signaled by the DPSS in the interval between the first pgood failure
and the node power down. A threshold was added to limit the
number of error logs for the condition.
- A problem was fixed for redundant logging of the SRC
B1504804 for a fan failure, once every five seconds. With the
fix, the failure is logged only at the initial time of failure in the
IPL.
- A problem was fixed to speed up recovery for VPD collection
time-out errors for PCIe resources in an I/O drawer logged with SRC
10009133 during concurrent firmware updates. With the fix, the
hypervisor is notified as soon as the VPD collection has finished so
the PCIe resources can report as available. Without the fix,
there is a delay as long as two hours for the recovery to complete.
- A problem was fixed for a false unrecoverable error (UE)
logged for B1822713 when an invalid cooling zone is found during the
adjustment of the system fan speeds. This error can be ignored as
it does not represent a problem with the fans.
- A problem was fixed for a processor clock failover error
with SRC B158CC62 calling out all processors instead of isolating to
the suspect processor. The callout priority correctly has a clock
and a procedure callout as the highest priority, and these should be
performed first to resolve the problem before moving on to the
processors.
- A problem was fixed for loss of back-level protection
during firmware updates if an anchor card has been replaced. The
Power system manufacturing process sets the minimum code level a system
is allowed to have for proper operation. If an anchor card is
replaced, it is possible that the replacement anchor card is one that
has the Minimum MIF Level (MinMifLevel) given as "blank", and
this removes the system back-level protection. With the fix, blanks or
nulls on the anchor card for this field are handled correctly to
preserve the back-level protection. Systems that have already
lost the back-level protection due to anchor card replacement remain
vulnerable to an accidental downgrade of code level by operator error,
so code updates to a lower level for these systems should only be
performed under guidance from IBM Support. The following command
can be run from the Advanced System Management Interface (ASMI) to
determine if the system has lost the back-level protection with the
presence of "blanks" or ASCII 20 values for MinMifLevel:
"registry -l cupd/MinMifLevel" with output:
"cupd/MinMifLevel:
2020202020202020 2020202020202020 [ ]
2020202020202020 2020202020202020 [ ]"
- A problem was fixed for a system checkstop caused by an L2
cache least-recently used (LRU) error that should have been a
recoverable error for the processor and the cache. The cache
error should not have caused an L2 HW CTL error checkstop.
- A problem was fixed that was setting the Update Access
Key (UAK) date to a corrupted value of "1900". The user
should correct the UAK date, if needed, to allow the firmware update to
proceed, by using the original UAK key for the system. On the
Management Console, enter the original update access key via the
"Enter COD Code" panel. Or on the Advanced System Management Interface
(ASMI), enter the original update access key via the "On Demand
Utilities/COD Activation" panel.
- A problem was fixed for PCIe switch recovery to prevent a
partition switch failure during the IPL with error logs for SRC
B7006A22 and B7006971 reported. This problem can occur when doing
recovery for an informational error on the switch. If this
problem occurs, the partition must be restarted to recover the affected
I/O adapters.
- A problem was fixed to correct the error messages for early
failures in the Live Partition Mobility (LPM) migration of a
partition. The management console might report an unrelated error
such as "HSCLA27E The operation to lock the physical device
location for target adapter" when the actual error might be not enough
available memory on the target CEC to run the migration. With the
fix, the correct error code is returned so there is enough information
to correct the error and retry the migration.
- A problem was fixed for a hypervisor task hang during a FRU
exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the
entire drawer to power off and power on again. The activation
phase for the power on may never complete if a very rare sequence of
events occurs during the power on step. The FRUs to exchange that
would cause the expansion drawer to power off and power on are
the following: midplane, I/O module, I/O module VRM, chassis
management card (CMC), cable card, and active optical cable.
- A problem was fixed for PCIe adapter hangs and network
traffic error recovery during Live Partition Mobility (LPM) and SR-IOV
vNIC (virtual ethernet adapter) operations. An error in the
PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all
subsequent network traffic through the PHB. The PHB error
recovery was enhanced to flush the PHB L3 cache to allow network
traffic to resume.
- A problem was fixed for a network boot/install failure
using bootp in a network with switches using the Spanning Tree Protocol
(STP). A network boot/install using lpar_netboot on the
management console was enhanced to allow the number of retries to be
increased. If the user is not using lpar_netboot, the number of
bootp retries can be increased using the SMS menus. If the SMS
menus are not an option, the STP in the switch can be set up to allow
packets to pass through while the switch is learning the network
configuration.
- A problem was fixed for a hypervisor adjunct partition
failing with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter
configuration. Without the fix, the system must be re-IPLed to
correct the adjunct error. This error is infrequent and can only
occur if an adapter port configuration is being changed at the same
time that error recovery is occurring for the adapter.
- A problem was fixed for recovering from FSI interrupt
overruns (too many FSI interrupts at one time that cause the service
processor to go interrupt-bound and get stuck in a loop) that caused
the service processor to go to a failed state with SRC B1817212 on
systems with a single service processor. On systems with
redundant service processors, the failed service processor would get
guarded with a B151E6D0 or B152E6D0 SRC depending on which service
processor fails. With the fix, the FSI interrupt generation is
reset if a threshold is exceeded, allowing the service processor to
continue normal processing. The failure trigger is a rare
hardware fault condition that does not persist in the service processor.
- A problem was fixed for priority callouts for system clock
card errors with SRC B158CC62. These errors had high priority
callouts for the system clock card and medium callouts for FRUs in the
clock path. With the fix, all callouts are set to medium priority
as the clock card is not the most probable FRU to have failed but is
just a candidate among the many FRUs along the clock path.
- A problem was fixed for a degraded PCI link causing a
processor core to be guarded if a non-cacheable unit (NCU) store
time-out occurred with SRC B113E540 and PRD signature
"(NCUFIR[9]) STORE_TIMEOUT: Store timed out on PB". With the fix,
the processor core is not guarded for the NCU error. If this
problem occurs and a core is deconfigured, clear the guard record and
re-IPL to regain the processor core. The solution for degraded
PCI links is different from the fix for this problem, but a re-IPL of
the CEC or a reset of the PCI adapters could help to recover the PCI
links from their degraded mode.
- A problem was fixed for an L2 cache error on the service
processor that caused the service processor to reset or go to a failed
state with SRC B1817212 on systems with a single service
processor. On systems with redundant service processors, the
failed service processor would get guarded with a B151E6D0 or B152E6D0
SRC depending on which service processor fails. With the fix, the
L2 cache error is handled with single-bit correction and no error to
the service processor, so it can continue normal processing. The
L2 cache data error that causes this fail is infrequent and the service
processor requires its limit of three resets in fifteen minutes to be
exceeded for the service processor to fail, so service processor
failure rate for this problem is low.
- A problem was fixed for an incorrect reduction in FRU
callouts for Processor Run-time Diagnostic (PRD) errors after a
reference oscillator clock (OSCC) error has been logged. Hardware
resources are not called out and guarded as expected. Some of the
missing PRD data can be found in the secondary SRC of B181BAF5 logged
by hardware services. The callouts that PRD would have made are
in the user data of that error log.
- A problem was fixed for error recovery from failed Live
Partition Mobility (LPM) migrations. The recovery error is caused
by a partition reset that leaves the partition in an unclean state with
the following consequences: 1) A retry on the migration for the
failed source partition may not be allowed; and 2) With enough
failed migration recovery errors, it is possible that any new migration
attempts for any partition will be denied. This error condition
can be cleared by a re-IPL of the system. The partition recovery error
after a failed migration is much more likely to occur for
partitions managed by NovaLink but it is still possible to occur for
Hardware Management Console (HMC) managed partitions.
- A problem was fixed for a Qualys network scan for security
vulnerabilities causing a core dump in the Intelligent Platform
Management Interface (IPMI) process on the service processor with
SRC B181EF88. The error occurs anytime the Qualys scan is run
because it sends an invalid IPMI session id that should have been
handled and discarded without a core dump.
- A security problem was fixed in the lighttpd server on the
service processor, where a remote attacker, while attempting
authentication, could insert strings into the lighttpd server log
file. Under normal operations on the service processor, this does
not impact anything because the log is disabled by default. The
Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
- A security problem was fixed in OpenSSL for a possible
service processor reset on a null pointer de-reference during RSA PSS
signature verification. The Common Vulnerabilities and Exposures issue
number is CVE-2015-3194.
- A problem was fixed to guard a failed processor core to
allow the system to IPL. The processor core chiplet FRU was
failing to be called out and guarded on a
RC_PMPROC_CHKSLW_ADDRESS_MISMATCH error and this prevented the system
from being able to IPL.
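For the back-level protection item above, the check on the "registry -l cupd/MinMifLevel" output can be automated once the output has been captured as text. This is a sketch under the assumption that the dump rows look like the example in that item (16-hex-digit groups followed by a bracketed ASCII column); the function name is hypothetical.

```python
import re


def min_mif_level_is_blank(registry_output):
    """Return True if every dumped byte is an ASCII blank (0x20) or a null (0x00)."""
    data = bytearray()
    for line in registry_output.splitlines():
        # Drop the bracketed ASCII rendering at the end of each dump row.
        hex_part = line.split("[", 1)[0]
        # Collect the 16-hex-digit groups (8 bytes each) on the row.
        for group in re.findall(r"\b[0-9A-Fa-f]{16}\b", hex_part):
            data += bytes.fromhex(group)
    if not data:
        raise ValueError("no hex dump rows found")
    return all(byte in (0x20, 0x00) for byte in data)


captured = """cupd/MinMifLevel:
2020202020202020 2020202020202020 [                ]
2020202020202020 2020202020202020 [                ]"""

lost_protection = min_mif_level_is_blank(captured)
```

A True result corresponds to the "blanks" case described above, in which code updates to a lower level should only be performed under guidance from IBM Support.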
System firmware changes that affect certain systems
- On multi-node systems with a power fault, a problem was fixed for
On-Chip Controller
errors caused by the power fault being reported as predictive errors
for SRC B1602ACB. These have been corrected to be informational
error logs. If running without the fix, the predictive and
unrecoverable errors logged for the OCC on loss of power to the node
can be ignored.
- On a multi-node system, a problem was fixed for a
power fault with SRC 11002610 having incorrect FRU callouts. The
wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node
system. Instead of calling out the processor FRU, the enclosure
FRU is called out. The first FRU callout is correct.
- On PowerVM systems with dedicated processor partitions with
low I/O utilization, a problem was fixed for the dedicated processor
partition becoming intermittently unresponsive. The problem can be
circumvented by changing the partition to use shared processors.
- On systems where memory relocation (as done by using Live
Partition Mobility (LPM)) and a partition reboot are occurring
simultaneously, a problem for a system termination was fixed. The
potential for the problem existed between the active migration and the
partition reboot.
- On a system running an IBM i partition, a problem was
fixed for a machine check incorrectly issued to an IBM i partition
running 7.2 or later with 4K sector disks. This problem only
pertains to the IBM Power System S814 (8286-41A), S824 (8286-42A),
E870 (9119-MME), and E880 (9119-MHE) models.
- A problem was fixed that limited Virtual Functions (VFs) to
a maximum of 50 on a single PCIe3 10GbE adapter (feature codes
#EN15, #EN16, #EN17, and #EN18; and CCINs 2CE3 and 2CE4) when 64 should
have been allowed. This problem only occurs for two of the SR-IOV
capable slot locations in the Power Systems: slot C4 in the PCIe3
I/O expansion drawer (#EMX0) and slot C7 in the Power System E850
(8408-E8E).
- A problem was fixed for an extraneous PCIe switch SRC
B7006A22 being called out when there is a valid PCIe expansion
drawer cable problem with SRC B7006A88 reported. The callout for
SRC B7006A22 should be ignored as the PCIe switch hardware is working
for this case.
- On a system with an AIX partition and a Linux partition, a
problem was fixed for dynamically moving an adapter that uses DMA from
the Linux partition to the AIX partition that caused AIX to fail by
going into KDB mode (0c20 crash). The management console showed
the following message for the partition operation: "Dynamic move
of I/O resources failed. The I/O slot dynamic partitioning
operation failed." The error was caused by Linux using 64K
mappings for the DMA window and AIX using 4K mappings for the DMA
window, causing incorrect calculations on AIX when it received the
adapter. Until the fix is applied, the adapters that use DMA
should only be moved from Linux to AIX when the partitions are powered
off. This problem does not pertain to Power System
S812L (8247-21L), S822L (8247-22L), and S824L (8247-42L) models.
- A problem was fixed for a Live Partition Mobility migration
failure of a time reference partition (TRP) to a FW830 system when
setting partition hibernate capable "false". This happens any
time a migration of the TRP is attempted. To circumvent
the problem, set the partition's Time Reference Property to disabled
and retry the migration.
- On systems with a partition using Active Memory Sharing
(AMS), a problem was fixed for a Live Partition Mobility (LPM)
migration of the AMS partition that can hang the hypervisor on the
target CEC. When an AMS partition migrates to the target CEC, a
hang condition can occur after processors are resumed on the target
CEC, but before the migration operation completes. The hang will
prevent the migration from completing, and will likely require a CEC
reboot to recover the hung processors. For this problem to occur,
there needs to be memory page-based activity (e.g. AMS dedup or Pool
paging) that occurs exactly at the same time that the Dirty Page
Manager's PSR data for that page is being sent to the target CEC.
- On systems with an invalid P-side or T-side in the
firmware, a problem was fixed in the partition firmware Real-Time
Abstraction System (RTAS) so that system Vital Product Data (VPD) is
returned at least from the valid side instead of returning no VPD
data. This allows AIX host commands such as lsmcode, lsvpd,
and lsattr that rely on the VPD data to work to some extent even if
there is one bad code side. Without the fix, all the VPD
data is blocked from the OS until the invalid code side is recovered by
either rejecting the firmware update or attempting to update the system
firmware again.
- On systems using PCIe adapters in SR-IOV mode, a problem
was fixed for occasional B200F011 and B2009008 SRCs that can occur
during an IPL, moving an adapter into SR-IOV mode, or with SR-IOV link
up/down activity.
- On systems using PCIe adapters in SR-IOV mode, the
following problems were addressed with a Broadcom Limited (formerly
known as Avago Technologies and Emulex) adapter firmware update to
10.2.252.1905: 1) Eliminating virtual function (VF) transmit
errors during VF resets and 2) Preventing loss of legacy flow
control when an adapter port is connected to a priority flow control
(PFC) capable switch.
- On systems with AIX or Linux encapsulated state
partitions, a problem was fixed for a Live Partition Mobility migration
failure for the encapsulated state partitions. The migration
fails on the target CEC when the associated paging space needed to
support the encapsulated state is not available. Removing the
"Encapsulated State" attribute from the partition would allow the
migration to succeed. However, removing this attribute can only
be accomplished if the partition is in the powered off state.
Encapsulated State partitions are needed for the remote restart
feature. An encapsulated state partition is a partition in which
the configuration information and the persistent data are stored
external to the server on persistent storage. A partition that
supports remote restart can be restarted remotely. For more
information on the remote restart feature, refer to this IBM Knowledge
Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
- Support was added to eliminate the yearly Utility COD
renewal on systems using Utility COD. The Utility COD usage is
already monitored to make sure systems are running within the
prescribed threshold limit of unreported usage, so a yearly customer
renewal is not needed to manage the Utility COD processor usage.
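
The Linux-to-AIX adapter move fix in the list above turns on the two
operating systems mapping the same DMA window with different page sizes
(64K versus 4K). The following is a minimal Python sketch of why that
mismatch skews calculations. The 2 GiB window size and the
one-entry-per-page model are assumptions chosen for illustration; neither
comes from the fix description.

```python
# Illustrative only: the window size and the one-entry-per-page model are
# assumptions, not details taken from the firmware fix description.
KIB = 1024
WINDOW_BYTES = 2 * 1024 * 1024 * KIB  # hypothetical 2 GiB DMA window


def mapping_entries(window_bytes: int, page_bytes: int) -> int:
    """Number of page mappings needed to cover a DMA window."""
    if window_bytes % page_bytes:
        raise ValueError("window must be a whole number of pages")
    return window_bytes // page_bytes


linux_entries = mapping_entries(WINDOW_BYTES, 64 * KIB)  # 64K mappings
aix_entries = mapping_entries(WINDOW_BYTES, 4 * KIB)     # 4K mappings

# The same window needs 16 times as many 4K entries as 64K entries, so a
# count computed for one page size is wrong when read with the other.
ratio = aix_entries // linux_entries
print(linux_entries, aix_entries, ratio)
```

Any size or offset derived from one page size has to be rescaled before it
is interpreted with the other.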
|
SC830_075_048 / FW830.11
11/11/15 |
Impact:
Availability Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive:
A problem was fixed for recovering from embedded MultiMediaCard (eMMC)
flash NAND errors that caused the service processor to go to a failed
state with SRC B1817212 on systems with a single service
processor. On systems with redundant service processors, the
failed service processor would get guarded with a B151E6D0 or B152E6D0
SRC depending on which service processor fails.
- HIPER/Pervasive: A
problem associated with workloads using transactional memory on PowerVM
was discovered and is fixed in this service pack. The effect of the
problem is non-deterministic but may include undetected corruption of
data.
- DEFERRED: A
problem was fixed for memory on-die termination (ODT) settings to
improve the signal integrity of the memory channel.
- A problem was fixed for recovery from unaligned addresses
for MSI interrupts from PCIe adapters. The recovery prevents an
adapter timeout caused by resource exhaustion. With the fix, the
resources for each bad interrupt are returned, allowing the PCIe
adapter to continue to run for the normal traffic.
- A problem was fixed for an Operations Panel SRC of B1504804
with no FRU callout. A callout of the failed hardware has been
added.
- A problem was fixed to prevent recoverable power faults of
short duration from causing the system to lose power supply
redundancy. Without the fix, the faulted state persisted for the
recovered power fault, causing a problem with a system power off if
other power supplies were lost at a later time.
- A problem was fixed for a PCIe3 I/O expansion drawer
(#EMX0) link failure with SRC B7006A8B. The settings for the
continuous time linear equalizers (CTLE) were adjusted to improve the
incoming signal strength to improve the stability of the links.
The expansion drawer must be power cycled or the CEC can be re-IPLed
for the fix to activate.
- A problem was fixed for recovery from a processor local bus
(PLB) hang on the service processor. The errant PLB hang recovery
would be seen in concurrent firmware updates that, on rare occasions,
fail to do a side switch to activate to the new level of
firmware. On the management console, the error message would be
"HSCF010180E Operation failed ... E302F873 is the error code."
Other than the failed code level activation, the firmware update is
successful. If this problem occurs, the system can be set to the
new firmware level by doing a power off from the management console and
then doing a power on with side switch selected in the advanced
properties.
System firmware changes that affect certain systems
- A problem was fixed
for the System Feature Code for the E870 (9119-MME) being displayed as
"EPBB" by IBM i "DSPSYSVAL QPRCFEAT" when it should be
"EPBA". This created a problem for certain IBM i software
packages whose license was tied to the System Feature Code. This
fix has a concurrent activation. For FW830.10, a similar,
non-concurrent fix for the feature codes was made but the System
Feature Code, as seen in IBM i partitions, did not update
immediately.
|
SC830_068_048 / FW830.10
09/10/15 |
Impact:
Availability Severity: HIPER
New features and functions
- The firmware code update process was enhanced with a
feature to block a firmware "downgrade" to a level that is below the
system's manufactured code level.
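
The downgrade block described above amounts to comparing the target level
with the manufactured level. A hedged Python sketch follows, assuming the
three numeric fields of an identifier such as SC830_068_048 are the
release, the service pack level, and a third field that the comparison
ignores. The function names and the comparison rule are hypothetical and
are not the service processor's actual logic.

```python
# Sketch only: the field meanings and the comparison rule are assumptions
# made for illustration; this is not the service processor's actual logic.
import re
from typing import NamedTuple


class FirmwareLevel(NamedTuple):
    release: int        # e.g. 830 in "SC830_068_048"
    service_pack: int   # e.g. 68
    last_field: int     # e.g. 48 (kept but not used in the comparison)


def parse_level(name: str) -> FirmwareLevel:
    """Parse an identifier shaped like 'SC830_068_048'."""
    match = re.fullmatch(r"[A-Z]{2}(\d{3})_(\d{3})_(\d{3})", name)
    if match is None:
        raise ValueError(f"unrecognized firmware level: {name!r}")
    return FirmwareLevel(*(int(g) for g in match.groups()))


def is_blocked_downgrade(target: str, manufactured: str) -> bool:
    """True if `target` is below the manufactured level."""
    t, m = parse_level(target), parse_level(manufactured)
    return (t.release, t.service_pack) < (m.release, m.service_pack)


print(is_blocked_downgrade("SC830_048_048", "SC830_068_048"))
print(is_blocked_downgrade("SC830_075_048", "SC830_068_048"))
```

With a system manufactured at SC830_068_048, the first call reports a
blocked downgrade and the second reports an allowed update.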
System firmware changes that affect all systems
- HIPER/Pervasive:DEFERRED:
A problem was fixed for a TCP/IP performance degradation on PCIe
ethernet adapters with Remote Direct Memory Access (RDMA) over
Converged Ethernet (RoCE). By adjusting the system memory
caching, a significant improvement was made to the data throughput
speed to restore performance to expected levels. This fix
requires a system re-IPL to take effect. This problem affects the
E850 (8408-E8E), E870 (9119-MME), and E880 (9119-MHE) systems.
- HIPER/Pervasive:
A problem was fixed for an ethernet adapter hanging on the service
processor. This hang prevents TCP/IP network traffic from the management
console and the Advanced System Management Interface (ASMI) browsers. It
makes it appear as if the service processor is unresponsive and can be
confused with a service processor in the stopped state. An A/C power
cycle would recover a hung ethernet adapter.
- HIPER/Pervasive:
A problem was fixed for missing the interrupts for processor local bus
(PLB) time-outs. This problem could hang the service processor or cause
it to panic with a reset/reload of the service processor. There is a
possibility the reset of the service processor could take it to a
stopped state where the service processor would be unresponsive. In the
service processor stopped state, any active partitions will continue to
run but they will not be able to be managed by the management console.
The partitions can be allowed to run until the next scheduled service
window at which time the service processor can be recovered with an AC
power cycle or a pin-hole reset from the operator panel.
- HIPER/Pervasive:
A problem was fixed for a system reset to clear the boot registers to
prevent the reset from being mishandled as a chip reset. If a "system
reset" is misinterpreted as a "chip reset", the boot of the service
processor can go inadvertently to a stopped state and be unresponsive.
Pin-hole resets from the operations panel could also fail to the service
processor stopped state. In the service processor stopped state, any
active partitions will continue to run but they will not be able to be
managed by the management console. The partitions can be allowed to run
until the next scheduled service window at which time the service
processor can be recovered with an AC power cycle or a pin-hole reset
from the operator panel.
- HIPER/Pervasive:
A problem was fixed so a corrupted file system partition table can be
recovered and not have the service processor lose the ability to do P
and T-side switches. In error recovery situations, the loss of the
side-switch option could present itself as an unresponsive service
processor if it was needed to prevent a failure to the service processor
stopped state.
- HIPER/Pervasive:
A problem was fixed for a runaway interrupt request (IRQ) condition that
caused the service processor to go to a stopped state. In the service
processor stopped state, any active partitions will continue to run but
they will not be able to be managed by the management console. The
partitions can be allowed to run until the next scheduled service window
at which time the service processor can be recovered with an AC power
cycle or a pin-hole reset from the operator panel.
- HIPER/Pervasive:
A problem was fixed for a dump partition full condition that caused the
service processor to go to a stopped state. In the service processor
stopped state, any active partitions will continue to run but they will
not be able to be managed by the management console. The partitions can
be allowed to run until the next scheduled service window at which time
the service processor can be recovered with an AC power cycle or a
pin-hole reset from the operator panel.
- DEFERRED: A
problem was fixed for a PCIe3 I/O expansion drawer (#EMX0)
link failure with SRC B7006A8B. Data packet send retries were
increased and link recovery was enabled to improve the stability of the
links. The CEC must be re-IPLed for the fix to activate.
- A problem was fixed for an SRC 11002613 logged during a
concurrent repair of a power supply. This SRC was erroneously
logged and did not represent a real problem.
- A problem was fixed for an intermittent SRC B1504804 logged
on a re-IPL of the CEC that did not result in an IPL failure.
- A problem was fixed for the capture of the registers for
the Hostboot Self-Boot Engine (SBE) for SBE failures. These
registers had been missing from failure data for SBE failures, making
these problems more difficult to debug.
- A problem was fixed to remove an unnecessary delay in the
system IPL to reduce the time needed to IPL by 30 seconds.
- A problem was fixed for an unneeded error log with SRC
B181DB04 that occurred in a failed IPL for a normal condition of lost
PNOR flash access after a reIPL process had started and taken over the
access.
- A problem was fixed for an Advanced System Management
Interface (ASMI) error message of "Error in function 'connect', error
code 111" when a browser attempted to connect before the service
processor was ready. The browser connection through the web
server is now held off until the ASMI process is ready after a reset of
the service processor or an AC power cycle of the system.
- A problem was fixed for an incorrect call home for SRC
B1818A0F. There was no real problem so this call home should have
been ignored.
- A problem was fixed for a dump reIPL that failed with SRC
B1818601 and B181460B after processor checkstops had terminated the
system.
- A problem was fixed for an infrequent service processor
database corruption during concurrent firmware update that caused the
system to terminate.
- A problem was fixed for a failed PCI oscillator that was
not guarded, causing repeated errors with SRC B15050A6 and B158E504
logged on each IPL of the system.
- A problem was fixed for a local clock card (LCC)
failure with SRC 11001515 that was missing a part number and location
code. This information has been added for LCC faults so the FRU
to replace is properly identified.
- A problem was fixed for a defective PCI oscillator in the
local clock card (LCC) with SRC BC58090F that caused an IPL failure for
the node instead of failing over to the redundant LCC.
- A problem was fixed for a service processor dump with error
logs B181E911 and B181D172 during an IPL. The error logs
were for the detection of defunct processes but otherwise the IPL was
successful.
- A problem was fixed for Digital Power Subsystem Sweep
(DPSS) firmware updates that caused an error log with SRC B1819906 but
otherwise was successful.
- A problem was fixed for missing Keyword (KW) and Resource
ID (RID) for SRC B181A40F.
- A problem was fixed for an I2C bus lock error during a CEC
power off that caused a ten minute delay for the power off and
error log SRCs B1561314 and B1814803 with error number (errno) 3E.
- A problem was fixed for the System Feature Code for the
E870 (9119-MME) being displayed as "EPBB" by IBM i "DSPSYSVAL
QPRCFEAT" when it should be "EPBA". This created a problem
for certain IBM i software packages whose license was tied to the
System Feature Code. The System Feature Code, as seen in IBM
i partitions, does not update immediately with concurrent
activation of the fix pack, but it will eventually change to the
correct "EPBA" value within 24 hours. If it is necessary to see
the new System Feature Code value immediately, a re-IPL of the
system is needed.
- A problem was fixed for concurrent firmware updates to a
system that needed to be re-IPLed after getting a B113E504 SRC during
activation of the new firmware level on the hypervisor. The code
update activate failed if the Sleep Winkle (SLW) images were
significantly different between the firmware levels. The SLW
contains the state of the processor and cache so it can be restored
after sleep or power saving operations.
- A problem was fixed for System Power Control Network (SPCN)
failover for an I/O module A/C power fault on the PCIe3 I/O
expansion drawer (#EMX0). A sideband failure on one I/O module
was blocking SPCN commands for the entire drawer instead of SPCN
failing over to a working I/O module. The broken SPCN
communications path prevented concurrent maintenance operations
on the expansion drawer.
- A problem was fixed for a possible lack of recovery for an
A/C power loss condition on the PCIe3 I/O expansion drawer
(#EMX0). If there was an outstanding problem on the
expansion drawer and an A/C loss occurred while the earlier error was
still unprocessed, the auto-recovery for the A/C power loss would not
have happened.
- A problem was fixed for a missing FRU call out for error
SRC B7006A87 when unable to read the drawer module logical flash
VPD for the PCIe3 I/O expansion drawer (#EMX0).
- For a partition that has been migrated with Live Partition
Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a
Main Storage Dump (MSD) IPL failing with SRC B2006008. The MSD
IPL can happen after a system failure and is used to collect failure
data. If the partition is rebooted anytime after the migration,
the problem cannot happen. The potential for the problem existed
between the active migration and a partition reboot.
- A problem was fixed for partial loss of Entitlement for
On/Off Memory Capacity On Demand (also called Elastic COD). Users
with large amounts of Entitlement on the system of greater than "65535
GB * Days" could have had a truncation of the Entitlement value on a
re-IPL of the system. To recover lost Entitlement, the customer
can request another On/Off Enablement Code from IBM support to
"re-fill" their entitlement.
- A problem was fixed for a management console command line
failure with a return code 0x40000147 (invalid lock state) when trying
to delete SR-IOV shared mode configurations. This could have
occurred if the adapter slot had been re-purposed without involvement
of the management console and was owned and operational at the time of
the requested delete. With the fix, the current ownership of the
slot is honored and only the SR-IOV shared mode configuration data is
deleted on the force delete.
- A problem was fixed for an incorrect restriction on
the amount of "Unreturned" resources allowed for a Power
Enterprise Pool (PEP). PEP allows for logical moving of resources
(processors and memory) from one server to another. Part of this
is 'borrowing' resources from one server to move to another. This may
result in "Unreturned" resources on the source server. The management
console controls how many total "Unreturned" PEP resources can
exist. For this problem, the user had some "Unreturned" PEP
memory and asked to borrow more but this request was incorrectly
refused by the hypervisor.
- A problem was fixed for a PCIe3 I/O expansion drawer
(#EMX0) error with SRCs B7006A82 and B7004137 for a missing FRU
location code. The FRU location code for the Active Optical Cable
(AOC) was added to identify the failing drawer side.
- A problem was fixed for a PCIe3 I/O expansion drawer
(#EMX0) failing to IPL when the IPL includes an FPGA update for
the drawer. The FPGA update is actually good but perceived as a
failure when the FPGA resets as part of the update. For the
problem, a re-IPL of the system would have fixed the drawer.
- A problem was fixed for Live Partition Mobility (LPM) to
prevent a memory access error during LPM operations with unpredictable
effects. When data is moved by LPM, the underlying firmware code
requires that the buffers be 4K aligned. The fixes made now force
the buffers to be 4K aligned and if there is still an alignment issue,
the LPM operation will fail without impacting the system.
- A problem was fixed for an On-Chip Controller (OCC) failure
after a system dump with SRCs B18B2616 and BC822024 reported.
This resulted in the system running with reduced performance in safe
mode, where processor clock frequencies are lowered to minimum levels
to avoid hardware errors since the OCC is not available to monitor the
system. A re-IPL of the system would have resolved the
problem.
- A performance problem was fixed for systems entering
processor hang recovery prematurely with SRC B111E504 and PBCENTFIR(9)
"PB_CENT_HANG_RECOV". The ability of the L3 cache to prefetch
memory was extended to speed the memory accesses and prevent a
processor hang condition for applications running with lower memory
affinity.
- A problem was fixed for a processor error causing a
Hostboot terminate instead of a deconfiguration of the bad hardware and
continuation of the IPL. The state of the processors was
synchronized between the service processor and the Hostboot process to
correct the error.
- A problem was fixed for a USB Save and Restore of machine
configuration to not lose the system name.
- A problem was fixed for Advanced System Management
Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity"
being missing with the system IPLed and partitions running. The
help text is now available for the system in the powered on state as
well as in the powered off state.
- A problem was fixed for an intermittent power supply error
SRC 1100D008 with a flood of VPD SRC B1504804 entries with errno 3E
logged on a re-IPL of the CEC that did not result in an IPL failure.
- A problem was fixed for an LED intermittently not lighting
for an enclosure with a fault.
- A problem was fixed for an intermittent PSI link error with
SRC B15CDA27 after a firmware update or reset/reload of the service
processor.
- A problem was fixed for PCIe3 adapters failing when
requesting more than 32 Message Signaled Interrupts (MSI-X). The
adapter may fail to ping or cause OS tasks to hang that are using the
adapter. This problem was found specifically on the 10 Gb
Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and
#5769 and on the 56 Gb Infiniband (IB) Fourteen Data Rate (FDR) adapter
with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.
However, other PCIe adapters may also be affected.
- A problem was fixed for IBM copyright statements being
displayed on the System Management Services (SMS) menu after a repair
or replacement of system hardware.
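
The Live Partition Mobility fix in the list above forces buffers to be 4K
aligned and fails the operation cleanly if alignment still does not hold.
A minimal Python sketch of that check-then-fail pattern follows. The
helper names are hypothetical and the code only illustrates the
arithmetic, not the firmware implementation.

```python
# Sketch only: helper names are hypothetical; the 4K requirement is the
# only detail taken from the fix description.
ALIGNMENT = 4096  # 4K


def is_aligned(address: int, alignment: int = ALIGNMENT) -> bool:
    """True if address is a multiple of alignment (a power of two)."""
    return (address & (alignment - 1)) == 0


def align_up(address: int, alignment: int = ALIGNMENT) -> int:
    """Round address up to the next multiple of alignment."""
    return (address + alignment - 1) & ~(alignment - 1)


def check_buffer(address: int) -> int:
    """Force alignment, then fail cleanly if it still does not hold."""
    aligned = align_up(address)
    if not is_aligned(aligned):
        raise ValueError("buffer is not 4K aligned; aborting operation")
    return aligned


print(hex(check_buffer(0x1234)))
print(hex(check_buffer(0x3000)))
```

An address that is already aligned is returned unchanged, and an unaligned
one is rounded up to the next 4K boundary.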
System firmware changes that affect certain systems
- HIPER/Pervasive:
For partitions with a graphics console and USB keyboard, a problem was
fixed for an OS boot hang at the CA00E100 progress SRC. For the problem,
the hang can be avoided by issuing the boot command from the Open
Firmware (OF) prompt.
- HIPER/Pervasive:
On systems using PowerVM with shared processor partitions that are
configured as capped or in a shared processor pool, there was a problem
found that delayed the dispatching of the virtual processors which
caused performance to be degraded in some situations. Partitions with
dedicated processors are not affected. The problem is rare and can be
mitigated, until the service pack is applied, by creating a new shared
processor AIX or Linux partition and booting it to the SMS prompt; there
is no need to install an operating system on this partition. Refer to
help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863
for additional details.
- DEFERRED: A
problem was fixed for Non-Volatile Memory express (NVMe) adapters,
plugged into PCIe3 switches, mis-training to generation 1 instead of
generation 3. NVMe adapters attached directly to the PCIe3 slots trained
correctly to the generation 3 specification. This fix requires a re-IPL
of the system to correct the training of any mis-trained adapters.
- On multiple-node systems, a problem was fixed for a missing
location code, part, and serial number for a faulty symmetric
multiprocessing (SMP) cable in the call home B1504922 error log.
- On multiple-node systems, a problem was fixed for a two
hour IPL hang in Hostboot caused by multiple B18ABAAB errors from more
than one node. The Hostboot process failed to go into its
reconfiguration loop to do error recovery and continue the IPL.
- On a system with redundant service processors, a
problem was fixed for an IPL failure for a bad service processor cable
on the primary service processor with SRCs B1504904 and B18ABAAB
logged. The system should have done an error failover to the
backup service processor and continued the IPL to get the partitions
running.
- On a system with redundant service processors where
redundancy is disabled, a problem was fixed for an unrecoverable (UE)
SRC B181DA19 being logged on a re-IPL after a checkstop error.
The error log did not interfere with the re-IPL, which was successful.
- On multiple-node systems, a problem was fixed for
extraneous error logs after a 12V power fault. After termination,
there were additional 110026Bx error log entries that should have been
ignored.
- On a system with redundant service processors, a problem
was fixed for the isolation procedures for an Anchor card error and
system VPD collection failure with termination SRC B181A40F.
FSPSP04 and FSPSP06 are no longer called out as part of reporting the
VPD collection failure. FSPSP30 has been updated with isolation
steps for this problem and is called out and should be used for the
problem isolation. Retain tip H213935 also provides the FRU
isolation steps. Procedure FSPSP30 tries to replace the service
processor first. If that does not work, then the procedure has
the Anchor card replaced.
- On multiple-node systems, a problem was fixed to isolate a
power fault during IPL to the specific node and guard the node, and
allow the rest of the system to IPL. Previously, the power fault
would not be localized to the problem node and it caused the IPL of all
the nodes of the system to fail.
- On a system with redundant service processors, a problem
was fixed for failovers to the backup service processor that caused an
On-Chip Controller (OCC) abort. This placed the CEC in a "safe"
mode where it ran at reduced processor clock frequencies to prevent
exceeding the power limits while not under OCC control.
- On a system with an IBM i partition using Active Memory
Sharing (AMS), a problem was fixed for internal memory management
errors caused by deleting an IBM i partition that had been powered off
in the middle of a Main Storage Dump (MSD). Until the fix is
installed, if an MSD is interrupted for an IBM i partition that has AMS,
the partition should be powered on and powered off normally before a
delete of the partition is done to prevent errors with unpredictable
effects. This problem does not affect the S822 (8284-22A),
S812L (8247-21L), S822L (8247-22L), S824L (8247-42L), and E850 (8408-E8E)
models.
- On a system with redundant service processors, a problem
was fixed for a failover to the backup service processor during a power
off of the CEC that caused a hypervisor time-out with SRC
B182953C. This error was caused by a delay in synchronizing the
state of the hypervisor to the backup service processor but it did not
prevent the power off from completing successfully.
- On a system with redundant service processors, a problem
was fixed for a firmware update causing an error log server dump with
SRC B1818601. The error log server restarted automatically to
recover from the error and the firmware update was successful.
|
SC830_048_048 / FW830.00
06/08/15 |
Impact:
New Severity: New
New Features and Functions
NOTE:
- POWER8 (and later) servers include an “update access key”
that is checked when system firmware updates are applied to the
system. The initial update access keys include an expiration date
which is tied to the product warranty. System firmware updates will not
be processed if the calendar date has passed the update access key’s
expiration date, until the key is replaced. As these update
access keys expire, they need to be replaced using either the Hardware
Management Console (HMC) or the Advanced System Management Interface (ASMI) on
the service processor. Update access keys can be obtained via the
key management website: http://www.ibm.com/servers/eserver/ess/index.wss.
- Support for Little Endian (LE) Linux in PowerVM. With
PowerVM LE guest support, all three Linux on Power distribution
partners (SUSE, Canonical, and Red Hat) with LE operating systems can
run on the same IBM Power Systems.
- Support for allowing the PowerVM hypervisor to continue to
run after the service processor has become unresponsive with an SRC
B1817212. Any active partitions will continue to run but they
will not be able to be managed by the management console. The
partitions can be allowed to run until the next scheduled service
window at which time the service processor can be recovered with an AC
power cycle or a pin-hole reset from the operator panel. This
error condition would only be seen on a system that had been running
with a single service processor (no redundancy for the service
processor).
- Support for three and four node configurations of the E880
(9119-MHE) system.
- Support for an increase of the maximum number of PCIe 3 I/O
expansion drawers (#EMX0) that can be attached to an E870 /E880 node
from two to four.
- Support for Single Root I/O Virtualization (SR-IOV) that
enables the hypervisor to share an SR-IOV-capable PCI-Express adapter
across multiple partitions. Twelve ethernet adapters are supported with
the SR-IOV NIC capability, when placed in the P8 system (SR-IOV
supported in both native mode and through VIOS):
- PCIe3 4-port 10GbE SR Adapter
(F/C EN15 and CCIN 2CE3)
- PCIe3 4-port 10GbE SR Adapter
(F/C EN16 and CCIN 2CE3).
Fits E870/E880 system node PCIe slot.
- PCIe3 4-port 10GbE SFP+ Copper Adapter
(F/C EN17 and CCIN 2CE4)
- PCIe3 4-port 10GbE SFP+ Copper Adapter
(F/C EN18 and CCIN 2CE4). Fits E870/E880
system node PCIe slot.
- PCIe2 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+
Adapter (F/C EN0H and CCIN 2B93)
- PCIe2 LP 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+
Adapter (F/C EN0J and CCIN 2B93)
- PCIe2 LP Linux 4-port (10Gb FCoE & 1GbE) SR and RJ45 SFP+
Adapter (F/C EL38 and CCIN 2B93)
- PCIe2 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter
(F/C EN0M and CCIN 2CC0)
- PCIe2 LP 4-port (10Gb FCoE & 1GbE) LR and RJ45 Adapter
(F/C EN0N and CCIN 2CC0)
- PCIe2 4-port (10Gb FCoE & 1GbE) SFP+ Copper and RJ45
Adapter (F/C EN0K and CCIN 2CC1)
- PCIe2 LP 4-port (10Gb FCoE & 1GbE) SFP+ Copper and
RJ45 Adapter (F/C EN0L and CCIN 2CC1)
- PCIe2 LP Linux 4-port (10Gb FCoE & 1Gb Ethernet) SFP+ Copper and
RJ45 (F/C EL3C and CCIN 2CC1)
These adapters each have four ports, and all four ports are enabled
with SR-IOV function. The entire adapter (all four ports) is configured
for SR-IOV or none of the ports is.
System firmware updates the adapter firmware level on these adapters to
10.2.252.16 when a supported adapter is placed into SR-IOV mode.
Support for SR-IOV adapter sharing is now available for adapters in the
PCIe3 I/O Expansion Drawer with F/C #EMX0.
SR-IOV NIC on the Power P8 systems is supported by:
- AIX 6.1 TL9 SP4 and APAR IV63331, or later
- AIX 7.1 TL3 SP4 and APAR IV63332, or later
- IBM i 7.1 TR8, or later (Supported on S824/S814)
- IBM i 7.2, or later (Supported on S824/S814)
- IBM i 7.1 TR9, or later (Supported on E870/E880)
- IBM i 7.2 TR1, or later (Supported on E870/E880)
- Red Hat Enterprise Linux 6.5, or later (Supported on
E870/E880/S812L/S822/S822L/S814/S824/S824L except for adapters with
F/Cs EN15/EN16/EN17/EN18)
- Red Hat Enterprise Linux 6.6, or later (Supported
on E850 and minimum level needed for adapters with F/Cs
EN15/EN16/EN17/EN18)
- Red Hat Enterprise Linux 7.1, or later
- SUSE Linux Enterprise Server 11 SP1, or later
(Supported on S812L/S822/S822L/S814/S824/S824L)
- SUSE Linux Enterprise Server 11 SP3, or later
(Supported on E870/E880)
- SUSE Linux Enterprise Server 12, or later
(Supported on E850)
- Ubuntu 15.04, or later (Supported on
E850/S812L/S822/S822L/S814/S824/S824L)
- VIOS 2.2.3.4 with interim fix IV63331, or later
- Support for an upgrade from 8-core processors to 12-core
processors for the E880 (9119-MHE) system.
- Support for adjusting the voltage regulator input voltage
dynamically based on regulator slave failures to achieve the optimal
voltage for system operation for normal and degraded conditions.
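
The SR-IOV adapter list above can be captured as a small lookup table. The sketch below is illustrative only: the feature codes and CCINs are transcribed from the list, while the names SRIOV_ADAPTERS, feature_codes_by_ccin and sriov_ports are hypothetical helpers and not part of any IBM tool. It also models the rule that an adapter is placed in SR-IOV mode as a whole (all four ports or none).

```python
from collections import defaultdict

# Feature code (F/C) -> CCIN, transcribed from the adapter list above.
SRIOV_ADAPTERS = {
    "EN15": "2CE3", "EN16": "2CE3",
    "EN17": "2CE4", "EN18": "2CE4",
    "EN0H": "2B93", "EN0J": "2B93", "EL38": "2B93",
    "EN0M": "2CC0", "EN0N": "2CC0",
    "EN0K": "2CC1", "EN0L": "2CC1", "EL3C": "2CC1",
}

PORTS_PER_ADAPTER = 4  # every listed adapter has four ports


def feature_codes_by_ccin(adapters=SRIOV_ADAPTERS):
    """Group feature codes under the CCIN they share."""
    groups = defaultdict(list)
    for feature_code, ccin in adapters.items():
        groups[ccin].append(feature_code)
    return {ccin: sorted(codes) for ccin, codes in groups.items()}


def sriov_ports(feature_code, sriov_mode):
    """Return how many ports run SR-IOV: all four or none, never a subset."""
    if feature_code not in SRIOV_ADAPTERS:
        raise KeyError(f"{feature_code} is not in the SR-IOV adapter list")
    return PORTS_PER_ADAPTER if sriov_mode else 0


if __name__ == "__main__":
    for ccin, codes in sorted(feature_codes_by_ccin().items()):
        print(ccin, ", ".join(codes))
```

Grouping by CCIN makes it easy to see which feature codes are variants of the same card (for example, full-height, low-profile and Linux-only versions).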
System firmware changes that affect all systems
- A problem was fixed to eliminate unneeded guard data from
call home messages for the cases where there is no hardware error in
the system.
- On systems with redundant service processors, a problem was
fixed in the run-time error failover to the backup service processor
so it does not terminate on FRU support interface (FSI) errors. In
the case of FSI errors on the new primary service processor, the
primary will do a reset/reload instead of a terminate.
- A problem was fixed to call home guarded FRUs on each
IPL. Only the initial failure of the hardware was being reported
to the error log.
- Support was added to the Advanced System Management
Interface (ASMI) USB menu to allow a system dump to be collected to
USB with the power on to the system. This allows the dump to be
collected with the system memory state intact.
- A problem was fixed for the service processor error log
handling that caused SRC B150BAC5 errors when converting an error log
entry from an object into a flattened array of bytes.
- A problem was fixed that prevented a second management
console from being added to the CEC. In some cases, network outages
caused defunct management console connection entries to remain in the
service processor connection table, making connection slots
unavailable for new management consoles. A reset of the service
processor could be used to remove the defunct entries.
- A problem was fixed to eliminate a false error log and call
home for an SRC 1100154F fan fault caused by an unplugged power cable.
- A problem was fixed for a highly intermittent IPL failure
with SRC B18187D9 caused by a defunct attention handler process. For
this problem, the IPL will continue to fail until the service
processor is reset.
- A problem was fixed for missing FRU information in SRC
11001515. SRC 11001515 was logged indicating replacement of power
supply hardware, but did not include the location code, the part
number, the CCIN, or the serial number.
- A problem was fixed for systems with a corrupted date of
"1900" showing for the Update Access Key (UAK). The firmware update
is allowed to proceed on systems with a bad UAK date because the fix
is in an emergency service pack. After the fix is installed, the user
should correct the UAK date, if needed, by using the original UAK for
the system. On the Management Console, enter the original update
access key via the "Enter COD Code" panel. Alternatively, on the
Advanced System Management Interface (ASMI), enter the original update
access key via the "On Demand Utilities/COD Activation" panel.
- A problem with concurrent PCIe adapter maintenance was
fixed that caused On-Chip Controller (OCC) resets with SRCs B18B2616
and BC822029 logged, forcing the system into safe mode (processor
voltage/frequency reduced to a "safe" level where thermal monitoring is
not required). Recovery from safe mode requires a system re-IPL.
- A problem was fixed for I/O adapters so that BA400002
errors were changed to informational for memory boundary adjustments
made to the size of DMA map-in requests. These DMA size adjustments
were previously marked as unrecoverable errors (UE) even though the
condition is normal.
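
The management console connection problem described in this list can be pictured with a toy model. The sketch below is not the service processor implementation: the ConnectionTable class, its method names and the two-slot table size are assumptions made purely for illustration. It shows how defunct entries that are never removed keep slots occupied, and how purging them (the effect attributed above to a service processor reset) frees a slot for a new console.

```python
class ConnectionTable:
    """Hypothetical fixed-size table of management console connections."""

    def __init__(self, slots):
        self.slots = slots      # assumed slot limit, not an IBM-documented value
        self.entries = {}       # console name -> True if the connection is alive

    def add(self, name):
        """Add a console if a slot is free; defunct entries still occupy slots."""
        if len(self.entries) >= self.slots:
            return False
        self.entries[name] = True
        return True

    def mark_defunct(self, name):
        """Simulate a network outage leaving a stale entry behind."""
        self.entries[name] = False

    def purge_defunct(self):
        """Drop stale entries so their slots become available again."""
        self.entries = {n: alive for n, alive in self.entries.items() if alive}


if __name__ == "__main__":
    table = ConnectionTable(slots=2)
    table.add("console-a")
    table.add("console-b")
    table.mark_defunct("console-a")
    print(table.add("console-c"))   # False: the stale entry still holds a slot
    table.purge_defunct()
    print(table.add("console-c"))   # True: the slot was freed
```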
System firmware changes that affect certain systems
- On systems in PowerVM mode, a problem was fixed for
unresponsive PCIe adapters after a partition power off or a partition
reboot.
- On systems using Virtual Shared Processor Pools (VSPP), a
problem was fixed for an inaccurate pool idle count over a small
sampling period.
- On systems with partitions using shared processors, a
problem was fixed that could result in latency or timeout issues with
I/O devices.
- On systems using PowerVM, a problem was fixed for a
hypervisor deadlock that results in the system being in an
"Incomplete state" as seen on the management console. This deadlock
is the result of two hypervisor tasks using the same locking
mechanism for handling requests between the partitions and the
management console. Except for the loss of the management console
control of the system, the system is operating normally when the
"Incomplete state" occurs.
- On systems with memory mirroring enabled, a problem was
fixed for PowerVM over-estimating its memory needs. With the fix,
more memory is available for use by the partitions.
- On systems using PowerVM, a problem was fixed for the
handling of the error of multiple cache hits in the instruction
effective-to-real address translation cache (IERAT). A multi-hit
IERAT error was causing system termination with SRC B700F105. The
multi-hit IERAT is now recognized by the hypervisor and reported to
the OS where it is handled.
- On systems using PowerVM, a problem was fixed to allow
booting off an iSCSI device. For the failure, the partition firmware
error logs had SRC BA012010 "Opening the TCP node failed." and SRC
BA010013 "The information in the error log entry for this SRC
provides network trace data." The open firmware standard output
trace showed SRC BA012014 "The TCP re-transmission count of 8 was
exceeded. This indicates a large number of lost packets between this
client and the boot or installation server" followed by SRC BA012010.
- On systems using PowerVM, support was added for USB 2.0
hubs so that a keyboard plugged into a USB 2.0 hub will work
correctly at the SMS menus. Previously, a keyboard plugged into a
USB 2.0 hub was not recognized as a device.
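
The SRC BA012014 text quoted in the iSCSI boot item above refers to a re-transmission count of 8 being exceeded. As a rough illustration of what such a limit means, the sketch below retries a send until it succeeds or the limit is reached. The partition firmware's actual TCP logic is not described in these notes; send_with_retransmit, RetransmitLimitExceeded and the callback interface are hypothetical names, and only the limit of 8 comes from the quoted message.

```python
MAX_RETRANSMISSIONS = 8  # limit quoted in the SRC BA012014 message text


class RetransmitLimitExceeded(Exception):
    """Raised when a send still fails after the allowed re-transmissions."""


def send_with_retransmit(send_once, limit=MAX_RETRANSMISSIONS):
    """Call send_once() until it returns True.

    Returns the number of re-transmissions that were needed (0 if the
    first attempt succeeded). Raises RetransmitLimitExceeded when the
    initial attempt and `limit` re-transmissions have all failed.
    """
    retransmissions = 0
    while not send_once():
        if retransmissions >= limit:
            raise RetransmitLimitExceeded(
                f"re-transmission count of {limit} was exceeded"
            )
        retransmissions += 1
    return retransmissions


if __name__ == "__main__":
    outcomes = iter([False, False, True])   # two lost packets, then success
    print(send_with_retransmit(lambda: next(outcomes)))
```

With a lossy path between the client and the boot server, every attempt can fail and the limit is reached, which corresponds to the failure signature described in the item above.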
|