SC820
For Impact, Severity, and other firmware definitions, please
refer to the 'Glossary of firmware terms' URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The complete Firmware Fix History for this Release Level can be
reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/SC-Firmware-Hist.html
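The level identifiers used in this history, such as SC820_099_047, can be split into their parts for comparison. The following is a minimal Python sketch. The names parse_level and FirmwareLevel are hypothetical, and reading the three numeric fields as release level, service pack level, and last disruptive service pack level is an assumption based on the usual naming convention, not something stated on this page.

```python
import re
from typing import NamedTuple


class FirmwareLevel(NamedTuple):
    prefix: str           # e.g. "SC"
    release: int          # e.g. 820 (assumed: release level)
    service_pack: int     # e.g. 99  (assumed: service pack level)
    last_disruptive: int  # e.g. 47  (assumed: last disruptive service pack level)


# Two uppercase letters followed by three underscore-separated 3-digit fields.
_LEVEL_RE = re.compile(r"^([A-Z]{2})(\d{3})_(\d{3})_(\d{3})$")


def parse_level(text: str) -> FirmwareLevel:
    """Split an identifier such as 'SC820_099_047' into its fields."""
    match = _LEVEL_RE.match(text.strip())
    if match is None:
        raise ValueError(f"not a firmware level identifier: {text!r}")
    prefix, release, service_pack, last_disruptive = match.groups()
    return FirmwareLevel(prefix, int(release), int(service_pack), int(last_disruptive))


newer = parse_level("SC820_099_047")
older = parse_level("SC820_091_047")
# Named tuples compare field by field, so within one prefix and release
# the identifier with the higher service pack field sorts later.
print(newer > older)
```

Because the fields are stored as integers, levels from the same release can be sorted or compared directly without string handling.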
SC820_099_047 / FW820.40
05/04/16
Impact: Availability
Severity: SPE
New Features and Functions
- Support was added for the Stevens6+ option of the internal
tray loading DVD-ROM drive with F/C #EU13. This is an 8X/24X(max)
Slimline SATA DVD-ROM Drive. The Stevens6+ option is a FRU
hardware replacement for the Stevens3+. MTM 7226-1U3
(Oliver) FC 5757/5762/5763 attaches to IBM Power Systems and
lists Stevens6+ as optional for Stevens3+. If the Stevens6+
DVD drive is installed on the system without the required firmware
support, the boot of an AIX partition will fail when the DVD is used as
the load source. Also, an IBM i partition cannot consistently
boot from the DVD drive using D-mode IPL. An SRC C2004130 may be
logged for the load source not found error.
System firmware changes that affect all systems
- A problem was fixed
for a system IPL hang at C100C1B0 with SRC 1100D001 when the power
supplies have failed to supply the necessary 12-volt output for the
system. The 1100D001 SRC was calling out the planar when it
should have called out the power supplies. With the fix, the
system will terminate as needed and call out the power supply for
replacement. One mode of power supply failure that could trigger
the hang is sync-FET failures that disrupt the 12-volt output.
- A problem was fixed for the callout of a VPD collection
fault and system termination with SRC 11008402 to include the 1.2vcs
VRM FRU. The power good fault for the 1.2 volts would be a
primary cause of this error. Without the fix, the VRM is missing
in the callout list and only has the VPDPART isolation procedure.
- On multi-node systems with a power fault, a problem was fixed
for On-Chip Controller errors caused by the power fault being reported
as predictive errors for SRC B1602ACB. These have been corrected
to be informational error logs. If running without the fix, the
predictive and unrecoverable errors logged for the OCC on loss of power
to the node can be ignored.
- A problem was fixed for excessive logging of the SRC
11002610 on a power good (pgood) fault when detected by the Digital
Power Subsystem Sweep (DPSS). Multiple pgood interrupts are
signaled by the DPSS in the interval between the first pgood failure
and the node power down. A threshold was added to limit the
number of error logs for the condition.
- A problem was fixed for redundant logging of the SRC
B1504804 for a fan failure, once every five seconds. With the
fix, the failure is logged only at the initial time of failure in the
IPL.
- A problem was fixed for a false unrecoverable error (UE)
logged for B1822713 when an invalid cooling zone is found during the
adjustment of the system fan speeds. This error can be ignored as
it does not represent a problem with the fans.
- On a multi-node system, a problem was fixed for a
power fault with SRC 11002610 having incorrect FRU callouts. The
wrong second FRU callout is made on nodes 2, 3, and 4 of a multi-node
system. Instead of calling out the processor FRU, the enclosure
FRU is called out. The first FRU callout is correct.
- A problem was fixed for a processor clock failover error
with SRC B158CC62 calling out all processors instead of isolating to
the suspect processor. The callout priority correctly has a clock
and a procedure callout as the highest priority, and these should be
performed first to resolve the problem before moving on to the
processors.
- A problem was fixed for a system checkstop caused by an L2
cache least-recently used (LRU) error that should have been a
recoverable error for the processor and the cache. The cache
error should not have caused an L2 HW CTL error checkstop.
- A problem was fixed for priority callouts for system clock
card errors with SRC B158CC62. These errors had high priority
callouts for the system clock card and medium callouts for FRUs in the
clock path. With the fix, all callouts are set to medium priority
as the clock card is not the most probable FRU to have failed but is
just a candidate among the many FRUs along the clock path.
- A problem was fixed for PCIe switch recovery to prevent a
partition switch failure during the IPL with error logs for SRC
B7006A22 and B7006971 reported. This problem can occur when doing
recovery for an informational error on the switch. If this
problem occurs, the partition must be restarted to recover the affected
I/O adapters.
- A problem was fixed to correct the error messages for early
failures in the Live Partition Mobility (LPM) migration of a
partition. The management console might report an unrelated error
such as "HSCLA27E The operation to lock the physical device
location for target adapter" when the actual error might be not enough
available memory on the target CEC to run the migration. With the
fix, the correct error code is returned so there is enough information
to correct the error and retry the migration.
- A problem was fixed for a hypervisor task hang during a FRU
exchange on the PCIe3 I/O expansion drawer (#EMX0) that requires the
entire drawer to power off and power on again. The activation
phase for the power on may never complete if a very rare sequence of
events occurs during the power on step. The FRUs to exchange that
would cause the expansion drawer to power off and power on are
the following: midplane, I/O module, I/O module VRM, chassis
management card (CMC), cable card, and active optical cable.
- A problem was fixed for PCIe adapter hangs and network
traffic error recovery during Live Partition Mobility (LPM) and SR-IOV
vNIC (virtual ethernet adapter) operations. An error in the
PCI Host Bridge (PHB) hardware can persist in the L3 cache and fail all
subsequent network traffic through the PHB. The PHB error
recovery was enhanced to flush the PHB L3 cache to allow network
traffic to resume.
- A problem was fixed for a Qualys network scan for security
vulnerabilities causing a core dump in the Intelligent Platform
Management Interface (IPMI) process on the service processor with
SRC B181EF88. The error occurs anytime the Qualys scan is run
because it sends an invalid IPMI session id that should have been
handled and discarded without a core dump.
- A problem was fixed for error recovery from failed Live
Partition Mobility (LPM) migrations. The recovery error is caused
by a partition reset that leaves the partition in an unclean state with
the following consequences: 1) A retry on the migration for the
failed source partition may not be allowed; and 2) With enough
failed migration recovery errors, it is possible that any new migration
attempts for any partition will be denied. This error condition
can be cleared by a re-IPL of the system. The partition recovery error
after a failed migration is much more likely to occur for
partitions managed by the Integrated Virtualization Manager (IVM) but
it is still possible to occur for Hardware Management Console (HMC)
managed partitions.
- A problem was fixed for an L2 cache error on the service
processor that caused the service processor to reset or go to a failed
state with SRC B1817212 on systems with a single service
processor. On systems with redundant service processors, the
failed service processor would get guarded with a B151E6D0 or B152E6D0
SRC depending on which service processor fails. With the fix, the
L2 cache error is handled by single-bit correction with no error
reported to the service processor, so it can continue normal
processing. The L2 cache data error that causes this failure is
infrequent, and the service processor must exceed its limit of three
resets in fifteen minutes before it fails, so the service processor
failure rate for this problem is low.
- A security problem was fixed in OpenSSL for a possible
service processor reset on a null pointer de-reference during RSA PSS
signature verification. The Common Vulnerabilities and Exposures issue
number is CVE-2015-3194.
- A security problem was fixed in the lighttpd server on the
service processor, where a remote attacker, while attempting
authentication, could insert strings into the lighttpd server log
file. Under normal operations on the service processor, this does
not impact anything because the log is disabled by default. The
Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
- A problem was fixed for a hypervisor adjunct partition
that failed with "SRC B2009008 LP=32770" for an unexpected SR-IOV adapter
configuration. Without the fix, the system must be re-IPLed to
correct the adjunct error. This error is infrequent and can only
occur if an adapter port configuration is being changed at the same
time that error recovery is occurring for the adapter.
- A problem was fixed for a missing error log when a clock
card fails over to the backup clock card. This problem causes
loss of redundancy on the clock cards without a callout notification
that there is a problem with the FRU. If the fix is applied to a
system that had a failed clock, that condition will not be known until
the system is IPLed again when an error log and callout of the clock card
will occur if it is in a persisted failed state.
- A problem was fixed for the service processor going to the
reset state instead of the termination state when the anchor card is
missing or broken. At the termination state, the Advanced System
Manager Interface (ASMI) can be used to collect failure data and debug
the problem with the anchor card.
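The reset limit mentioned in the service processor L2 cache item above (the limit of three resets in fifteen minutes must be exceeded before the service processor fails) behaves like a sliding-window counter. The following minimal Python sketch is for illustration only. The class name ResetLimiter and its method are hypothetical and do not reflect the actual service processor implementation.

```python
from collections import deque


class ResetLimiter:
    """Track resets and report when a limit within a time window is exceeded."""

    def __init__(self, max_resets: int = 3, window_seconds: float = 15 * 60):
        self.max_resets = max_resets
        self.window_seconds = window_seconds
        self._resets = deque()  # timestamps (seconds) of recent resets

    def record_reset(self, now: float) -> bool:
        """Record a reset at time `now`; return True if the limit is exceeded."""
        self._resets.append(now)
        # Discard resets that fall outside the window ending at `now`.
        while self._resets and now - self._resets[0] > self.window_seconds:
            self._resets.popleft()
        return len(self._resets) > self.max_resets


limiter = ResetLimiter()
# Four resets within three minutes: only the fourth exceeds the limit of three.
print([limiter.record_reset(t) for t in (0, 60, 120, 180)])
```

With resets spaced further apart, older entries age out of the window and the limit is never exceeded, which is consistent with the low failure rate described for an infrequent error.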
System firmware changes that affect certain systems
- On systems with AIX or Linux encapsulated state partitions,
a problem was fixed for a Live Partition Mobility migration failure for
the encapsulated state partitions. The migration fails on the
target CEC when the associated paging space needed to support the
encapsulated state is not available. Removing the "Encapsulated
State" attribute from the partition would allow the migration to
succeed. However, removing this attribute can only be
accomplished if the partition is in the powered off state.
Encapsulated State partitions are needed for the remote restart
feature. An encapsulated state partition is a partition in which
the configuration information and the persistent data are stored
external to the server on persistent storage. A partition that
supports remote restart can be restarted remotely. For more
information on the remote restart feature, refer to this IBM Knowledge
Center link: http://www.ibm.com/support/knowledgecenter/P8DEA/p8efd/p8efd_lpar_general_props.htm
- For Integrated Virtualization Manager (IVM) managed systems
with more than 64 active partitions, a problem was fixed for recovery
from Live Partition Mobility (LPM) errors. Without the fix, the
IVM managed system partition can appear to still be running LPM
after LPM has aborted, preventing retries of the LPM operation.
In this case, the partition must be stopped and restarted to clear the
LPM error state. The problem is not frequent because it requires
a failed LPM on a partition with a partition ID that is greater than
64.
- On systems with an invalid P-side or T-side in the
firmware, a problem was fixed in the partition firmware Run-Time
Abstraction Services (RTAS) so that system Vital Product Data (VPD) is
returned at least from the valid side instead of returning no VPD
data. This allows AIX host commands such as lsmcode, lsvpd,
and lsattr that rely on the VPD data to work to some extent even if
there is one bad code side. Without the fix, all the VPD
data is blocked from the OS until the invalid code side is recovered by
either rejecting the firmware update or attempting to update the system
firmware again.
- A problem was fixed for an incorrect date in partitions
created with a Simplified Remote Restart-Capable (SRR) attribute where
the date is created as Epoch 01/01/1970 (MM/DD/YYYY). Without the
fix, the user must change the partition time of day when starting the
partition for the first time to make it correct. This problem
only occurs with SRR partitions.
- On systems using PowerVM firmware with dedicated processor
partitions, a problem was fixed for the dedicated processor
partition becoming intermittently unresponsive. The problem can be
circumvented by changing the partition to use shared processors.
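The "Epoch 01/01/1970" date in the Simplified Remote Restart-Capable (SRR) item above is what a zero Unix timestamp looks like when formatted as MM/DD/YYYY. The short Python sketch below shows this. It assumes the partition time of day starts as a zero timestamp, which the text implies but does not state, and the helper name format_mmddyyyy is hypothetical.

```python
from datetime import datetime, timezone


def format_mmddyyyy(seconds_since_epoch: int) -> str:
    """Format a Unix timestamp (UTC) as MM/DD/YYYY."""
    moment = datetime.fromtimestamp(seconds_since_epoch, tz=timezone.utc)
    return moment.strftime("%m/%d/%Y")


# A timestamp of zero is the start of the Unix epoch.
print(format_mmddyyyy(0))
```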
SC820_091_047 / FW820.30
11/18/15
Impact: Availability
Severity: HIPER
New Features and Functions
- The firmware code update process was enhanced with a
feature to block a firmware "downgrade" to a level that is below the
system's manufactured code level.
- Support was added to the Advanced System Management
Interface (ASMI) to be able to add an IPv4 static route definition for
each ethernet interface on the service processor. Using a static
route definition, a Hardware Management Console (HMC) configured
on a private subnet that is different from the service processor subnet
is now able to connect to the service processor and manage the
CEC. A static route persists until it is deleted or until the
service processor settings are restored to manufacturing
defaults. The static route is managed with the ASMI panel
"Network Services/Network Configuration/Static Route Configuration"
IPv4 radio button. The "Add" button is used to add a static route
(only one is allowed for each ethernet interface) and the "Delete"
button is used to delete the static route.
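The case that the static route addresses, an HMC on a private subnet that differs from the service processor subnet, can be checked with Python's standard ipaddress module. This is a minimal sketch and not part of ASMI. The function name needs_static_route and the addresses are made up for illustration.

```python
import ipaddress


def needs_static_route(sp_interface: str, hmc_address: str) -> bool:
    """Return True when the HMC address lies outside the service processor subnet.

    sp_interface is the service processor address with prefix length,
    for example "192.168.2.147/24" (a hypothetical value).
    """
    service_processor = ipaddress.ip_interface(sp_interface)
    hmc = ipaddress.ip_address(hmc_address)
    return hmc not in service_processor.network


# Same subnet: the HMC is directly reachable, no static route needed.
print(needs_static_route("192.168.2.147/24", "192.168.2.10"))
# Different subnet: a static route would be needed to reach the HMC.
print(needs_static_route("192.168.2.147/24", "10.0.5.20"))
```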
System firmware changes that affect all systems
- HIPER/Pervasive:
A problem was fixed for recovering from embedded MultiMediaCard (eMMC)
flash NAND errors and three other low-level boot errors that caused the
service processor to go to a failed state with SRC B1817212 on systems
with a single service processor. On systems with redundant
service processors, the failed service processor would get guarded with
a B151E6D0 or B152E6D0 SRC depending on which service processor
fails. Other low-level boot errors included in this fix:
1) A system reset to clear the boot registers may be erroneously
handled as a chip reset causing the service processor to enter a
stopped state and become unresponsive.
2) Improves recovery for a defective file system partition table that
causes the service processor to lose the ability to perform P and
T (Permanent and Temporary) side switch.
3) Do not fail on a dump partition full condition as this is normal
when a service processor has a maximum number of service processor
dumps active.
For each of these issues, on systems with redundant service processors,
the failed service processor would get guarded with a B151E6D0 or
B152E6D0 SRC depending on which service processor fails.
- HIPER/Non-Pervasive:
A problem associated with workloads
using transactional memory on PowerVM was discovered and is fixed in
this service pack. The effect of the problem is non-deterministic but
may include undetected corruption of data.
- HIPER/Non-Pervasive:
A problem was fixed for
recovery from PNOR flash memory corruption that causes the IPL to fail
with SRC D143900C. This is very rare and only has happened in IBM
internal labs. Without the fix, the service processor cannot correct
the corruption in the PNOR. If a system has the problem SRC and
cannot IPL, then that system must be disruptively firmware updated to
apply the fix to be able to IPL again.
- DEFERRED: A
problem was fixed for memory on-die
termination (ODT) settings to improve the signal integrity of the
memory channel.
- DEFERRED: A
problem was fixed for a TCP/IP performance degradation on PCIe ethernet
adapters with Remote Direct Memory Access (RDMA) over Converged
Ethernet (RoCE). By adjusting the system memory caching, a
significant improvement was made to the data throughput speed to
restore performance to expected levels. This fix requires a
system re-IPL to take effect.
- DEFERRED: A
problem was fixed for a hang in the processor and cache memory that
causes a system checkstop with SRC B181E540 logged with a processor FRU
callout. The error log details include "Description:
Runtime diagnostics has detected a problem on a memory bus" and
"Signature Description: mcs(n0p0c6) (MCIFIR[40]) CHANNEL TIMEOUT
ERROR" and "Multi-Signature List: ex(n0p0c14) (L3FIR[24]) L3 Hw
Control Error". The trigger for the hang error is speculative DMA
partial writes into cache and the frequency of the error varies with
the workload, but may happen several times a month. A re-IPL of
the system is needed for this fix to take effect after a concurrent
firmware update of the service pack.
- A problem was fixed for certain error logs not being reported
to the OS. The error occurs when the hypervisor is not ready to
receive an error log message and rejects it. The error log
handler on the service processor was not retrying until the error log
was successfully delivered. Until the fix is applied, there will
be a small loss of error logs when the hypervisor is initializing
during the IPL as these will get discarded until the hypervisor is
ready. The missing error logs may be viewed from the service
processor using the Advanced System Management Interface (ASMI) or may
be viewed as serviceable events on the management console if there is
one attached.
- A problem was fixed for the error reporting of multiple AC
power losses so that all occurrences of the power losses are
logged. With the problem, only the first AC power loss for SRC
10001510 is reported, with subsequent power faults not being
reported. Until the fix is applied, a re-IPL of the CEC will
re-enable power supply problem reporting.
- A problem was fixed for an SRC 11002613 logged during a
concurrent repair of a power supply. This SRC was erroneously
logged and did not represent a real problem.
- A problem was fixed for an intermittent SRC B1504804 logged
on a re-IPL of the CEC but that did not result in an IPL failure.
This problem is an inability of the service processor to do a read from
the IIC bus resulting from incorrect device lock management. This
problem has no adverse impact on the system other than a predictive
error log and can be ignored until the fix is applied.
- A problem was fixed for a bad Time of Day (TOD) battery
with SRC B15A3305 calling out the P1 Backplane instead of the P1-E2
Battery. This occurs whenever the TOD battery becomes bad.
Until the fix is applied, always replace the battery FRU for this SRC
as the first repair action.
- A problem was fixed for the capture of the registers for
the Hostboot Self-Boot Engine (SBE) for SBE failures. These
registers had been missing from failure data for SBE failures, making
these problems more difficult to debug.
- A problem was fixed for an Advanced System Management
Interface (ASMI) error message of "Error in function 'connect', error
code 111" when a browser attempted to connect before the service
processor was ready. The browser connection through the web
server is now held off until the ASMI process is ready after a reset of
the service processor or an AC power cycle of the system. Until
the fix is applied, the ASMI user can wait one or two minutes and then
retry the operation.
- A problem was fixed for an incorrect call home for SRC
B1818A0F. This call home can be ignored. It occurs rarely
only in the case of dynamic IP configuration for the service processor
when it fails to acquire an IP address from the Dynamic Host
Configuration Protocol (DHCP) server. Until the fix is applied,
use the information from the SRC and network topology to
understand why the DHCP client cannot acquire an IP address as this is
normally a network configuration error.
- A problem was fixed for a system dump re-IPL that failed
with SRC B1818601 and B181460B after processor core checkstops had
terminated the system. The failed processor cores created a
complex condition that prevented a successful dump collection of all
the hardware objects. Until the fix is applied, the checkstop
processor problems will have to be debugged with partial data from the
degraded dump collections that have the failure SRCs.
- A problem was fixed for an infrequent service processor
database corruption during concurrent firmware update that caused the
system to terminate with a UIRA impact to the customer. The cause
of the database corruption is undetermined but the problem is resolved
by the service processor making a backup of the data that can be
restored, if needed, to allow the firmware updates to complete
successfully.
- A problem was fixed for Advanced System Management
Interface (ASMI) TTY to allow "admin" passwords to be greater than
eight characters in length to be consistent with prior generations of
the product. The ASMI web interface works correctly for user
"admin" passwords with no truncation in the length of the passwords.
- A problem was fixed for a local clock card (LCC)
failure with SRC 11001515 that was missing a part number and location
code. This information has been added for LCC faults so the FRU
to replace is properly identified.
- A problem was fixed for a defective PCI oscillator in the
local clock card (LCC) with SRC BC58090F that caused an IPL failure for
the node instead of failing over to the redundant LCC. For a
multi-node system, the failure is isolated to the node with the
bad LCC and the other nodes are able to IPL.
- A problem was fixed for a service processor dump with error
logs B181E911 and B181D172 during an IPL. The error logs
were for the detection of defunct processes but otherwise the IPL was
successful.
- A problem was fixed for missing Keyword (KW) and Resource
ID (RID) for SRC B181A40F.
- A problem was fixed for an I2C bus lock error during a CEC
power off that caused a ten minute delay for the power off and
error log SRCs B1561314 and B1814803 with error number (errno) 3E.
- A problem was fixed for Advanced System Management
Interface (ASMI) help text for menu "I/O Adapter Enlarged Capacity"
being missing with the system IPLed and partitions running. The
help text, shown below, is now available for the system in the powered
on state as well as in the powered off state.
"I/O Adapter Enlarged Capacity
This option controls the size of PCI memory space allocated to each PCI
slot.
When enabled, the selected number of PCI slots, including those in
external I/O subsystems, receive the larger DMA and memory mapped
address space.
Some PCI adapters may require this additional DMA or memory space, per
the adapter specification.
This option increases system mainstore allocation to these selected PCI
slots.
Enabling this option may result in some PCI host bridges and slots not
being configured because the installed mainstore is insufficient to
configure all installed PCI slots."
- A problem was fixed for recovering from a misplug of the
service processor FSI cables (U2-P1-C10-T2 and U1-P1-C9-T2) where the
plug locations are reversed from what would be a proper
connection. Without the fix, the bad FSI connections cause the
service processors to go to the service processor stop state.
With the fix applied, the error logs call out the bad cables so they
can be repaired and the service processor remains in a working state.
- For a partition that has been migrated with Live Partition
Mobility (LPM) from FW730 to FW740 or later, a problem was fixed for a
Main Storage Dump (MSD) IPL failing with SRC B2006008. The MSD
IPL can happen after a system failure and is used to collect failure
data. If the partition is rebooted anytime after the migration,
the problem cannot happen. The potential for the problem existed
between the active migration and a partition reboot.
- A problem was fixed for partial loss of Entitlement for
On/Off Memory Capacity On Demand (also called Elastic COD). Users
with large amounts of Entitlement on the system of greater than "65535
GB * Days" could have had a truncation of the Entitlement value on a
re-IPL of the system. To recover lost Entitlement, the customer
can request another On/Off Enablement Code from IBM support to
"re-fill" their entitlement.
- A problem was fixed for a management console command line
failure with a return code 0x40000147 (invalid lock state) when trying
to delete SR-IOV shared mode configurations. This could have
occurred if the adapter slot had been re-purposed without involvement
of the management console and was owned and operational at the time of
the requested delete. With the fix, the current ownership of the
slot is honored and only the SR-IOV shared mode configuration data is
deleted on the force delete.
- A problem was fixed for an incorrect restriction on the
amount of "Unreturned" resources allowed for a Power Enterprise
Pool (PEP). PEP allows for logical moving of resources
(processors and memory) from one server to another. Part of this
is 'borrowing' resources from one server to move to another. This may
result in "Unreturned" resources on the source server. The management
console controls how many total "Unreturned" PEP resources can
exist. For this problem, the user had some "Unreturned" PEP
memory and asked to borrow more but this request was incorrectly
refused by the hypervisor.
- On systems where memory relocation (as done by using Live
Partition Mobility (LPM)) and a partition reboot are occurring
simultaneously, a problem for a system termination was fixed. The
potential for the problem existed between the active migration and the
partition reboot.
- A problem was fixed that was corrupting the Update Access
Key (UAK) date with a corrupted date of "1900". The user
should correct the UAK date, if needed, to allow the firmware update to
proceed, by using the original UAK key for the system. On the
Management Console, enter the original update access key via the
"Enter COD Code" panel. Or on the Advanced System Management Interface
(ASMI), enter the original update access key via the "On Demand
Utilities/COD Activation" panel.
- A problem was fixed for recovery from unaligned addresses
for MSI interrupts from PCIe adapters. The recovery prevents an
adapter timeout caused by resource exhaustion. With the fix, the
resources for each bad interrupt are returned, allowing the PCIe
adapter to continue to run for the normal traffic.
- A problem was fixed for a machine check incorrectly issued
to an IBM i partition running 7.2 or later with 4K sector disks.
- A problem was fixed for an extraneous PCIe switch SRC
B7006A22 being called out when there is a valid PCIe expansion
drawer cable problem with SRC B7006A88 reported. The callout for
SRC B7006A22 should be ignored as the PCIe switch hardware is working
for this case.
- A problem was fixed for a Network boot/install failure
using bootp in a network with switches using the Spanning Tree Protocol
(STP). A Network boot/install using lpar_netboot on the
management console was enhanced to allow the number of retries to be
increased. If the user is not using lpar_netboot, the number of
bootp retries can be increased using the SMS menus. If the SMS
menus are not an option, the STP in the switch can be set up to allow
packets to pass through while the switch is learning the network
configuration.
- A problem was fixed for PCIe3 adapters failing when
requesting more than 32 Message Signaled Interrupts (MSI-X). The
adapter may fail to ping or cause OS tasks to hang that are using the
adapter. This problem was found specifically on the 10 Gb
Ethernet-SR (Short Range) PCIe3 adapter with feature codes #5275 and
#5769 and on the 56 Gb Infiniband (IB) Fourteen Data Rate (FDR) adapter
with feature codes #EC32, #EC33, #EL3D, and #EL50 and CCIN 2CE7.
However, other PCIe adapters may also be affected.
- A security problem was fixed for an OpenSSL
specially crafted X.509 certificate that could cause the service
processor to reset in a denial-of-service (DoS) attack. The
Common Vulnerabilities and Exposures issue number is CVE-2015-1789.
- A problem was fixed for false errors reported with SRC
B1812663 for the On-Chip Controller (OCC). These error logs can
be ignored as these are caused by a prior error log using a buffer that
is not properly sized for the log data.
- A problem was fixed to prevent recoverable power faults of
short duration from causing the system to lose power supply
redundancy. Without the fix, the faulted state persisted for the
recovered power fault, causing a problem with a system power off if
other power supplies were lost at a later time.
- A problem was fixed to guard a failed processor during an
IPL instead of hanging with SRC B1813450 reported to the error log.
- A problem was fixed for an intermittent PSI link error with
SRC B15CDA27 after a firmware update or reset/reload of the service
processor.
- A problem was fixed for hardware system dump collection
after a hardware checkstop that was missing scan ring data. This
is a very infrequent problem caused by an error with timing in the
multi-threaded dump collection process. Until this fix is
applied, the debug of some hardware dump problems may require doing
multiple dump collections to get all the data.
- A problem was fixed for an Advanced System Management
Interface (ASMI) error that occurred when trying to display detail on a
deconfigured Anchor Card VPD. If the error log for the selected
deconfiguration record had been deleted, it caused ASMI to core
dump. With the fix, if the error log for the deconfiguration
record is missing, the error log details such as failing SRC for the
deconfiguration record are returned as blank.
- A problem was fixed for an Operations Panel SRC of B1504804
with no FRU callout. A callout of the failed hardware has been
added.
- A problem was fixed for guarding failed hardware
dynamically during the IPL to prevent the IPL from terminating.
Without the fix, certain hardware failures will not be called out
to be handled by the reconfiguration loop. Until the fix is applied,
multiple IPL attempts may be needed if hardware is failing.
- A problem was fixed for a processor error causing a
Hostboot terminate instead of a deconfiguration of the bad hardware and
continuation of the IPL. The state of the processors was
synchronized between the service processor and the Hostboot process to
correct the error.
- A problem was fixed for the recovery of a failing PCI clock
so that a failover to the backup PCI clock occurs without a node
failing and being deconfigured. Without the fix, the PCI clock
does not behave as a redundant FRU and faults on it will cause the CEC
to terminate. A re-IPL of the CEC recovers it from the PCI clock
error with the bad clock guarded so that the other PCI clock is used.
- A problem was fixed so that error logs are now
generated for thermal errors detected by the service processor.
Without the fix, thermal errors such as a temperature over the
threshold will not get reported in the error log but higher fan speeds
will be present as an indicator of the thermal problem. Until the
fix is applied, the error log and call home mechanism cannot be relied
on to monitor for system thermal problems.
- A problem was fixed for processor core checkstops that
cause an LPAR outage but do not create hardware errors and service
events. The processor core is deconfigured correctly for the
error. This can happen if the hypervisor forces processor
checkstops in response to excessive processor recovery.
- A problem was fixed for recovery from a processor local bus
(PLB) hang on the service processor. The errant PLB hang recovery
would be seen in concurrent firmware updates that, on rare occasions,
fail to do a side switch to activate to the new level of
firmware. On the management console, the error message would be
"HSCF010180E Operation failed ... E302F873 is the error code."
Other than the failed code level activation, the firmware update is
successful. If this problem occurs, the system can be set to the
new firmware level by doing a power off from the management console and
then doing a power on with side switch selected in the advanced
properties.
System firmware changes that affect certain systems
- On a system with
redundant service processors where redundancy is disabled, a problem
was fixed for an unrecoverable (UE) SRC B181DA19 being logged on a
re-IPL after a checkstop error. The error log did not interfere
with the re-IPL which was successful. The error log is for an
active Processor Support Interface (PSI) link not being found for the
backup service processor. This is a correct condition when
redundancy is disabled, so the error log should not have been
generated. Until the fix is applied, the error code can be
ignored.
- On multiple-node systems, a problem was fixed for
extraneous error logs after a 12V power fault with SRC 11002610.
After system termination, there are additional 110026B0 and 110026B3
error log entries that can be ignored.
- On a system with redundant service processors, a problem
was fixed for the isolation procedures for an Anchor card error and
system VPD collection failure with termination SRC B181A40F.
FSPSP04 and FSPSP06 are no longer called out as part of reporting the
VPD collection failure. FSPSP30 has been updated with isolation
steps for this problem and is called out and should be used for the
problem isolation. Retain tip H213935 also provides the FRU
isolation steps. Procedure FSPSP30 tries to replace the service
processor first. If that does not work, then the procedure has
the Anchor card replaced.
- On a system with redundant service processors, a problem
was fixed for failovers to the backup service processor that caused an
On-Chip Controller (OCC) abort. This placed the CEC in a "safe"
mode where it ran at reduced processor clock frequencies to prevent
exceeding the power limits while not under OCC control.
- On a system with an IBM i partition using Active Memory
Sharing (AMS), a problem was fixed for internal memory management
errors caused by deleting an IBM i partition that had been powered off
in the middle of a Main Storage Dump (MSD). Until the fix is
installed, if an MSD is interrupted for an IBM i partition that has AMS,
the partition should be powered on and powered off normally before a
delete of the partition is done to prevent errors with unpredictable
effects.
- On systems using PCIe adapters in SR-IOV mode, a problem
was fixed for occasional B200F011 and B2009008 SRCs that can occur
during an IPL, when moving an adapter into SR-IOV mode, or with SR-IOV link
up/down activity.
- On systems using PCIe adapters in SR-IOV mode, the
following problems were addressed with an Avago Technologies adapter
firmware update to 10.2.252.1905: 1) Eliminating virtual function
(VF) transmit errors during VF resets and 2) Preventing loss of
legacy flow control when an adapter port is connected to a priority
flow control (PFC) capable switch.
- On a system with redundant service processors, a problem
was fixed for a firmware update causing an error log server dump with
SRC B1818601. The error log server restarted automatically to
recover from the error and the firmware update was successful.
- On a system with an AIX partition and a Linux partition, a
problem was fixed for dynamically moving an adapter that uses DMA from
the Linux partition to the AIX partition that caused AIX to fail by
going into KDB mode (0c20 crash). The management console showed
the following message for the partition operation: "Dynamic move
of I/O resources failed. The I/O slot dynamic partitioning
operation failed.". The error was caused by Linux using 64K
mappings for the DMA window and AIX using 4K mappings for the DMA
window, causing incorrect calculations on AIX when it received the
adapter. Until the fix is applied, the adapters that use DMA
should only be moved from Linux to AIX when the partitions are powered
off.
- On a system with redundant service processors, a problem
was fixed for an IPL failure for a bad service processor cable on the
primary service processor with SRCs B1504904 and B18ABAAB logged.
The system should have done an error failover to the backup service
processor and continued the IPL to get the partitions running.
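
The DMA window item above attributes the AIX failure to a mapping-size mismatch: Linux used 64K mappings and AIX used 4K mappings for the same window. As a rough illustration only, the sketch below counts how many mappings are needed to cover one window at each page size. The window size and the helper name `mapping_entries` are assumptions made for the example and are not taken from the firmware.

```python
# Hypothetical illustration: the window size and function name are assumptions.
KIB = 1024


def mapping_entries(window_bytes: int, page_bytes: int) -> int:
    """Number of page-sized mappings needed to cover a DMA window."""
    if window_bytes % page_bytes != 0:
        raise ValueError("window size must be a multiple of the page size")
    return window_bytes // page_bytes


window = 2 * 1024 * 1024 * KIB  # assumed 2 GiB window, for illustration only
entries_64k = mapping_entries(window, 64 * KIB)  # Linux-style 64K mappings
entries_4k = mapping_entries(window, 4 * KIB)    # AIX-style 4K mappings
ratio = entries_4k // entries_64k
```

Any index or size computed with one page size is off by a factor of 16 when interpreted with the other, which is consistent with the incorrect calculations described in the item.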
|
SC820_087_047 / FW820.21
09/24/15 |
Impact: Performance
Severity: HIPER
System firmware changes that affect certain systems
- HIPER/Pervasive: On systems using PowerVM with shared processor partitions that are
configured as capped or in a shared processor pool, there was a problem
found that delayed the dispatching of the virtual processors which
caused performance to be degraded in some situations. Partitions
with dedicated processors are not affected. The problem is
rare and can be mitigated, until the service pack is applied, by
creating a new shared processor AIX or Linux partition and booting it
to the SMS prompt; there is no need to install an operating system on
this partition. Refer to help document http://www.ibm.com/support/docview.wss?uid=nas8N1020863
for additional details.
|
SC820_085_047 / FW820.20
07/16/15 |
Impact: Availability
Severity: SPE
New Features and Functions
- Support was added to the Advanced System Management
Interface (ASMI) to display Anchor card VPD failures in the
"Deconfigurations records" menu.
System firmware changes that affect all systems
- DEFERRED: A problem was fixed for the fabric bus to allow a
processor clock failover to be completed without a checkstop of the
CEC. A skew between the primary and secondary processor clock signal
was eliminated to fix the problem.
- DEFERRED: On systems with memory mirroring enabled, a problem was
fixed for PowerVM over-estimating its memory needs, allowing more
memory to be used by the partitions. To release the memory that the
hypervisor does not need so that the partitions can use it, the CEC
must be re-IPLed after the fix is applied.
- DEFERRED: A problem was fixed for the hypervisor being unable to
make a partition configuration change when all licensed memory is in
use by the partitions. An insufficient storage error is returned to
the management console and the management console may go to the
incomplete state for the CEC. The hypervisor management of memory
fragments has been improved so that partition configuration changes
can be made when all licensed memory is in use. To make this
additional memory available for the partition changes, the CEC must be
re-IPLed after the fix is applied.
- A problem was fixed for a missing SRC if the operations
panel failed while the system was running. A B156A023 SRC is now
logged if the operations panel fails or is removed while the system is
running.
- A problem was fixed that prevented a second management
console from being added to the CEC. In some cases, network
outages caused defunct management console connection entries to remain
in the service processor connection table, making connection
slots unavailable for new management consoles. A reset of the
service processor could be used to remove the defunct entries.
- A problem was fixed for a missing SRC when a Universal
Power Interconnect Cable (UPIC) to the system control unit (SCU) failed
or became loose while the system was running. Up to four hot
pluggable UPIC cables (#ECCA and #ECCB) provide redundant power to the
SCU but only one is needed for operation. When a UPIC cable fails
now, a SRC 11008802 is logged and calls out the loss of one of the
redundant power cables.
- A problem was fixed for a false guarding and call out of a
PSI link with SRC B15CDA27. This failure is very infrequent but
sometimes seen after the reset/reload of the service processor during a
concurrent firmware update. Since there is no actual
hardware failure, a manual unguarding of the PSI link allows it to be
reused.
- A problem was fixed for the LED lights being
interchanged for the Universal Power Interconnect Cable (UPIC) and the
GFSP interface card FRUs on the system node. The GFSP interface
card has CCIN 6B2E and part number 00E2598 with location codes of
Un-P1-C9-T2 and Un-P1-C10-T2. The UPIC cables have part numbers
00FX185 and 00FX186 with location codes Un-P1-C9-T1 and Un-P1-C10-T1.
- A problem was fixed for a CEC power off error with SRC
B1818903 logged. The error causes a dump and reset of the service
processor that allows the power off operation to complete.
- A problem was fixed for a two to four minute delay that
could occur when performing an Administrative Failover (AFO) of the
service processor. An On-Chip Controller (OCC) deadlock was
occurring in the service processor, leaving both service processors
in the backup role. This error state is automatically
corrected by the hypervisor with a host-initiated reset/reload when it
cannot find a service processor in the primary role after the delay
time-out period.
- A problem was fixed for losing power capping capability in
the On-Chip Controllers (OCCs) after a service processor
failover. When this occurs, a UE B1702A03 SRC is logged by the
OCC. To restore power capping, shut down all partitions and
power off the CEC. IPL the CEC again to restore power capping.
- A problem was fixed for the error handling of a Local Clock
and Control (LCC) card failure in a system node that triggers a flood of
FDAL informational SRCs of B1504800 to the error log, causing the
service processor to run out of memory and reset with a failover to the
backup service processor. The LCC has CCIN 682D and part number
00E2394 with location codes Un-P1-C11 and Un-P1-C12 as it is redundant
in each system node.
- A problem was fixed for an IPL failure with SRC B181BC04
when a system node was added to the CEC at service processor
standby. The new system node hardware was not added correctly to
the hardware scan ring and an AC power cycle of the CEC was needed to
fix the error.
- A problem was fixed for missing hardware data in system
dumps created for hardware checkstops. A certain class of
hardware scan rings were being skipped during the dump collection and
these are now included so that all the hardware data is available for
problem debug.
- A problem was fixed for missing "fastarray" data in
hardware dump type HWPROC. The "fastarray" contains debug
information for the processor cores.
- A problem was fixed for the Advanced System Management
Interface (ASMI) to allow removal of Hardware Management Console (HMC)
connections that have been temporarily disconnected. In some
instances, the ASMI "System Configuration/Hardware Management Consoles"
button for "Remove Connection" was not being shown.
- A problem was fixed for the Advanced System Management
Interface (ASMI) IPv4 Network Configuration where the IP address
was being overwritten by the value in the subnet mask field for the initial
values of the panel. If the network configuration was saved
without fixing the IP address, the wrong IP address was also saved.
- A problem was fixed for missing call outs when having
multiple "Memory Card/FRU" failures with SRC B124E504. There is a
call out for the first memory FRU of the failures but any other memory
FRUs failing at the same time were not reported.
- A problem was fixed for Administrative Failover (AFO)
having error log SRC B1818601. This error did not prevent the AFO
from completing as the backup service processor became the primary
service processor.
- A problem was fixed for an intermittent problem in a CEC
IPL where an On-Chip Controller is stuck in a reset loop, logging
repeated SRCs for B1702A17, and eventually places the CEC in safe mode,
running at minimum processor clock frequencies.
- A problem was fixed for errors during a CEC power off with
SRCs B1812616 and B1812601. These occurred if the CEC was powered
off immediately after a power on such that the On-Chip Controllers
(OCCs) had to shut down during their initialization.
- A problem was fixed for a highly intermittent IPL failure
with SRC B18187D9 caused by a defunct attention handler process.
Without this fix, the IPL will continue to fail until the service
processor is reset.
- A problem was fixed to add the callouts for the fan FRUs
for system fan faults with SRCs 11007610, 11007620, and 11007630.
The fan FRU with CCIN 6B42, part number 00E9335, and location code
Un-A1 is now included as needed.
- A problem was fixed for an Administrative Failover (AFO)
having error log SRC B185270E. This error did not prevent the AFO
from completing as the backup service processor became the primary
service processor. The error log has been made
informational as it is a normal occurrence when fan speeds are adjusted.
- A problem was fixed to allow adding a system node with only
one working Local Clock and Control (LCC) card and being able to IPL
the system node. The LCC is redundant, so a broken or missing LCC
should not cause an IPL to fail. The problem can be circumvented
by using the Advanced System Management Interface (ASMI) command line
on the primary service processor to run this command "rmgrcmd
--primary-lcc force-init" and then do the IPL.
- A problem was fixed for finding the path to the second
Local Clock and Control (LCC) card when an LCC card has failed to ensure
proper redundancy for the LCC and the system node.
- A problem was fixed for incorrect FRU callouts for Power
Line Disturbance (PLD) and Processor clock errors.
- A problem was fixed for extra FRU callouts being listed for
SRCs with multiple FRU callouts. The extra callouts are from
previous SRCs and should not have been listed for the current error log
entry.
- A problem was fixed for the Advanced System Management
Interface (ASMI) being allowed to deconfigure a node in a single-node
system. A safeguard was added so that ASMI can only deconfigure
nodes in multi-node CECs.
- A problem was fixed to include PCIe clocks as part of the
minimum hardware check during an IPL. Previously, no error was
logged when a system had no functional PCIe clocks, causing run-time
failures for PCIe I/O operations in partitions.
- A problem was fixed for missing FRU information in SRC
11001515. SRC 11001515 was logged indicating replacement of
power supply hardware, but did not include the location code, the part
number, the CCIN, or the serial number.
- A problem was fixed for concurrent firmware update after
concurrent PCIe adapter maintenance (add, remove, exchange, etc.)
causing the CEC to enter safe mode with its reduced performance.
In safe mode, the processor voltage/frequency is reduced to a "safe"
level where thermal monitoring is not required. Recovery from
safe mode requires a system re-IPL.
- A problem was fixed for an Administrative Failover (AFO)
failing with the backup service processor terminating with UE SRCs
B15738FD and B1573838. This failure was caused by an
intermittent error with the operations panel presence detection during
failover.
- A problem was fixed for an Administrative Failover (AFO)
having error log SRC B1814616 and a fwdbserver core dump. This
error did not prevent the AFO from completing as the backup service
processor became the primary service processor.
- A problem was fixed for a hypervisor deadlock that results
in the system being in an "Incomplete state" as seen on the management
console. This deadlock is the result of two hypervisor tasks
using the same locking mechanism for handling requests between the
partitions and the management console. Except for the loss of the
management console control of the system, the system is operating
normally when the "Incomplete state" occurs.
- A problem was fixed for Live Partition Mobility (LPM)
migrations of Linux partitions running in P8 compatibility mode.
After an active migration, the resumed partition may experience
performance degradation.
- A problem was fixed for a false error message with error
code 0x8006 when creating a virtual ethernet adapter with the
Integrated Virtualization Manager (IVM). The error message can be
ignored as the virtual ethernet slot is fully functional.
- A problem was fixed for the recovery of PCIe adapters for a
device outage occurring on the PCIe3 6-slot fanout module from the
PCIe3 I/O expansion drawer (#EMX0). One or more of the
adapters on the fanout module failed to recover with SRC BA188002.
- A problem was fixed for an unexpected interrupt from a PCIe
adapter that causes the AIX OS to abend. The extra interrupt
comes in from the adapter before it has been enabled for interrupts,
after it has reached End of Information (EOI) for its previous
session. The double interrupt from the adapter has been corrected.
- On systems using PowerVM, a problem was fixed for the
handling of the error of multiple cache hits in the instruction
effective-to-real address translation cache (IERAT). A multi-hit
IERAT error was causing system termination with SRC B700F105. The
multi-hit IERAT is now recognized by the hypervisor and reported to the
OS where it is handled.
- A problem was fixed for an MDC D-mode IPL that failed if the
MDC load source slots were unoccupied.
- A problem was fixed for systems with a corrupted date of
"1900" showing for the Update Access Key (UAK). The firmware
update is allowed to proceed on systems with a bad UAK date because the
override is set for the service pack. After the fix is installed,
the user should correct the UAK date, if needed, by using the original
UAK key for the system. On the Management Console, enter
the original update access key via the "Enter COD Code" panel. Or on
the Advanced System Management Interface (ASMI), enter the original
update access key via the "On Demand Utilities/COD Activation" panel.
- A problem was fixed for a hang during a Dynamic Platform
Optimizer (DPO) operation. A system re-IPL was needed to end the DPO
operation.
- A problem was fixed for concurrent firmware updates to a
system that needed to be re-IPLed after getting a B113E504 SRC during
activation of the new firmware level on the hypervisor. The code
update activate failed if the Sleep Winkle (SLW) images were
significantly different between the firmware levels. The SLW
contains the state of the processor and cache so it can be restored
after sleep or power saving operations.
- Support was added for USB 2.0 HUBs so that a keyboard
plugged into the USB 2.0 HUB will work correctly at the SMS
menus. Previously, a keyboard plugged into a USB 2.0 HUB was not
a recognized device.
- A problem was fixed for Live Partition Mobility (LPM) to
prevent a system failure with SRC B700F103 during LPM operations.
When data is moved by LPM, the underlying firmware code requires that
the buffers be 4K aligned, otherwise a system failure could
result. The fixes made now force the buffers to be 4K aligned and
if there is still an alignment issue, the LPM operation will fail
without impacting the system.
- A problem was fixed in the run-time abstraction services
(RTAS) extended error handling (EEH) recovery for EEH events for SR-IOV
Virtual Functions (VFs) to fully reconfigure the VF devices after an
EEH event. Since the physical adapter does recover from the EEH
event itself, and there are no error logs generated, it might not be
immediately apparent that the VF did not fully reconfigure. This
prevents certain PCIe settings from being established for interrupts
and performance settings, leading to unexpected adapter behavior and
errors in the partition.
- A security problem was fixed in OpenSSL where a remote
attacker could crash the service processor with a specially crafted
X.509 certificate that causes an invalid pointer or an out-of-bounds
write. The Common Vulnerabilities and Exposures issue numbers are
CVE-2015-0286 and CVE-2015-0287.
- A problem was fixed for an error log SRC B15738B0 with no
FRU callout for an FSI bus error.
- A problem was fixed for an error log SRC B1504803 with no
FRU callout for an IIC bus error.
- A problem was fixed for a memory error that prevented the
CEC from doing an IPL. The failing DIMM is now deconfigured
during the HostBoot part of the IPL and the failing section of the boot
is retried to get a successful IPL.
- A problem was fixed for a checkstop that occurred for a
failed Local Clock and Control (LCC) card instead of a failover to the
backup LCC card. The fabric bus erroneously detected a TOD
step error during the failover and triggered the checkstop.
- A problem was fixed for an On-Chip Controller (OCC) failure
after a system dump with SRCs B18B2616 and BC822024 reported.
This resulted in the system running with reduced performance in safe
mode, where processor clock frequencies are lowered to minimum levels
to avoid hardware errors since the OCC is not available to monitor the
system. A re-IPL of the system would resolve the problem.
- A problem was fixed for new service processor error logs
not getting created if too many old error logs exist. This
problem can occur if a large number of small error logs get created and
use up all the available inodes (directory entries) for the file
system. The error log garbage collector was not checking the
available number of inodes correctly, so it was not always deleting old
error logs before attempting to create a new error log.
Without the fix, this problem will continue until some error logs
are purged.
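
The last item above describes a garbage collector that must check available inodes before a new error log is created. A minimal sketch of that kind of check follows, assuming a POSIX file system and a directory of log files. The function names, the `min_free_inodes` threshold, and the oldest-first policy are assumptions for illustration and do not describe the service processor's actual implementation; `os.statvfs` is the standard Python call for reading file system statistics.

```python
import os


def free_inodes(path: str) -> int:
    """Inodes available to unprivileged users on the file system holding path."""
    return os.statvfs(path).f_favail


def purge_oldest(log_dir: str, min_free_inodes: int, inodes_free=free_inodes) -> list:
    """Delete the oldest files in log_dir until enough inodes are free.

    min_free_inodes and the directory layout are assumptions for illustration.
    Returns the names of the files that were removed, oldest first.
    """
    removed = []
    # Sort regular files by modification time so the oldest is purged first.
    entries = sorted(
        (e for e in os.scandir(log_dir) if e.is_file()),
        key=lambda e: e.stat().st_mtime,
    )
    for entry in entries:
        # Re-check the inode count before each deletion, not just once.
        if inodes_free(log_dir) >= min_free_inodes:
            break
        os.remove(entry.path)
        removed.append(entry.name)
    return removed
```

Checking the inode count rather than only the free bytes matters here because many small logs can exhaust directory entries while plenty of disk space remains.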
|
SC820_075_047 / FW820.12
05/18/15 |
Impact: Function
Severity: ATT
System firmware changes that affect all systems
- A problem was fixed for a clearing of all guard records
associated with one error log entry. If a FRU is replaced for any
of the related guard record, all the related guard records are
cleared. Previously, only the guard record for the replaced FRU
was cleared and the association was lost.
- A fix was made to prevent processor speculative memory
loads from the service processor mailbox Direct Memory Access (DMA)
area in the CEC memory. The speculative loads caused memory cache
faults and system checkstops with SRC B181E540.
- A problem was fixed to reduce switching noise on the memory address bus
for DIMMs. Noise on the bus could cause a failure for a marginal
DIMM, so this fix has the effect of potentially improving the
reliability of the memory.
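
The first item in this service pack describes clearing every guard record tied to one error log entry when any of the related FRUs is replaced. A minimal sketch of that association follows. The record fields `fru` and `error_log_id`, the location codes, and the function name are hypothetical and are used only to illustrate the behavior described in the item.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GuardRecord:
    fru: str           # hypothetical FRU location code
    error_log_id: int  # hypothetical ID of the error log entry behind the record


def clear_for_replaced_fru(records, replaced_fru):
    """Return the guard records that remain after a FRU replacement.

    Every record that shares an error log entry with the replaced FRU is
    cleared, not only the record for the replaced FRU itself.
    """
    related_logs = {r.error_log_id for r in records if r.fru == replaced_fru}
    return [r for r in records if r.error_log_id not in related_logs]
```

For example, if two FRUs were guarded by the same error log entry, replacing either one clears both records, while records created by other error log entries are left in place.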
|
SC820_070_047 / FW820.11
04/03/15 |
Impact: Function
Severity: SPE
System firmware changes that affect certain systems
- On systems with a large number of memory DIMMs (64 or more) and redundant service
processors, a problem was fixed for a firmware update failure with SRC
E302F966 when a failover was attempted as part of the firmware update,
but the service processors did not change roles. This also fixes
failing Administrative Failovers (AFOs) for systems with large
memory. The performance of the CEC memory initialization was
improved to prevent the hypervisor time-outs for service processor
failovers.
|
SC820_067_047 / FW820.10
03/12/15 |
Impact: Security
Severity: HIPER
New Features and Functions
- Support for setting Power Management Tuning Parameters from
the management console (Fixed Maximum Frequency (FMF), Idle Power Save,
and DPS Tunables) without needing to use the Advanced System Management
Interface (ASMI) on the service processor. This allows FMF mode
to be set by default without having to modify any tunable parameters
using ASMI.
- Support for SSLv3 has been discontinued to reduce security
vulnerabilities in the secured connections to the service processor.
- Support was added for Single Root I/O Virtualization
(SR-IOV) that enables the hypervisor to share a SR-IOV-capable
PCI-Express adapter across multiple partitions. Two Ethernet adapters
are supported with the SR-IOV NIC capability, when placed in the Power
E880/E870:
• PCIe2 LP 4-port (10Gb FCoE and 1GbE) SR&RJ45
Adapter (#EN0L)
• PCIe2 LP 4-port (10Gb FCoE and 1GbE) SFP+Copper and
RJ45 Adapter (#EN0J)
These adapters each have four ports, and all four ports are enabled
with SR-IOV function. The entire adapter (all four ports) is configured
for SR-IOV or none of the ports is.
System firmware updates the adapter firmware level on these adapters to
10.2.252.16 when a supported adapter is placed into SR-IOV mode.
Support for SR-IOV adapter sharing is not yet available for adapters in
a PCIe Gen3 I/O Expansion Drawer.
SR-IOV NIC on the Power E870/E880 is supported by:
• AIX 6.1 TL9 SP4 and APAR IV63331, or later
• AIX 7.1 TL3 SP4 and APAR IV63332, or later
• IBM i 7.1 TR9, or later
• IBM i 7.2 TR1, or later
• Red Hat Enterprise Linux 6.5, or later
• Red Hat Enterprise Linux 7, or later
• SUSE Linux Enterprise Server 11 SP3, or later
• VIOS 2.2.3.4 with interim fix IV63331, or later
System firmware changes that affect all systems
- HIPER/Pervasive: A problem was fixed for a processor clock failover with SRC B158CC62
that caused a system checkstop when the backup clock oscillator did not
initialize fast enough.
- A problem was fixed for the iptables process consuming all
available memory, causing an out of memory dump and reset/reload of the
service processor.
- A problem was fixed for a PowerVM hypervisor hang after a
processor core and system checkstop. The failed processor core
was not put into a guarded state and the hypervisor hung when it tried
to use the failed core.
- A problem was fixed for an oscillator error caused by a
power line disturbance that logged a UE SRC B150CC62 with no FRU call
outs. The error SRC was changed from unrecoverable to
informational as no service action is required.
- A problem was fixed for the NEBS DC power supply showing up
in the part inventories for the CEC as "IBM AC PS". The
description string has been changed to "IBM PS" as power supplies can
be of DC or AC type.
- A problem was fixed for the power supplies to add a monitor
process for the second rotor in each power supply that was not being
monitored. This will improve fault isolation for power supply
problems. A fix for the second rotor in an earlier service pack
release provided the monitor infrastructure but was missing the monitor
process.
- A problem was fixed for a FSI link heartbeat surveillance
fault with SRC B1504813 logged that has no FRU call outs. The FRU
call outs have been added.
- A problem was fixed with the Advanced System Management
Interface (ASMI) VPD menu where the Generic External Connector (GC) FRU
was displayed as an unknown FRU type. The "Unknown" has been
replaced with "Generic External Connector".
- A problem was fixed for a system fan identify LED not being
able to light after a Digital Power Systems Sweep (DPSS) chip
failover. The fan LED ownership was not transferred to the new
primary DPSS chip, so it was unable to light the LED under fan fault
conditions.
- A problem was fixed for SRC B1104800 having duplicate FRU
call outs for the PNOR flash FRU.
- A problem was fixed to prevent the Advanced System
Management Interface (ASMI) "System Service Aids/Factory Configuration"
panel option from restoring to factory configuration for FSP or ALL if
one boot side of the service processor is marked invalid. The
following informational message is issued: "The request cannot be
performed because a firmware boot side is marked invalid. This
state may have been caused by a previous firmware update failure."
- A problem was fixed for error log with SRC B150DA19,
created on the backup service processor for a PSI link failure detected
on the primary, not being visible in the error logs on the
primary service processor.
- A problem was fixed in the hardware server to prevent a UE
B181BA07 abort when a host boot dump collection is in progress.
- A problem was fixed for an LED fault with SRC B181A734 that
occurred during a normal rebuild of the LED tables, resulting in the
LED not being lit. The problem has been fixed using retries for
LEDs that are in a busy state.
- A problem was fixed for a PSI link failure with SRC
B1517212 that resulted in a service processor stop state. The
correct state for a system with broken PSI links is the terminate state
so the problem can be resolved with a call home service event.
- A problem was fixed to prevent false oscillator error logs
of SRC B150CC62 for errors unrelated to clock failures.
- A security problem was fixed in OpenSSL for padding-oracle
attacks known as Padding Oracle On Downgraded Legacy Encryption
(POODLE). This attack allows a man-in-the-middle attacker to
obtain a plain text version of the encrypted session data. The Common
Vulnerabilities and Exposures issue number is CVE-2014-3566. The
service processor POODLE fix is implemented by disabling SSL protocol
SSLv3 and requiring TLSv1.2 protocol on all secured connections.
The Hardware Management Console (HMC) also requires a POODLE fix for
APAR MB03867(FIX FOR CVE-2014-3566 FOR HMC V8 R8.2.0 SP1 with PTF
MH01455). This HMC minimum requirement is enforced by the
firmware update process for this defect.
- A problem was fixed for firmware updates that caused the
primary service processor to be guarded and SRC B152E6D0 and SRCs of
form B181XXXX to be logged.
- A problem was fixed for intermittent firmware database
errors that logged a UE SRC of B1818611 and had a fwdbServer core dump.
- A problem was fixed to enable the redundant Vital Product
Data (VPD) SEEPROM for processors and voltage regulator modules
(VRMs). Previously, only the primary SEEPROM was programmed with
the FRU data with no backup protection.
- A problem was fixed for vague error text for SRC B1504922
for a bad SMP cable. It was made more specific to state that an
incorrect cable length was detected.
- A problem was fixed for an intermittent reset/reload of the
service processor during the early part of an IPL with SRC B1814616
logged.
- A problem was fixed for hardware presence detection and
local clock card (LCC) failover. The system could not detect
critical system hardware with the default LCC missing, causing an
error when failing over to the backup LCC.
- A problem was fixed for non-optimal voltage levels from the
power supplies. Having the power supply output voltages meet the
exact specifications will help prevent stress-related hardware failures.
- A problem was fixed for an error in the "Enlarged IO
Capacity Slot Count" that caused more memory than expected to be
consumed by the hypervisor. If the "Enlarged IO Capacity Slot
Count" was not a "1", it was wrongly changed to an "8" by the IPL
process, increasing the amount of memory that needs to be reserved for
I/O buffers. Retain tip H213684 tells how to reduce the
hypervisor memory consumption when this problem happens as the fix will
not change the value automatically:
With the system at the "Power Off" state, take the following actions to
free up some memory from the hypervisor:
- Log into ASMI and then select "System Configuration"
menu
- Select "I/O Adapter Enlarged Capacity"
option
- Use the pulldown to select "1" as the new value for all nodes
- After changing the value click on the "Save" setting. The change will
be active on the next IPL of the system.
- A problem was fixed for the PCIe reset line (PERST) to keep
it active during the IPL until both system power and clocks are
stable. Keeping the PCIe devices in reset until the environment
is stable prevents PCIe device lockup.
- A problem was fixed to prevent a hypervisor task failure if
multiple resource dumps running concurrently run out of dump buffer
space. The failed hypervisor task could prevent basic logical
partition operations from working.
- On systems using the Virtual I/O Server (VIOS) to share
physical I/O resources among client logical partitions, a problem was
fixed for memory relocation errors during page migrations for the
virtual control blocks. These errors caused a CEC termination
with SRC B700F103. The memory relocation could be part of the
processing for the Dynamic Platform Optimizer (DPO), Active Memory
Sharing (AMS) between partitions, mirrored memory defragmentation, or a
concurrent FRU repair.
- A problem was fixed that could result in unpredictable
behavior if a memory UE is encountered while relocating the contents of
a logical memory block during one of these operations:
- Reducing the size of an Active Memory Sharing (AMS) pool.
- On systems using mirrored memory, using the memory mirroring
optimization tool.
- Performing a Dynamic Platform Optimizer (DPO) operation.
- A problem was fixed for PCIe link width faults on the
I/O expansion drawer (F/C #EMX0) to only log the SRC B7006A8B once for
each FRU instead of having multiple SRCs and call outs for the same
part.
- A problem was fixed for a wrong state for the PCIe link
LEDs (lit when link has failed) to the I/O expansion drawer with
feature code #EMX0. The fix ensures that the link operational
LEDs are not lit when the link to the I/O drawer has failed.
- A problem was fixed for an incorrect SRC of B7006A9F logged
for I/O drawer VPD mismatch during an enclosure serial number update of
the I/O drawer (F/C #EMX0). The incorrect SRC was logged if the
non-primary service path module (right bay) was in a failed state.
- A problem was fixed for a SRC B7006A84 PCIe link down event
not being reported as a failed link for the I/O expansion drawer (F/C
#EMX0) in the PCIe topology status in the Advanced System Manager
Interface (ASMI) or on the management console.
- A problem was fixed for the Live Partition Mobility (LPM)
migration of virtual devices to a Power8 system to update each virtual
device location code correctly to reflect the location code in the
target system instead of the location code in the source system.
This problem prevented the management console from being able to look
up AIX Object Data Manager (ODM) names for the virtual devices so that
operations such as remove on the device could not be performed.
- A problem was fixed for PCIe adapters requesting PCI I/O
space that triggers a SRC BA1800007 error log. This SRC should
not have been logged since PCI I/O spaces are not supported by Power8
systems. The SRC log is now suppressed.
- A problem was fixed for a processor core unit being
deconfigured but not guarded for a SRC B113E504 processor error in host
boot with fault isolation register (FIR) code
"RC_PMPROC_CHKSLW_NOT_IN_ETR" that caused the CEC to go to
termination. By guarding the failed processor core, the fix
ensures the core is not used on the re-IPL of the CEC.
- A security problem was fixed in OpenSSL for memory leaks
that allowed remote attackers to cause a denial of service (out of
memory on the service processor). The Common Vulnerabilities and
Exposures issue numbers are CVE-2014-3513 and CVE-2014-3567.
- A security problem in GNU Bash was fixed to prevent
arbitrary commands hidden in environment variables from being run
during the start of a Bash shell. Although GNU Bash is not
actively used on the service processor, it does exist in a library so
it has been fixed. This is IBM Product Security Incident Response
Team (PSIRT) issue #2211. The Common Vulnerabilities and
Exposures issue numbers for this problem are CVE-2014-6271,
CVE-2014-7169, CVE-2014-7186, and CVE-2014-7187.
- A problem was fixed to add failure recovery in the early
boot of the service processor so that the boot is retried on failure
instead of the service processor going unresponsive with SRC B1817212
on the operations panel.
- A problem was fixed for isolating and repairing DIMM memory
failures at the byte level without affecting other ranks of memory.
This fix substantially reduces the FRU call outs of DIMMs for memory
problems.
- A security problem was fixed in OpenSSL where the service
processor would, under certain conditions, accept Diffie-Hellman client
certificates without the use of a private key, allowing a user to
falsely authenticate. The Common Vulnerabilities and Exposures
issue number is CVE-2015-0205.
- A security problem was fixed in OpenSSL to prevent a denial
of service when handling certain Datagram Transport Layer Security
(DTLS) messages. A specially crafted DTLS message could exhaust
all available memory and cause the service processor to reset.
The Common Vulnerabilities and Exposures issue number is CVE-2015-0206.
- A security problem was fixed in OpenSSL to prevent a denial
of service when handling certain Datagram Transport Layer Security
(DTLS) messages. A specially crafted DTLS message could trigger a
null pointer dereference and cause the service processor to
reset. The Common Vulnerabilities and Exposures issue number is
CVE-2014-3571.
- A security problem was fixed in OpenSSL to fix multiple
flaws in the parsing of X.509 certificates. These flaws could be
used to modify an X.509 certificate to produce a certificate with a
different fingerprint without invalidating its signature, and possibly
bypass fingerprint-based blacklisting. The Common Vulnerabilities
and Exposures issue number is CVE-2014-8275.
- A security vulnerability, commonly referred to as GHOST,
was fixed in the service processor glibc functions gethostbyname() and
gethostbyname2() that allowed remote users of the functions to cause a
buffer overflow and execute arbitrary code with the permissions of the
server application. There is no way to exploit this vulnerability
on the service processor but it has been fixed to remove the
vulnerability from the firmware. The Common Vulnerabilities and
Exposures issue number is CVE-2015-0235.
- A problem was fixed for an incorrect SRC logged for an
unplugged cable to the PCIe I/O expansion drawer (F/C #EMX0). A
B7006A88 SRC was errantly logged that calls out the cable as bad
hardware that needs to be replaced. This is replaced with SRC
B7006A82, which indicates that a cable to a PCIe FanOut module in the
I/O expansion drawer is unplugged.
- A problem was fixed for missing dump data for cores and L3
cache memory when there is core checkstop and deconfiguration of the
core.
- A problem was fixed for a false power supply fan failure
with SRC 1100152F. If the AC was interrupted to the power supply,
the SRC 11001525 would have been logged for a bad fan with a call out
of the power supply for replacement.
- A problem was fixed for a partition deletion error on the
management console with error code 0x4000E002 and message
"...insufficient memory for PHYP". The partition delete operation
has been adjusted to accommodate the temporary increase in memory usage
caused by memory fragmentation, allowing the delete operation to be
successful.
- A problem was fixed for disruptive firmware updates to
prevent false reference clock failures with SRC B1814805 and a hang in
the IPL for the CEC.
- A problem was fixed for a memory leak associated with the
logging of SRC B1561311 for a bad voltage regulator module (VRM).
- A problem was fixed for the processor module replacement
process to prevent VPD corruption on the primary and redundant VPD
chips on the new processor module. This corruption resulted in
the processor being unusable with HostBoot failing with unrecoverable
errors (UEs) of SRCs BC8A090F and BC8A1701.
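The once-per-FRU logging behavior described above for SRC B7006A8B can
be illustrated with a small sketch. This is not the firmware
implementation: the class name SrcLog, the method log_once, and the FRU
location codes are hypothetical and chosen only to show the idea of
suppressing repeated SRCs and call outs for the same part.

```python
from dataclasses import dataclass, field


@dataclass
class SrcLog:
    """Hypothetical error log that records each (SRC, FRU) pair only once."""
    entries: list = field(default_factory=list)
    _seen: set = field(default_factory=set)

    def log_once(self, src: str, fru: str) -> bool:
        """Record the SRC for this FRU; return False if it was already logged."""
        key = (src, fru)
        if key in self._seen:
            return False  # duplicate fault on the same part, suppress it
        self._seen.add(key)
        self.entries.append(key)
        return True


log = SrcLog()
# Hypothetical location codes; repeated faults on one FRU yield one entry.
for fru in ["U1.P1-C1", "U1.P1-C1", "U1.P1-C2", "U1.P1-C1"]:
    log.log_once("B7006A8B", fru)
print(len(log.entries))  # prints 2
```

Keying the set on both the SRC and the FRU keeps distinct parts visible
in the log while collapsing repeated reports for the same part.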
System firmware changes that affect certain systems
- HIPER/Pervasive:Deferred:
On a system configured for a large number of PCIe adapters across
multiple PCIe I/O expansion drawers (F/C #EMX0), a problem was fixed so
that the PCIe adapters worked correctly in the system.
Previously, the PCIe interrupt servicing could deadlock, causing the
PCIe adapter cards to become unresponsive.
- For a system with Virtual Trusted Platform Module (VTPM)
partitions, a problem was fixed for a management console error
that occurred while restoring a backup profile that caused the system
to go to the management console "Incomplete" state. The failed
system had a suspended VTPM partition and a B7000602 SRC logged.
- For systems with IBM i partitions, a problem was fixed for
the "5250 Application Capable" capability so it is passed to the IBM i
partition as "True" if purchased. Previously, the capability
was not sent to the partition, so the extra performance provided by the
"Fast Green Screen Performance" feature in IBM i could be missing.
There is a delay of up to 15 minutes after this fix is installed before
it becomes active on the system. If the updated capability
property does not show up in the management console CEC properties as
"True", the cause is a slow refresh of the capability properties
on the management console and not a problem with the fix. To
resolve this issue with the capability not displaying correctly,
rebuild the managed system on the management console and then wait up
to one hour for the CEC property capability "5250 Application Capable"
to be updated to "True".
- On a system with a Linux partition, a problem was fixed for
the Linux "lsslot" command so that it is able to find the F/C #EC41
and #EC42 PCIe 3D graphics adapters installed in the CEC, instead of
showing the slot as "empty". The Linux graphics adapter worked
correctly even though it showed as "empty".
- On systems with a PCIe 3D graphics adapter (F/C #EC41 or
#EC42) in a partition, a problem was fixed for a partition hang or
BA21xxxx error conditions during partition initialization.
- A problem was fixed for certain workloads that caused the
system to enter safe mode (mode for running at minimum processor
frequencies) when the On-chip controllers (OCCs) did not get the
Analog Power Subsystem Sweep (APSS) frequency control data within
the OCC time out period. The time out for an OCC update has been
increased so the OCC can tolerate periods of high bus use that slow
down the APSS communication.
- On a system with redundant service processors, a problem
was fixed for a bad pointer reference in the mailbox function during data
synchronization between the two service processors. The
de-reference of the bad pointer caused a core dump, reset/reload, and
fail-over to the backup service processor.
|
SC820_051_047 / FW820.03
01/27/15 |
Impact: Serviceability
Severity: SPE
System firmware changes that affect all systems
- A problem was fixed in concurrent firmware update to
prevent the secondary service processor from going to a failed state.
- A problem was fixed for the power supply fans to monitor
both rotors instead of one to prevent a failure in one rotor from
shutting down the power supply.
- A problem was fixed for firmware updates to reduce the
number of informational B181A85E SRCs for an expected SQL lock
condition during a database transaction. Previously, several
thousand B181A85E SRC entries were created for the error log, slowing
performance of the service processor and flooding the error log.
- A problem was fixed for reset/reload failures caused by
excessive synchronization of thermal management data with the redundant
service processor.
- A problem was fixed for failovers to the secondary service
processor failing with SRC B1818601 caused by a bad database object
reference.
System firmware changes that affect certain systems
- For a system with memory mirroring activated and a memory
block size of 16 Megabytes, a problem was fixed for a system dump that
caused Hypervisor Real Mode Offset (HRMO) data structure corruption in
the physical memory map. This problem could cause
concurrent firmware update failures or subsequent system dumps to be
corrupted.
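The level identifiers used in this history, such as SC820_051_047, can
be split into fields programmatically. The sketch below assumes the
NNSSS_FFF_DDD convention (NN platform, SSS release, FFF service pack,
DDD last disruptive service pack) and the usual rule that an update is
concurrent only when the release matches and the installed service pack
is at or above the last disruptive service pack of the new level.
Neither the convention nor the rule is stated in this section, so treat
both as assumptions. The names parse_level and is_disruptive are
hypothetical.

```python
import re
from typing import NamedTuple

# Assumed layout: optional two-digit package prefix, then NNSSS_FFF_DDD.
LEVEL_RE = re.compile(r"^(?:\d{2})?([A-Z]{2})(\d{3})_(\d{3})_(\d{3})$")


class Level(NamedTuple):
    platform: str         # NN, for example "SC"
    release: int          # SSS, for example 820
    service_pack: int     # FFF
    last_disruptive: int  # DDD


def parse_level(text: str) -> Level:
    """Split a firmware level string into its assumed fields."""
    m = LEVEL_RE.match(text.strip())
    if m is None:
        raise ValueError(f"unrecognized firmware level: {text!r}")
    platform, release, fff, ddd = m.groups()
    return Level(platform, int(release), int(fff), int(ddd))


def is_disruptive(installed: str, new: str) -> bool:
    """Apply the assumed rule for disruptive versus concurrent updates."""
    cur, nxt = parse_level(installed), parse_level(new)
    if cur.platform != nxt.platform:
        raise ValueError("levels belong to different platforms")
    return (
        cur.release != nxt.release
        or nxt.service_pack == nxt.last_disruptive
        or cur.service_pack < nxt.last_disruptive
    )


print(parse_level("SC820_051_047"))
print(is_disruptive("SC820_048_047", "SC820_051_047"))  # prints False
```

Under these assumptions, moving from SC820_048_047 to SC820_051_047
would be concurrent because the installed service pack (048) is not
below the last disruptive service pack (047) of the new level.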
|
SC820_048_047 / FW820.02
12/01/14 |
Impact: New
Severity: New
New Features and Functions
|