AS730
For Impact, Severity and other Firmware definitions, please
refer to the 'Glossary of firmware terms' URL below:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs
The complete Firmware Fix History for this Release Level can be
reviewed at the following URL:
http://download.boulder.ibm.com/ibmdl/pub/software/server/firmware/AS-Firmware-Hist.html
|
AS730_163_093
/ FW731.77
04/01/16
|
Impact: Security
Severity: ATT
System firmware changes that affect all systems
- A problem was fixed for logical partitions not booting
after replacement of both DCCAs and service processors in the service
drawer. If the service processors contained incorrect topology
data, it was not recalculated, causing bad route information and a hang
when booting the partitions. With the fix, the Local Network
Management Controller (LNMC) does a recalculation for the topology when
both service processors are replaced, allowing the partitions to boot
successfully.
- A problem was fixed for the Integrated Switch Network
Manager (ISNM) performance counter output having an incorrect Global
Counter timestamp value. Without the fix, the global counter
value is filled with the local GC ID.
- A security problem was fixed in the lighttpd server on the
service processor, where a remote attacker, while attempting
authentication, could insert strings into the lighttpd server log
file. Under normal operations on the service processor, this does
not impact anything because the log is disabled by default. The
Common Vulnerabilities and Exposures issue number is CVE-2015-3200.
- A problem was fixed for reporting all optical link UE
errors through the Local Network Management Controller (LNMC).
Without the fix, some of the errors are hidden from the LNMC reports
because a threshold count for the error must be exceeded before it is
reported to the LNMC. Even though some of the errors are hidden
from the LNMC, they are all visible in the service processor error log.
- On the BPC, a problem was fixed for the remote hardware
vitals (rvitals) command returning an incorrect input voltage when
there are failed Bulk Power Regulators (BPRs) on the line cord.
With the fix, the BPC reports the highest valid value from BPRs,
instead of averaging the voltages.
- On the BPC, a problem was fixed for the remote hardware
vitals (rvitals) command returning old (stale) power usage numbers for
CECs that are deactivated when the power usage should be zero.
With the fix, the deactivated CECs show zero power usage.
- A security problem was fixed in OpenSSL for a possible
service processor reset on a null pointer de-reference during RSA PSS
signature verification. The Common Vulnerabilities and Exposures issue
number is CVE-2015-3194.
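The rvitals input-voltage fix in this entry (reporting the highest valid BPR value instead of an average) can be illustrated with a minimal sketch. It assumes, purely for illustration, that a failed BPR is represented by a reading of `None` and that the earlier behaviour counted failed BPRs as 0 V in the average. The function names and data layout are hypothetical and are not taken from the BPC firmware.

```python
from typing import Optional, Sequence


def line_cord_input_voltage(readings: Sequence[Optional[float]]) -> Optional[float]:
    """Return the highest valid BPR reading, or None if no BPR reports a value.

    Hypothetical model: a failed BPR is represented as None.
    """
    valid = [v for v in readings if v is not None]
    return max(valid) if valid else None


def naive_average(readings: Sequence[Optional[float]]) -> float:
    """Assumed pre-fix style behaviour: failed BPRs count as 0 V in the average."""
    return sum(v or 0.0 for v in readings) / len(readings)


# Two of the four BPRs on the line cord have failed (illustrative values).
readings = [480.0, None, 478.0, None]
print(line_cord_input_voltage(readings))  # 480.0
print(naive_average(readings))            # 239.5
```

With two failed regulators, the averaged figure is roughly half the real line voltage, while the highest valid reading stays close to it.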
|
AS730_158_093
/ FW731.76
10/25/15
|
Impact: Security
Severity: SPE
System firmware changes that affect all systems
- A security problem was fixed in OpenSSL where a remote
attacker could crash the service processor with malformed Elliptic
Curve private keys. The Common Vulnerabilities and Exposures
issue number is CVE-2015-0209.
- A security problem was fixed in OpenSSL where a remote
attacker could crash the service processor with a specially crafted
X.509 certificate that causes an invalid pointer, out-of-bounds write,
or a null pointer de-reference. The Common Vulnerabilities and
Exposures issue numbers are CVE-2015-0286, CVE-2015-0287, and
CVE-2015-0288.
- A security problem was fixed for an OpenSSL specially
crafted X.509 certificate that could cause the service processor to
reset in a denial-of-service (DoS) attack. The Common
Vulnerabilities and Exposures issue number is CVE-2015-1789.
- A problem was fixed for a stop condition in the processing
of the Host Fabric Interface (HFI) broadcast traffic that resulted in
network boots failing for cluster nodes. This problem is
intermittent and requires heavy HFI traffic to cause the error.
To help reduce this problem, a staggered IPL of the nodes can be
used in a large cluster instead of a simultaneous IPL.
- A problem was fixed for the bulk power controller (BPC) not
being able to connect to a service processor with Security Mode set to
"SSLv3 Disabled". The Advanced System Management Interface (ASMI)
is used to change the Security Mode to "SSLv3 Disabled". This
highest level of security protection does not allow service processor
clients to connect using the SSLv3 protocol.
|
AS730_155_093
/ FW731.75
09/15/15
|
Impact: Availability
Severity: SPE
New Features and Functions
- For water cooled systems, the water flushing was enhanced
to ensure that the water in the primary side pipes is fresh and
accurately cold if the current water temperature is reading high
compared to the lowest facility water temperature.
- A security enhancement was made to prevent unsecured
connections to the PTLIC Monitor. The user must now log in to the BPC
service processor before accessing the PTLIC Monitor.
System firmware changes that affect all systems
- A problem was fixed for a SRC 14020059 reported against the
Motor Drive Assembly (MDA) card in the BPC.
System firmware changes that affect certain systems
- For systems with large clusters, a problem was fixed for
Local Network Management Controller (LNMC) network time-outs during a
simultaneous IPL of the entire cluster. The LNMC network response
was improved by optimizing its internal tracing to make it more
efficient.
|
AS730_153_093
/ FW731.74
06/26/15
|
Impact: Security
Severity: SPE
System firmware changes that affect all systems
- A security problem
was fixed in OpenSSL for padding-oracle attacks known as Padding Oracle
On Downgraded Legacy Encryption (POODLE). This attack allows a
man-in-the-middle attacker to obtain a plain text version of the
encrypted session data. The Common Vulnerabilities and Exposures issue
number is CVE-2014-3566. The service processor POODLE fix is
based on a selective disablement of SSLv3 using the Advanced System
Management Interface (ASMI) "System Configuration/Security
Configuration" menu options. The Security Configuration options
of "Disabled", "Default", and "Enabled" for SSLv3 determine the level
of protection from POODLE. The management console also requires a
POODLE fix for APAR MB03838 (FIX FOR CVE-2014-3566 FOR HMC V7 R7.3.0
SP7 (PTF MH01456) ) to eliminate all vulnerability to POODLE and allow
use of option 1 "Disabled" as shown below. This HMC minimum
requirement is enforced by the firmware update process for this defect.
The POODLE fix also addresses a vulnerability commonly referred to as
"Bar Mitzvah Attack" with CVE-2015-2808. The RC4 cipher algorithm, as
used in the TLS protocol and SSL protocol, could allow a remote
attacker to obtain sensitive information. The use of the RC4
cipher has been discontinued.
-1) Disabled: This highest level of security protection does not
allow service processor clients to connect using SSLv3, thereby
eliminating any possibility of a POODLE attack. All clients must
be capable of using TLS to make the secured connections to the service
processor to use this option.
-2) Default: This medium level of security protection disables
SSLv3 for the web browser sessions to ASMI and for the CIM clients and
assures them of POODLE-free connections. But the legacy
management consoles are allowed to use SSLv3 to connect to the service
processor. This is intended to allow non-POODLE compliant HMC
levels to be able to connect to the CEC servers until they can be
planned and upgraded to the POODLE compliant HMC levels. Connecting
a non-POODLE compliant HMC to a service processor in "Default"
mode will prevent the ASMI-proxy sessions from the HMC from connecting
as these proxy sessions require SSLv3 support in ASMI.
-3) Enabled: This basic level of security protection enables
SSLv3 for all service processor client connections. It relies on
all clients being at POODLE fix compliant levels to provide full POODLE
protection using the TLS Fallback Signaling Cipher Suite Value
(TLS_FALLBACK_SCSV) to prevent fallback to vulnerable SSLv3
connections. This option is intended for customer sites on
protected internal networks that have a large investment in legacy
hardware that need SSLv3 to make browser and HMC connections to the
service processor. The level of POODLE protection actually
achieved in "Enabled" mode is determined by the percentage of clients
that are at the POODLE fix compliant levels.
- A security problem was fixed in the OpenSSL (Secure Socket
Layer) protocol that allowed a man-in-the-middle attacker, via a
specially crafted fragmented handshake packet, to force a TLS/SSL
server to use TLS 1.0, even if both the client and server supported
newer protocol versions. The Common Vulnerabilities and Exposures issue
number for this problem is CVE-2014-3511.
- A security problem was fixed in OpenSSL for formatting
fields of security certificates without null-terminating the output
strings. This could be used to disclose portions of the program
memory on the service processor. The Common Vulnerabilities and
Exposures issue number for this problem is CVE-2014-3508.
- Multiple security problems were fixed in the way that
OpenSSL handled Datagram Transport Layer Security (DTLS) packets.
A specially crafted DTLS handshake packet could cause the service
processor to reset. The Common Vulnerabilities and Exposures
issue numbers for these problems are CVE-2014-3505, CVE-2014-3506 and
CVE-2014-3507.
- A security problem was fixed in OpenSSL to prevent a denial
of service when handling certain Datagram Transport Layer Security
(DTLS) ServerHello requests. A specially crafted DTLS handshake
packet with an included Supported EC Point Format extension could cause
the service processor to reset. The Common Vulnerabilities and
Exposures issue number for this problem is CVE-2014-3509.
- A security problem was fixed in OpenSSL to prevent a denial
of service by using an exploit of a null pointer de-reference during
anonymous Diffie Hellman (DH) key exchange. A specially crafted
handshake packet could cause the service processor to reset. The
Common Vulnerabilities and Exposures issue number for this problem is
CVE-2014-3510.
- A security problem was fixed in OpenSSL for memory leaks
that allowed remote attackers to cause a denial of service (out of
memory on the service processor). The Common Vulnerabilities and
Exposures issue numbers are CVE-2014-3513 and CVE-2014-3567.
- A problem was fixed for intermittent B181EF88 SRCs and
netsSlp core dumps during network configurations on the service
processor. This error caused call home activity for the SRC and
dumps but otherwise had no impact to the CEC functionality.
- A problem was fixed for the Integrated Switch Network
Manager (ISNM) that caused it to put many Integrated Switch Routers
(ISRs) in the cluster into a non-functional state if all the drawers of
the HPC CEC were rebooted simultaneously.
- A security problem was fixed in OpenSSL where the service
processor would, under certain conditions, accept Diffie-Hellman client
certificates without the use of a private key, allowing a user to
falsely authenticate. The Common Vulnerabilities and Exposures
issue number is CVE-2015-0205.
- A security problem was fixed in OpenSSL to prevent a denial
of service when handling certain Datagram Transport Layer Security
(DTLS) messages. A specially crafted DTLS message could exhaust
all available memory and cause the service processor to reset.
The Common Vulnerabilities and Exposures issue number is CVE-2015-0206.
- A security problem was fixed in OpenSSL to prevent a denial
of service when handling certain Datagram Transport Layer Security
(DTLS) messages. A specially crafted DTLS message could trigger a
null pointer de-reference and cause the service processor to
reset. The Common Vulnerabilities and Exposures issue number is
CVE-2014-3571.
- A security problem was fixed in OpenSSL to fix multiple
flaws in the parsing of X.509 certificates. These flaws could be
used to modify an X.509 certificate to produce a certificate with a
different fingerprint without invalidating its signature, and possibly
bypass fingerprint-based blacklisting. The Common Vulnerabilities
and Exposures issue number is CVE-2014-8275.
- A security vulnerability, commonly referred to as GHOST,
was fixed in the service processor glibc functions gethostbyname() and
gethostbyname2() that allowed remote users of the functions to cause a
buffer overflow and execute arbitrary code with the permissions of the
server application. There is no way to exploit this vulnerability
on the service processor but it has been fixed to remove the
vulnerability from the firmware. The Common Vulnerabilities and
Exposures issue number is CVE-2015-0235.
- On systems with redundant service processors, a
problem was fixed so that a backup memory clock failure with SRC
B120CC62 is handled without terminating the system running on the
primary memory clock.
- A problem was fixed in the Advanced System Management
Interface (ASMI) to reword a confusing message for systems with no
deconfigured resources. The "System Service Aids/Deconfiguration
Records" message text for this situation was changed from
"Deconfiguration data is currently not available." to "No deconfigured
resources found in the system."
- On a system with redundant service processors, a problem
was fixed for bad pointer reference in the mailbox function during data
synchronization between the two service processors. The
de-reference of the bad pointer caused a core dump, reset/reload, and
fail-over to the backup service processor.
- A problem was fixed with the fspremote service tool to make
it support TLSv1.2 connections to the service processor to be
compatible with systems that had been fixed for the OpenSSL Padding
Oracle On Downgraded Legacy Encryption (POODLE) vulnerabilities.
After the POODLE fix is installed, by default the system only allows
secured connections from clients using the TLSv1.2 protocol.
- The Avago firmware for the optical transmitters was updated
to the 0B.41 level that fixed a problem in the 0B.31 level, where
certain lasers that were partially degraded were completely turned off
by the 0B.31 firmware before their effective usability lifetime was
completely finished. The 0B.41 firmware will keep the lasers operating
as long as they are able to transmit data in an error-free manner.
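The three SSLv3 Security Configuration options described in this entry ("Disabled", "Default", and "Enabled") can be summarized as a small decision function. This is only a sketch of the policy as worded in the text above. The enum and function names are hypothetical, and it does not model TLS_FALLBACK_SCSV or the ASMI-proxy session behaviour.

```python
from enum import Enum


class SecurityMode(Enum):
    DISABLED = "Disabled"
    DEFAULT = "Default"
    ENABLED = "Enabled"


class ClientKind(Enum):
    # Hypothetical client categories, named after the clients mentioned in the text.
    ASMI_BROWSER = "asmi_browser"
    CIM_CLIENT = "cim_client"
    MANAGEMENT_CONSOLE = "management_console"


def sslv3_allowed(mode: SecurityMode, client: ClientKind) -> bool:
    """Return True if a client of this kind may connect using SSLv3 in this mode."""
    if mode is SecurityMode.DISABLED:
        # No service processor client may use SSLv3; all must use TLS.
        return False
    if mode is SecurityMode.ENABLED:
        # SSLv3 is enabled for all service processor client connections.
        return True
    # "Default": SSLv3 is off for ASMI browser sessions and CIM clients,
    # but legacy management consoles may still use it.
    return client is ClientKind.MANAGEMENT_CONSOLE


for mode in SecurityMode:
    allowed = [c.value for c in ClientKind if sslv3_allowed(mode, c)]
    print(f"{mode.value}: SSLv3 allowed for {allowed}")
```

Only the "Default" mode distinguishes between client kinds; the other two apply one rule to every client.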
System firmware changes that affect certain systems
- On systems with large clusters, a problem was fixed
for optical link failures when simultaneously booting all CECs of the
cluster. Links may be left in the state of "DOWN_RECV_GOOD",
which means a port on one side of an optical link did not report a state
of link "up".
|
AS730_142_093
/ FW731.73
10/17/14
|
Impact: Availability
Severity: ATT
System firmware changes that affect all systems
- A problem was fixed
for a net session entry file lock error that prevented the management
console from connecting to the service processor.
|
AS730_141_093
/ FW731.72
09/08/14
|
Impact: Security
Severity: SPE
System firmware changes that affect all systems
- A problem was fixed
for I/O adapters so that BA400002 errors were changed to informational
for memory boundary adjustments made to the size of DMA map-in
requests. These size adjustments were marked as UE previously for
a condition that is normal.
- A security problem was fixed for the Lighttpd web
server that allowed arbitrary SQL commands to be run on the service
processor. The Common Vulnerabilities and Exposures issue number
is CVE-2014-2323.
- A security problem was fixed for the Lighttpd web server
where improperly-structured URLs could be used to view arbitrary files
on the service processor. The Common Vulnerabilities and
Exposures issue number is CVE-2014-2324.
- A security problem was fixed in the service processor
TCP/IP stack to discard illegal TCP/IP packets that have the SYN and
FIN flags set at the same time. An explicit packet discard was
needed to prevent further processing of the packet that could result in
a bypass of the iptables firewall rules.
- A security problem was fixed for the Network Time Protocol
(NTP) client that allowed remote attackers to execute arbitrary code
via a crafted packet containing an extension field. The Common
Vulnerabilities and Exposures issue number is CVE-2009-1252.
- A security problem was fixed for the Network Time Protocol
(NTP) client for a buffer overflow that allowed remote NTP servers to
execute arbitrary code via a crafted response. The Common
Vulnerabilities and Exposures issue number is CVE-2009-0159.
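The SYN and FIN check mentioned in this entry can be sketched as a test on the TCP header flags byte. The bit values are the standard TCP flag positions (FIN = 0x01, SYN = 0x02). The function name is hypothetical, and this is an illustration of the check, not the service processor's implementation.

```python
TCP_FIN = 0x01  # standard TCP header flag bit
TCP_SYN = 0x02  # standard TCP header flag bit


def has_illegal_syn_fin(flags: int) -> bool:
    """Return True when both SYN and FIN are set in the TCP flags byte."""
    both = TCP_SYN | TCP_FIN
    return (flags & both) == both


print(has_illegal_syn_fin(0x02))  # False: SYN only
print(has_illegal_syn_fin(0x03))  # True: SYN and FIN together
```

A packet for which this returns True would be discarded before any further processing, so it never reaches the firewall rule evaluation.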
System firmware changes that affect certain systems
- On a system with a disk device with multiple boot
partitions, a problem was fixed that caused System Management Services
(SMS) to list only one boot partition. Even though only one boot
partition was listed in SMS, the AIX bootlist command could still be
used to boot from any boot partition.
|
AS730_140_093
/ FW731.71
08/21/14
|
Impact: Security
Severity: HIPER
System firmware changes that affect all systems
- HIPER/Pervasive:
A security problem was fixed in the OpenSSL (Secure Socket Layer)
protocol that allowed clients and servers, via a specially crafted
handshake packet, to use weak keying material for communication.
A man-in-the-middle attacker could use this flaw to decrypt and modify
traffic between the management console and the service processor.
The Common Vulnerabilities and Exposures issue number for this problem
is CVE-2014-0224.
- HIPER/Pervasive:
A security problem was fixed in OpenSSL for a buffer overflow in the
Datagram Transport Layer Security (DTLS) when handling invalid DTLS
packet fragments. This could be used to execute arbitrary code on
the service processor. The Common Vulnerabilities and Exposures
issue number for this problem is CVE-2014-0195.
- HIPER/Pervasive:
Multiple security problems were fixed in the way that OpenSSL handled
read and write buffers when the SSL_MODE_RELEASE_BUFFERS mode was
enabled to prevent denial of service. These could cause the
service processor to reset or unexpectedly drop connections to the
management console when processing certain SSL commands. The
Common Vulnerabilities and Exposures issue numbers for these problems
are CVE-2010-5298 and CVE-2014-0198.
- HIPER/Pervasive:
A security problem was fixed in OpenSSL to prevent a denial of service
when handling certain Datagram Transport Layer Security (DTLS)
ServerHello requests. A specially crafted DTLS handshake packet could
cause the service processor to reset. The Common Vulnerabilities
and Exposures issue number for this problem is CVE-2014-0221.
- HIPER/Pervasive:
A security problem was fixed in OpenSSL to prevent a denial of service
by using an exploit of a null pointer de-reference during anonymous
Elliptic Curve Diffie Hellman (ECDH) key exchange. A specially
crafted handshake packet could cause the service processor to
reset. The Common Vulnerabilities and Exposures issue number for
this problem is CVE-2014-3470.
- Help text for the Advanced System Management Interface
(ASMI) "System Configuration/Hardware Deconfiguration/Clear All
Deconfiguration Errors" menu option was enhanced to clarify that when
selecting "Hardware Resources" value of "All hardware resources", the
service processor deconfiguration data is not cleared.
The "Service processor" must be explicitly selected for that to be
cleared.
- A problem was fixed that prevented guard error logs from
being reported for FRUs that were guarded during the system power
on. This could happen if the same FRU had been previously
reported as guarded on a different power on of the system. The
requirement is now met that guarded FRUs are logged on every power on
of the system.
|
AS730_138_093
/ FW731.70
05/09/14
|
Impact: Availability
Severity: SPE
New Features and Functions
- Support was dropped for Secured Socket Layer (SSL) Version
2 and SSL weak and medium cipher suites in the service processor web
server (Lighttpd). Unsupported web browser connections to the
Advanced System Management Interface (ASMI) secured port 443 (using
https://) will now be rejected if those browsers do not support SSL
version 3. Supported web browsers for Power7 ASMI are Netscape
(version 9.0.0.4), Microsoft Internet Explorer (version 7.0), Mozilla
Firefox (version 2.0.0.11), and Opera (version 9.24).
System firmware changes that affect all systems
- A problem was fixed that prevented the service processor
from recognizing the I/O hub Host Fabric Interface (HFI) and Collective
Acceleration Unit (CAU) components as valid functional units
(FUs). This caused guard reports to show "Invalid FU" as
the hardware type of the components along with an incorrect
"DECONFIGURED" call out hardware state.
- A problem was fixed that caused system memory to be guarded
when service processor errors on the FRU Support Interface (FSI)
occurred.
- A problem was fixed that caused a flood of predictive error
(PE) logs with SRC B181E550 for Integrated Switch Router (ISR) chip
recoverable errors. The errors are logged by the service
processor PRD component with signature description "io(n0p0) Undefined
error code" but there is no hardware guarded.
- A problem was fixed that caused a service processor dump to
be generated with SRC B18187DA "NETC_RECV_ER" logged.
- A problem was fixed that caused a SRC B1754201 predictive
error to be logged without call out actions. Missing call outs
were added for bus errors accessing the Torrent chip.
- A problem was fixed that could block Host Fabric Interface
(HFI) array error recovery and eventually lead to a double bit error,
which would cause the HFI to become unusable until the next system
reboot.
- A problem was fixed that caused an error log generated by
the partition firmware to show conflicting firmware levels. This
problem occurs after a firmware update or a logical partition migration
(LPM) operation on the system.
- A problem was fixed in the isolation of PCI faults for
stopped clocks so that the error would not cause a system-wide
failure. The error is now limited to the affected logical
partition (LPAR).
- A problem was fixed that caused a L2 cache error to not
guard out the faulty processor, allowing the system to checkstop again
on an error to the same faulty processor.
- A problem was fixed that caused a HMC code update failure
for the FSP on the accept operation with SRC B1811402, or left the FSP
unable to boot on the updated side.
- DEFERRED: A problem
was fixed that caused a system checkstop during hypervisor time keeping
services. This deferred fix addresses a problem that has a very low
probability of occurrence. As such, customers may wait for the
next planned service window to activate the deferred fix via a system
reboot.
- A problem was fixed that caused a loss of Time of Day (TOD)
clock redundancy after a power repair of a Distributed Conversion and
Control Assembly (DCCA). After the DCCA repair, the primary and
secondary TOD were assigned to the same oscillator in the DCCA that
never lost power, even though both system oscillators were functional.
- A problem was fixed that caused the system attention LED to
be lit without a corresponding SRC and error log for the event.
This problem typically occurs when an operating system on a partition
terminates abnormally.
- DEFERRED: A problem
was fixed that caused a system checkstop with SRC B113E504 for a
recoverable hardware fault. This deferred fix addresses a problem
that has a very low probability of occurrence. As such, customers
may wait for the next planned service window to activate the deferred
fix via a system reboot.
System firmware changes that affect certain systems
- On systems running AIX or Linux, a problem was fixed that
caused a partition to fail to boot with SRC CA260203. This
problem also can cause concurrent firmware updates to fail.
- On systems using IPv6 addresses, the firmware was enhanced
to reduce the time it takes to install an operating system using the
Network Installation Manager (NIM).
- On a partition with a large number of potentially bootable
devices, a problem was fixed that caused the partition to fail to boot
with a default catch, and SRC BA210000 may also be logged.
- On systems in a high-performance computing (HPC) B-side
cluster with an 8D_2S cross-coupled topology, a problem in the Local
Network Management Controller (LNMC) was fixed that caused distance
link (D-link) virtual channel (VC) deadlocks when using indirect
routes. Secondary routes had been erroneously included in the
indirect route chain. For this problem, the Executive Manager
Server (EMS) will repeatedly log "VC Deadlock Error" messages
into the /var/opt/isnm/cnm/logs/EVT_SUM.log file.
- A problem was fixed in the run-time abstraction services
(RTAS) extended error handling (EEH) for fundamental reset that caused
partitions to crash during adapter updates. The fundamental reset
of adapters now returns a valid return code. The adapter drivers
using fundamental reset affected by this fix are the following:
o QLogic PCIe Fibre Channel adapters (combo card)
o IBM PCIe Obsidian
o Emulex BE3-based ethernet adapters
o Broadcom-based PCIe2 4-port 1Gb ethernet
o Broadcom-based FlexSystem EN2024 4-port 1Gb ethernet for compute nodes
- On systems with a DIMM error, a problem was fixed in
the service processor memory diagnostic that caused the
de-configuration of all memory. The memory diagnostic had failed
all the memory due to special attention flooding caused by the bad
hardware that did not allow the memory diagnostic to
complete. With the special attention flooding prevented,
the memory diagnostic is now able to isolate the DIMM error to a FRU
location and guard it so the system is able to IPL.
|
AS730_130_093
/ FW731.61
10/25/13
|
Impact: Availability
Severity: SPE
System firmware changes that affect certain systems
- On systems in a
high-performance computing (HPC) B-side cluster with an 8D_2S
cross-coupled topology, a problem in the Local Network Management
Controller (LNMC) was fixed that caused distance link (D-link) virtual
channel (VC) deadlocks when using indirect routes. Secondary
routes had been erroneously included in the indirect route chain.
For this problem, the Executive Manager Server (EMS) will repeatedly
log "VC Deadlock Error" messages into the
/var/opt/isnm/cnm/logs/EVT_SUM.log file.
|
AS730_125_093
03/11/13
|
Impact: Availability
Severity: SPE
System firmware changes that affect all systems
- A problem was
fixed that caused SRC B1813221, which indicates a failure of the
battery on the service processor, to be erroneously logged after a
service processor reset or power cycle.
- A problem was fixed that caused various SRCs to be
erroneously logged at boot time including B181E6C7 and B1818A14.
- A problem was fixed that caused a system to abnormally
terminate due to a null pointer reference.
- The firmware was enhanced to reduce "sender hang" errors
and failures to boot nodes via the cluster fabric.
System firmware changes that affect certain systems
- On large clusters, a problem was fixed that caused some
links in the system to remain permanently in the DOWN_RECV_GOOD
state. The links in question will not be fully utilized for data
transmission. The problem occurs with regular frequency on large
clusters when re-IPLing all CECs in the system.
|
AS730_118_093
11/02/12
|
Impact: Function
Severity: SPE
System firmware changes that affect all systems
- DEFERRED: A problem
was fixed that could cause a live lock on the power bus resulting in a
system crash.
- The firmware was enhanced to increase the
performance of certain applications by updating the routing tables.
- A problem was fixed that
caused a segmentation fault in the service processor firmware.
When this occurred, a PERC error with SRC B181C350 was logged.
- On systems on which
Internet Explorer (IE) is used to access the Advanced System Management
Interface (ASMI) on the Hardware Management Console (HMC), a problem
was fixed that caused IE to hang for about 10 minutes after saving
changes to network parameters on the ASMI.
- A problem was fixed that
caused the gateway network address to be shown incorrectly on the
System Management Services (SMS) menus when booting a partition on an
iSCSI network.
- A problem was fixed that
caused a "code accept" during a concurrent firmware installation from
the HMC to fail with SRC E302F85C.
- On storage drawers in a
cross-coupled topology, an attempt to place an indirect (failover)
route at an SNID location in the SRT1 route table may result in a
failover route that uses the opposite compute sub-cluster as a bounce
point. The firmware was enhanced to prevent this, since there are
no physical links between the two compute sub-clusters in a
cross-coupled topology. Having a failover route through the
opposite compute sub-cluster will lead to packet loss and application
failure.
- A problem was fixed that
prevented predictive guard errors from being deleted on the secondary
service processor. This caused hardware to be erroneously guarded
out if a service processor failover occurred.
- A problem was fixed that
caused the service processor to be reset during a CEC power off or
reboot. This causes the system to terminate, followed by a
platform reboot. SRC B181E6C7 is typically logged when this
problem occurs.
- A problem was fixed that
caused a system crash with unrecoverable SRC B7000103 and
"ErFlightRecorder" in the failing stack.
- A problem was fixed that
caused the following symptoms on user-level jobs:
1. During job initialization when starting communication
over the cluster fabric, an error message similar to the following:
4:ERROR 629 fD4fs: Message type 21 from source 4
4:MPI-PAMI ERROR: pami_init() failed with rc(1)
4:ERROR: 0031-309 Connect failed during message passing initialization, task 4, reason:
2. The initialization may succeed, but an HFI translation
failure may occur, causing a time out on the cluster network and other
side effects.
System firmware changes that affect certain systems
- A problem was fixed that caused the dual-port Ethernet
adapter, F/C 5270 and F/C 5708, to fail to power on with SRC B7006970.
- On systems in a high-performance computing (HPC) cluster in
8D topology, a problem was fixed that caused a secondary route to be
linked to an indirect route chain. Jobs that are run in indirect
route mode may experience hangs and performance problems.
- The firmware was enhanced to improve the performance when
indirect routing is used in large cluster systems.
|
AS730_103_093
06/27/12
|
Impact: Availability
Severity: SPE
System firmware changes that affect all systems
- A problem was fixed that caused a
segmentation fault in the service processor firmware. When this
occurred, a PERC error with SRC B181C350 was logged.
System firmware changes that affect certain systems
- On nodes with a single DCCA running AS730_093, a problem
was fixed that prevented the node from booting, with SRC 10008732
erroneously logged.
|
AS730_093_093
06/13/12
|
Impact: Serviceability
Severity: SPE
System firmware changes that affect all systems
- DEFERRED: The firmware was enhanced to fix a
potential performance degradation on systems utilizing the stride-N
stream prefetch instructions dcbt (with TH=1011) or dcbtst (with
TH=1011). Typical applications executing these algorithms include
High Performance Computing, data intensive applications exploiting
streaming instruction prefetches, and applications utilizing the
Engineering and Scientific Subroutine Library (ESSL) 5.1.
- The firmware
was enhanced to correctly handle bus errors between the P7 processor
chip and the I/O hub chip.
- The firmware was enhanced to correctly diagnose the failing
FRU when SRC B1xxE504 with error signature "MCFIR[14] - Hang timer
detector" was logged.
- The firmware was enhanced to improve the FRU callouts when
the number of multi-bit errors on a POWER7 processor bus exceeds the
threshold. This reduces the number of FRUs replaced on a failing
system.
- A problem was fixed that caused a system to crash when the
system was in low power (or safe mode), and the system attempted to
switch over to nominal mode.
- The firmware was enhanced to reduce the impact of heavy
volume errors, which can be logged as "sender hang" errors.
- The firmware was enhanced to reduce the number of "retry
fetch CE" and "DRAM spare" error log entries that call out memory
DIMMs.
- A problem was fixed that caused the first processor module
in a node to be erroneously called out if an over-temperature condition
was detected, instead of the processor module that was reporting the
over-temperature condition.
- The firmware was enhanced to handle the I/O hub ISR
(Integrated Switch Router) link port errors as software-recoverable,
rather than as hard failures. Before this enhancement, the links
would have been guarded out even though these errors were recoverable.
- A problem was fixed that caused a service processor kernel
panic due to an out-of-memory condition, with SRC B181720D.
System firmware changes that affect certain systems
- On systems with F/C 5708 and 5270 dual-port 10 Gb Ethernet
adapter cards installed, a problem was fixed that caused SRC B7006970
to be erroneously logged when the card was powered on.
- In asymmetric and cross-coupled topologies with no direct dlink
connections between a storage drawer and a compute supernode (either
through fail-in-place or through having a compute drawer or drawers
at standby), the storage drawer, upon restart or re-initialization of
the LNMC daemon (lnmcd), did not provide a failover route to the
target compute supernode, even though suitable bounce points within
the compute sub-cluster could provide the indirect route. The
firmware was enhanced to provide this indirect route.
|
AS730_084_084
04/12/12
|
Impact: Function
Severity: SPE
New Features and Functions
- Support for cross-coupled compute-to-storage topology for a
2-drawer storage sub-cluster.
- Support for cross-coupled compute-to-storage topology for a
4-drawer storage sub-cluster.
System firmware changes that affect all systems
- The firmware was enhanced to allow a node to continue to
boot when unrecoverable SRC B181B70C is logged.
- A problem was fixed that caused an extraneous error log entry
calling out DCCA-B and hub R5 when power was removed from DCCA-A, and
the service processor and TPMD in DCCA-A were primary.
- The firmware was enhanced to more gracefully handle the system
shutdown that is required when a hypervisor hang condition is
encountered. SRCs B7000602, B182951C, B1813918 and A7001151 were
logged, and a service processor failover occurred, when the hypervisor
hang condition and subsequent system crash occurred.
- The firmware was enhanced to cause the secondary service
processor to automatically pick up configuration changes stored on the
primary service processor. This prevents the new configuration
information from being lost if a service processor failover occurs
before the secondary has picked up the new configuration information;
typically this problem will only be encountered just after a system is
installed.
- The firmware was enhanced to gracefully recover, and log
the correct error logs, if the secondary DCCA loses power.
- A problem was fixed that prevented communication between the
compute and storage networks in asymmetric ISR network topologies.
This affected network topologies DD2_64_8_2A, DD2_64_8_2B, DD2_64_8_4A,
and DD2_64_8_4B.
- A problem was fixed that caused SRC B181E6F1
("RMGR_PERSISTENT_EVENT_TIMEOUT") to be erroneously logged.
- The firmware was enhanced to reduce the number of memory
DIMMs replaced due to correctable errors being logged.
- A problem was fixed that caused unrecoverable SRC B130CD03
to be erroneously logged.
- A problem was fixed that caused SRC B7000602 to be
erroneously logged at power on.
- The firmware was enhanced to prevent a potential deadlock in the
opposite-side storage drawer if all of the cross-coupled dlinks between
a compute supernode (at runtime) and a storage drawer (at runtime) are
taken down. This problem also affects indirect routing from compute
to storage over cross-coupled links.
- A problem was fixed that caused the Local Network Management
Controller (LNMC) to be set to the wrong state during a service
processor (DCCA) failover. If this problem occurs, the most likely
symptom will be a communication failure on the ISR network.
- A problem was fixed that caused a partition running AIX to
crash.
- A new level of optical link firmware is included in this service
pack, and the optical link firmware update function is enabled. The
new optical link device firmware will be automatically installed the
next time the node is booted after this service pack is installed.
Please see "Additional Details About Installing This Service Pack" in
the "Important Information" section.
- The firmware was enhanced to increase the threshold of soft
NVRAM errors on the service processor to 32 before SRC B15xF109 is
logged. (Replacement of the service processor is recommended if
more than one B15xF109 is logged per week.)
|