IBM POWER9 Systems LC Server Firmware

Applies to: AC922 (8335-GTG)

This document provides information about the installation of Licensed Machine or Licensed Internal Code, which is sometimes referred to generically as microcode or firmware.

Contents

1.0 Systems Affected

1.1 Minimum ipmitool Code Level

1.2 Fix level Information on IBM Open Power Components and Operating systems

1.3 Minimum xCAT level 2.13.4 for use in firmware updates

1.4 Required NVIDIA CUDA driver level for the Tesla V100 GPU

1.5 Required Broadcom Ethernet driver level for the BCM5719

2.0 Important Information

3.0 Firmware Information

3.1 Firmware Information and Description

4.0 Operating System Information

4.1 Linux Operating System

4.2 How to Determine the Level of a Linux Operating System

4.3 How to Determine if the opal-prd (Processor Recovery Diagnostics) package is installed

5.0 How to Determine The Currently Installed Firmware Level

6.0 Downloading the Firmware Package

7.0 Installing the Firmware

7.1 IBM Power Systems Firmware maintenance

7.2 OpenBMC System Firmware Update using openbmctool

8.0 System Management and Virtualization

8.1 BMC Service Processor IPMI

8.2 Open Power Abstraction Layer (OPAL)

8.3 Intelligent Platform Management Interface (IPMI)

8.4 Petitboot bootloader

9.0 Quick Start Guide for Installing Linux on the LC 8335 server

10.0 Change History

1.0 Systems Affected

This package provides firmware for the Power System AC922 (8335-GTG) server only.

The firmware level in this package is:

•OP910.24 / PNOR OP9_v1.19_1.189 / BMC ibm-v2.0-0-r46

Note: Before updating to the OP910.24 or later firmware level, ensure that the Linux OS is at RHEL 7.5-ALT LE with the third Z-stream and that the NVIDIA CUDA driver for the NVIDIA Tesla GPUs on the system is at the recommended driver level 396.44 or later, or at the minimum level 396.26. See "1.4 Required NVIDIA CUDA driver level for the Tesla V100 GPU" for more information. After the firmware update, ensure that the BCM5719 Ethernet driver is updated to level 5719-v1.43 NCSI v1.4.22.0. See "1.5 Required Broadcom Ethernet driver level for the BCM5719". The complete set of update instructions covering the OS, CUDA driver, firmware, and Ethernet driver can be found in the readme guide on Fix Central called "WSP_CUDA_BCM5719_FWUPG_GUIDE.txt".

1.1 Minimum ipmitool Code Level

This section specifies the minimum ipmitool code level required by the system firmware for managing the system. Open Power requires ipmitool level v1.8.15 or later to execute correctly on the OP910 firmware. It must be capable of establishing an IPMI v2 session with the IPMI support on the BMC.

Verify the ipmitool level on your Linux workstation using the following command:

bash-4.1$ ipmitool -V

ipmitool version 1.8.15
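The version check above can also be scripted. The following is a minimal sketch, assuming the output of "ipmitool -V" has the form shown above; the helper names are hypothetical and are not part of ipmitool.

```python
import re
import subprocess

MINIMUM_IPMITOOL = (1, 8, 15)  # minimum level stated in this section


def parse_ipmitool_version(output):
    """Extract (major, minor, patch) from text such as 'ipmitool version 1.8.15'."""
    match = re.search(r"ipmitool version (\d+)\.(\d+)\.(\d+)", output)
    if match is None:
        raise ValueError("unrecognized ipmitool version output: %r" % output)
    return tuple(int(part) for part in match.groups())


def meets_minimum(output, minimum=MINIMUM_IPMITOOL):
    """Return True when the reported version is at or above the minimum."""
    return parse_ipmitool_version(output) >= minimum


def check_local_ipmitool():
    """Run 'ipmitool -V' (must be on the PATH) and compare against the minimum."""
    result = subprocess.run(
        ["ipmitool", "-V"], capture_output=True, text=True, check=True
    )
    return meets_minimum(result.stdout)
```

Calling check_local_ipmitool() on the workstation returns True when the installed ipmitool is at level 1.8.15 or later.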

If you need to update or add ipmitool to your Linux workstation, you can compile ipmitool (current level 1.8.15) for Linux from SourceForge as follows:

1.1.1 Download the ipmitool tarball from http://sourceforge.net/projects/ipmitool/ to your Linux system

1.1.2 Extract the tarball on the Linux system

1.1.3 cd to top-level directory

1.1.4 ./configure

1.1.5 make

1.1.6 ipmitool will be under src/ipmitool

You may also install the ipmitool package directly from your Linux distribution's package repository.

1.2 Fix level Information on IBM Open Power Components and Operating systems

For specific fix level information on key components of IBM Power Systems LC and Linux operating systems, please refer to the documentation in the IBM Knowledge Center for the AC922 (8335-GTG):

https://www.ibm.com/support/knowledgecenter/POWER9/p9hdx/8335_gtg_landing.htm

1.3 Minimum xCAT level 2.13.4 for use in firmware updates

If using xCAT on the host OS to do firmware updates, the minimum xCAT level that should be used is 2.13.4 because it has stability improvements for the firmware update process. See the xCAT 2.13.4 release notes below for more information.

https://github.com/xcat2/xcat-core/wiki/XCAT_2.13.4_Release_Notes

1.4 Required NVIDIA CUDA driver level for the Tesla V100 GPU

The NVIDIA CUDA driver in the Linux OS must be at the recommended level 396.44 or later, or at the minimum level 396.26, to be compatible with OP910.24. Without this driver, a GPU that has faulted and gone through a GPU reset can cause a Terminate Immediate (TI) for the system. Level 396.44 is recommended because it provides ATS performance improvements.
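The driver-level rule above can be expressed as a small check. This is a sketch only: the function names are hypothetical, and it assumes the installed driver level is available as a dotted string, for example as reported by "nvidia-smi --query-gpu=driver_version --format=csv,noheader".

```python
RECOMMENDED_DRIVER = (396, 44)  # recommended level from this section
MINIMUM_DRIVER = (396, 26)      # minimum level from this section


def parse_driver_version(text):
    """Convert a driver string such as '396.44' into a tuple of ints."""
    return tuple(int(part) for part in text.strip().split("."))


def classify_driver(text):
    """Classify a driver level as 'recommended', 'minimum', or 'unsupported'."""
    version = parse_driver_version(text)
    if version >= RECOMMENDED_DRIVER:
        return "recommended"
    if version >= MINIMUM_DRIVER:
        return "minimum"
    return "unsupported"
```

A result of "unsupported" means the driver should be updated before the OP910.24 firmware is installed.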

The Power AC922 server delivers four Tesla V100 with NVLink GPUs supported in two processor sockets.

Feature #EC4J provides the NVIDIA Tesla V100 GPU with NVLINK Air-Cooled (16 GB). CUDA (Compute Unified Device Architecture) is a parallel computing platform and programming model created by NVIDIA and implemented by the graphics processing units (GPUs) that they produce.

The Tesla CUDA driver can be downloaded from NVIDIA at "https://www.nvidia.com/content/DriverDownload-March2009/confirmation.php?url=/tesla/396.44/nvidia-driver-local-repo-rhel7-396.44-1.0-1.ppc64le.rpm&lang=us&type=Tesla"

To search for the driver manually, go to "http://www.nvidia.com/Download/index.aspx?lang=en-us" and enter the following information:

Manually find drivers for my NVIDIA products.

Product Type: Tesla

Product Series: V-Series

Product: Tesla V100

Operating System: Linux POWER LE RHEL 7

CUDA Toolkit: 9.2

Language: English(US)

Search results:

Version: 396.44

Release Date: 2018.8.6

Operating System: Linux POWER LE RHEL 7

CUDA Toolkit: 9.2

Language: English (US)

File Size: 47.28 MB

1.5 Required Broadcom Ethernet driver level for the BCM5719

The tools and driver images are provided in Fix Central to update the BCM5719 ethernet adapter to NCSI level v1.4.22.0.

Use the steps provided in the WSP_CUDA_BCM5719_FWUPG_GUIDE.txt readme file to perform the needed updates.

I/O Adapter driver level before update:

Dual port BCM5719 with shared port with BMC (NCSI)

Adapter FW: 5719-v1.43 NCSI v1.3.12.0

I/O Adapter level after update:

firmware-version: 5719-v1.43 NCSI v1.4.22.0
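To confirm the adapter level after the update, the NCSI level can be parsed out of the firmware-version string. The sketch below assumes the string has the form shown above (the "firmware-version:" line resembles what "ethtool -i <interface>" prints, which is an assumption here); the function names are hypothetical.

```python
import re

REQUIRED_NCSI = (1, 4, 22, 0)  # NCSI level after the update, from this section


def parse_ncsi_level(firmware_version):
    """Extract the NCSI level from text like '5719-v1.43 NCSI v1.4.22.0'."""
    match = re.search(r"NCSI v(\d+(?:\.\d+)*)", firmware_version)
    if match is None:
        raise ValueError("no NCSI level in %r" % firmware_version)
    return tuple(int(part) for part in match.group(1).split("."))


def ncsi_is_current(firmware_version, required=REQUIRED_NCSI):
    """Return True when the adapter reports the required NCSI level or later."""
    return parse_ncsi_level(firmware_version) >= required
```

With the two strings listed above, the pre-update level (NCSI v1.3.12.0) is reported as not current and the post-update level (NCSI v1.4.22.0) as current.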

2.0 Important Information

Downgrading firmware from any given release level to an earlier release level is not recommended.

If you feel that it is necessary to downgrade the firmware on your system to an earlier release level, please contact your next level of support.

Concurrent Firmware Updates not available for LC servers.

Concurrent system firmware update is not supported on LC servers.

3.0 Firmware Information

Use the following examples as a reference to determine whether your installation will be concurrent or disruptive.

For the LC server systems, the installation of system firmware is always disruptive.

3.1 Firmware Information and Description

The BMC and PNOR image tar files are used to update the primary side of the PNOR and the primary side of the BMC only, leaving the golden sides unchanged.

List of seven files published:

1. obmc-witherspoon-ibm-v2.0.ubi.mtd.tar
2. witherspoon-IBM-OP9_v1.19_1.189.pnor.squashfs.tar
3. WSP_CUDA_BCM5719_FWUPG_GUIDE.txt - readme file for update sequence of steps for OS, GPU and Ethernet drivers, and Firmware
4. fix_bcm_5719_crc.py - bcm5719 driver install script
5. python3_fix_bcm_5719_crc.py - bcm5719 driver install script (python3 version)
6. lnxfwupg.zip - Broadcom driver update files
7. nx1_ncsi_v1.4.22_PointDrop.zip - Broadcom NCSI driver image

Filename	Size	Checksum
obmc-witherspoon-ibm-v2.0.ubi.mtd.tar	18196480	26865fe06b6a98e627755301a7e224a6
witherspoon-IBM-OP9_v1.19_1.189.pnor.squashfs.tar	22999040	1e78e8c09b0071a76c38a502239d57d4

WSP_CUDA_BCM5719_FWUPG_GUIDE.txt	6949	d549a599b95280d4cb7995f3ea436b09
fix_bcm_5719_crc.py	4344	9fa2c74a376aa7aa139deec13b865aa9
python3_fix_bcm_5719_crc.py	4352	81ee6fa3f80c0ddb5f9ab594e65011c0
lnxfwupg.zip	1398735	23cb464558fc532b24a3106f05bf2ac4
nx1_ncsi_v1.4.22_PointDrop.zip	75049	5a1616b6f1af3ab3e11bc3940fac1c0c

Note: The checksum can be found by running the Linux/Unix/AIX md5sum command against each downloaded file (all 32 characters of the checksum are listed), for example: md5sum <filename>
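As an alternative to running md5sum by hand, the comparison against the table can be scripted. The following is a minimal sketch using Python's standard hashlib module; the function names are hypothetical, and only two of the table entries are copied in for brevity.

```python
import hashlib

# Expected MD5 checksums copied from the table above (two entries shown).
EXPECTED_MD5 = {
    "obmc-witherspoon-ibm-v2.0.ubi.mtd.tar": "26865fe06b6a98e627755301a7e224a6",
    "witherspoon-IBM-OP9_v1.19_1.189.pnor.squashfs.tar": "1e78e8c09b0071a76c38a502239d57d4",
}


def md5_of_chunks(chunks):
    """Return the hex MD5 digest of an iterable of byte chunks."""
    digest = hashlib.md5()
    for chunk in chunks:
        digest.update(chunk)
    return digest.hexdigest()


def md5_of_file(path, chunk_size=1 << 20):
    """Hash a file in 1 MiB chunks so large images need not fit in memory."""
    with open(path, "rb") as handle:
        return md5_of_chunks(iter(lambda: handle.read(chunk_size), b""))


def verify(path, name):
    """Compare the file at 'path' with the expected checksum listed for 'name'."""
    return md5_of_file(path) == EXPECTED_MD5[name]
```

For example, verify("obmc-witherspoon-ibm-v2.0.ubi.mtd.tar", "obmc-witherspoon-ibm-v2.0.ubi.mtd.tar") returns True only when the downloaded file matches the published checksum.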

After a successful update to this firmware level, the PNOR components and BMC should be at the following levels.

To display the PNOR level, run the following command on the BMC: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"

To display the BMC level, run the following command on the BMC: "cat /etc/os-release".

Note: FRU information for the PNOR level does not show the updated levels via the fru command until the system has been booted once at the updated level.

PNOR firmware level: driver content

Display the PNOR firmware level with this command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"

IBM-witherspoon-ibm-OP9_v1.19_1.189
op-build-v2.0.5-322-gbeba89b
buildroot-2018.02.1-6-ga8d1126
skiboot-v6.0.6
hostboot-e5dfba1-pe671f7e
occ-f796766
linux-4.16.13-openpower1-p315a9a7
petitboot-v1.7.2-pe797756
machine-xml-94a137f
hostboot-binaries-37be536
capp-ucode-p9-dd2-v4
sbe-c34fda4
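As a minimal sketch, assuming the VERSION file has the layout listed above with the PNOR level on its first line, the level can be extracted and compared with the expected value. The function name and the regular expression are illustrative assumptions.

```python
import re

EXPECTED_PNOR = "OP9_v1.19_1.189"  # PNOR level of this package


def pnor_level(version_text):
    """Return the OP9_vX.Y_A.B level from the first line of the VERSION text."""
    lines = version_text.strip().splitlines()
    first_line = lines[0] if lines else ""
    match = re.search(r"OP9_v\d+\.\d+_\d+\.\d+", first_line)
    return match.group(0) if match else None


def pnor_is_expected(version_text, expected=EXPECTED_PNOR):
    """Return True when the VERSION text reports the expected PNOR level."""
    return pnor_level(version_text) == expected
```

The text passed in would be the output of the "cat" command shown above, captured on the BMC.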

openBMC level:

Display the BMC firmware level from an SSH session on the BMC with this command: root@witherspoon:~# cat /etc/os-release

id: openbmc-phosphor

name: Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro)

version: ibm-v2.0

version_id: ibm-v2.0-0-r46-0-gbed584c

pretty_name: Phosphor OpenBMC (Phosphor OpenBMC Project Reference Distro) ibm-v2.0

build_id: ibm-v2.0-0-r46
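The BMC level can also be checked programmatically. The sketch below assumes that /etc/os-release on the BMC uses the standard KEY=value layout, with upper-case keys corresponding to the fields listed above; the function names are hypothetical.

```python
EXPECTED_BUILD_ID = "ibm-v2.0-0-r46"  # BMC level of this package


def parse_os_release(text):
    """Parse KEY=value lines into a dict, dropping surrounding double quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key] = value.strip().strip('"')
    return fields


def bmc_build_id(text):
    """Return the BUILD_ID field, or None when it is absent."""
    return parse_os_release(text).get("BUILD_ID")
```

Comparing bmc_build_id(...) with EXPECTED_BUILD_ID indicates whether the BMC is at the level delivered by this package.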

OP910
For Impact, Severity, and other firmware definitions, refer to the 'Glossary of firmware terms' at the following URL:
http://www14.software.ibm.com/webapp/set2/sas/f/power5cm/home.html#termdefs

OP9_v1.19_1.189 / OP910.24

08/16/18

Impact: Availability Severity: SPE

New features and functions

Support was added for 24x7 On-Chip Controller (OCC) counter data collection. It allows a customer to monitor utilization and throughput of memory, buses and other system components. The data it collects is stored in system memory and the firmware provides a call interface for applications to read out this data.

Support was added for parity error checking of the GPU data on the NVLink Datalink Layer (NDL), providing earlier memory fault detection and recovery retries to eliminate transient faults.

System firmware changes that affect all systems

A problem was fixed for a GPU NVLINK writing out of range to an MMIO section of memory with byte-enabled writes that caused a machine check. With the fix, the out of range write is handled (detected) to cause a process core dump, but leaves the system in a usable state.

A problem was fixed for GPU workloads using unified memory with address translation service (ATS) sometimes hanging after resetting the GPUs. The trigger for the failure was putting the NPU in the fenced state via the "NPU Fence State" register with SCOM address 0x5011696. With the fix, the GPU fencing is handled using the NTL (NVLink Transaction Layer) reset register bits instead.

A problem was fixed for NPU log messages that were missing the CPU chip identifiers. With the fix, the CPU taking the HMI (Hypervisor Maintenance Interrupt) is listed along with the NPU FIR register values.

A problem was fixed for the On-Chip Controller (OCC) not being able to communicate with the GPUs for thermal monitoring or power capping. This means the GPUs could overheat or consume too much power for the configuration. The GPUs will continue to operate with the last power cap that was sent. The fans will increase to the maximum speed while in this mode where the OCC cannot read the GPU temperatures.

A problem was fixed for the On-Chip Controller inadvertently disabling the MMIO ATSD flush bits, thereby potentially reducing the performance of the address translation service (ATS) unified memory for the GPU.

A problem was fixed for user applications timing out on the GPU operation for accessing the address translation service (ATS) unified memory, causing an HMI and system termination. With the fix, the ATSD timeout has been disabled, so the user applications can wait for GPU read or write operations to be completed without regard for the time needed for the operation.

A problem was fixed for the SBE timer being stuck and unavailable to the host applications. This forces OPAL to use legacy timer loops for timers at the cost of additional processor bandwidth. Here are the messages that are logged for the problem that occurs on every boot:

[ 194.494559313,3] SBE: Timer stuck, falling back to OPAL pollers.

[ 194.494624185,3] SBE: You will likely have slower I2C and may have experienced increased jitter.

A problem was fixed for PCIe4 CX5 adapter performance, increasing performance by 40% for DMA read requests. The adapter affected is the Mellanox CX5 PCIe4 100Gb IB CAPI with feature codes #EC62 with CCIN 2CF1 and #EC64 with CCIN 2CF2. Without the fix, each read request requires a retry to work.

A problem was fixed for user code running on a GPU that can perform invalid commands to the MMIO space and cause an HMI that brings down the system. With the fix, ill-formatted commands to the MMIO space from the GPU will not be processed as a fatal exception but responses will be set to 0xFFFFFFFF and the GPU will receive a normal response code. The user GPU application can look for the bad response and fail, but the system will continue running without taking an HMI, allowing all other workloads to continue normally.

A problem was fixed in Petitboot V1.7.2 for Petitboot exiting to the shell with xCAT genesis in the menu when trying to do a network boot. Petitboot was timing out when trying to access the ftpserver but it was not doing the network re-queries necessary for a proper retry. If this error happens on a system, it can be made to boot with the following two steps:

1) Type the word "exit" and press the Enter key. This returns to the Petitboot menu.

2) Press the Enter key again to start the boot of the xCAT image.

OP9_v1.19_1.172 / OP910.22

06/22/18

Impact: Data Severity: HIPER

New features and functions

Support was added to provide the processor VPD data for the serial number and part number on the host OS. The information can be found under the /proc/device-tree/vpd/root-node-vpd directory path. For example, the following directory path contains the serial-number file for a processor:

"/proc/device-tree/vpd/root-node-vpd@a000/enclosure@1e00/backplane@800/processor@1000/serial-number"

System firmware changes that affect all systems

HIPER/Pervasive: A firmware change was made to address a rare case where a memory correctable error on POWER9 servers may result in an undetected corruption of data.

A problem has been fixed to not guard processor cores on memory checkstop errors resulting from a GPU failure. If this problem occurs, the processor cores can be restored by manually clearing the guard records.

A problem has been fixed for the NPU register data logging to include critical information for NVLINK failures such as the CPU chip identifiers. This information is needed to be able to isolate the cause of the NVLINK faults.

A problem has been fixed for systems unexpectedly running with all processors at lower frequencies than would be expected for Workload Optimized Frequency (WOF) ultra-turbo mode. There was no eSEL or callout for the processor causing the error that disabled the WOF mode. With the fix, there is an eSEL and callout for the WOF fault that identifies the errant processor that needs to be replaced.

A problem has been fixed for a PCIe adapter running in CAPP mode having a missing MMIO Base Address Register (BAR) entry that causes a failure of the adapter and a fence off of two of the four ports of the adapter.

A problem has been fixed for a slow start up of a process that can occur when the system had been previously in an idle state.

A problem has been fixed for a TOD error that can cause a soft lockup of the kernel. A 'soft lockup' is defined as a bug that causes the kernel to loop in kernel mode for more than 20 seconds, without giving other tasks a chance to run. The current stack trace is displayed upon detection and, by default, the system will stay locked up.

A problem was fixed for a failure in DDR4 RCD (Register Clock Driver) memory initialization that causes half of the DIMM memory to be unusable after an IPL. This is an intermittent problem where the memory can sometimes be recovered by doing another IPL. The error is not a hardware problem with the DIMM but an error in the initialization sequence needed to get the DIMM ready for normal operations.

A problem was fixed for a processor core that cannot be awakened or a timeout in the On Chip Controller when switching Workload Optimized Frequency (WOF) modes from disabled to enabled. These errors can cause a reduction in performance by running with fewer cores or by running at the safe mode frequencies.

OP9_v1.19_1.160 / OP910.21

05/18/18

Impact: Availability Severity: SPE

New features and functions

Support was added to enable Call Home eSELs, allowing system data such as On-Chip Controller (OCC) telemetry to be collected remotely.

Support has been removed from the XIVE interrupt controller for the store EOI operation. The hardware has limitations that would require a sync after each store EOI to make sure the MMIO operations that change the ESB state are ordered. This would be performance prohibitive, and the PCI Host Bridges (PHBs) do not support the synchronization.

System firmware changes that affect all systems

A problem was fixed for extraneous error logging and console messages for nonexistent NPU registers whenever a processor error occurs.

A problem was fixed for a false call out of a processor on an INTCQFIR[27]. This FIR bit should not call out the processor as the processor has not failed. The error is recoverable and should only serve as an early warning indication.

A problem was fixed for transactional memory that could result in a wrong answer for processes using it. This is a rare problem requiring L2 cache failures that can affect the process determining correctly if a transaction has completed.

A problem was fixed for Workload Optimized Frequency (WOF) where parts may have been manufactured with bad IQ data that requires filtering to prevent WOF from being disabled.

A problem was fixed for the opal-prd service consuming 100% of CPU during and after boot to the host. This is an infrequent intermittent problem that can be circumvented by a reboot of the system.

A problem was fixed for VRMs drawing current over the specification. This occurred whenever heavy workloads went above 372 amps with WOF enabled. At 372 amps, a rollover to value "0" for the current erroneously occurred and this allowed the frequency of the processors in the system to exceed the normally expected values.

A problem was fixed for the wrong DIMM being called out on over-temperature failures with B1xx2A30 errors logged. This should be a rare failure as it requires a DIMM to exceed its maximum specified operating temperature.

A problem has been fixed to clean up memory after a GPU has failed. This fix fences off failed GPUs on a GPU reset. The fencing ensures that access to memory behind the links will not lead to HMIs; instead, SUEs will be populated in cache. Before installing this fix, the NVIDIA Tesla driver must be updated in the Linux OS to version level 396.26 as a prerequisite. Feature #EC4J provides the NVIDIA Tesla V100 GPU with NVLINK Air-Cooled (16 GB) that requires the updated driver. Without this driver update, a GPU that has faulted and gone through a GPU reset can cause a Terminate Immediate (TI) or HMI for the system. The Tesla CUDA driver can be obtained at the direct NVIDIA link of "http://www.nvidia.com/download/driverResults.aspx/134380/en-us":

TESLA DRIVER FOR LINUX POWER RHEL 7

Version: 396.26

Release Date: 2018.5.17

Operating System: Linux POWER LE RHEL 7

CUDA Toolkit: 9.2

Language: English (US)

File Size: 47.26 MB

A problem has been fixed to add part and serial numbers to the processors when accessed through the device tree.

A problem has been fixed to make the OS aware of the DARN random number generator at 0x00200000 (PPC_FEATURE2_DARN) and the SCV syscall at 0x00100000 (PPC_FEATURE2_SCV). Without this fix, these service constants are not defined in the OS userspace.
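The two feature bits named above can be tested against the second hardware-capability word (AT_HWCAP2) that Linux passes to each process in the auxiliary vector. The following is a sketch only: the function names are hypothetical, and reading /proc/self/auxv as pairs of native unsigned longs is an assumption about the host environment.

```python
import struct

AT_HWCAP2 = 26                  # auxiliary-vector key for the second hwcap word on Linux
PPC_FEATURE2_DARN = 0x00200000  # value given in the text above
PPC_FEATURE2_SCV = 0x00100000   # value given in the text above


def decode_hwcap2(hwcap2):
    """Report whether the DARN and SCV bits are set in an AT_HWCAP2 value."""
    return {
        "darn": bool(hwcap2 & PPC_FEATURE2_DARN),
        "scv": bool(hwcap2 & PPC_FEATURE2_SCV),
    }


def read_hwcap2(auxv_path="/proc/self/auxv"):
    """Read AT_HWCAP2 from the auxiliary vector; return 0 when it is absent."""
    with open(auxv_path, "rb") as handle:
        data = handle.read()
    entry = struct.Struct("@LL")  # (key, value) pair of native unsigned longs
    usable = len(data) - len(data) % entry.size
    for key, value in entry.iter_unpack(data[:usable]):
        if key == AT_HWCAP2:
            return value
    return 0
```

On a host running the fixed firmware, decode_hwcap2(read_hwcap2()) would be expected to report both features as present.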

OP9_v1.19_1.154 / OP910.20

04/18/18

Impact: Availability Severity: SPE

This Service Pack includes updates in response to Recent Security Vulnerabilities, New Features & Functions and System Firmware Updates. Details of each are below:

Response for Recent Security Vulnerabilities

In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue number CVE-2017-5754 with firmware initializations augmenting an earlier fix provided in FW level OP910.10. Operating System updates are required in conjunction with the new FW level for addressing CVE-2017-5754.

New features and functions (not related to above CVE)

Support for voltage-droop monitors (VDM) to provide for improved system reliability during periods of unstable voltage from the power supply. The P9 processor uses an adaptive clock strategy to reduce the system power usage during power supply droop events by embedding analog VDMs that direct a digital phase-locked loop (DPLL) to immediately reduce clock frequency in response to the droop event.

Support for Workload Optimized Frequency (WOF). This feature provides the maximum processor frequency in order to increase system performance based on workload characteristics.

Support was added for using "ipmitool mc info" from the host OS to get the BMC firmware level.

Support was added to increase the number of NPU2 register contents dumped for NVLINK Hypervisor Maintenance Interrupts (HMIs) and to add logging for the HMI actions.

Support was added to make the Self Boot Engine (SBE) fault indicator bits recoverable. This means that if an SBE SEEPROM error occurs, recovery action will be taken to prevent an IPL failure or system outage.

System firmware changes that affect all systems

A problem was fixed for an On-Chip Controller (OCC) not going active caused by a race condition in the initialization of the OCCs. This problem is intermittent and can be resolved by a re-IPL of the system.

A problem was fixed for the BMC journal file getting overwritten with network change notifications when there is an IPv6 router in the local subnet.

A problem was fixed for the BMC version fields not being set as shown by "ipmitool mc info" and the Petitboot System Information UI. The BMC can be accessed by SSH (Secure Shell) login and the following command run to show the BMC firmware level: "cat /etc/os-release". Look for the "VERSION=" string that has the BMC version identifier appended to it.

A problem was fixed for the display of the power supply output voltage, which in one instance was showing as 390V instead of 12V. The voltage is at the right level but recent revisions of the power supply firmware had a change in how the output voltage was calculated, causing the displayed values to read too high.

A problem was fixed for the VLAN ID showing as "Disabled" with "ipmitool lan print 1" after the VLAN was set in-band by the OS. The VLAN is set correctly and functional, but the display of the VLAN information, while initially correct, went to "Disabled" during the first minute after the operation.

A problem was fixed for no amber fault LEDs being lit (or SELs reported) for front or rear fan rotors that have a RPM of zero due to blockage or other hardware error.

A problem was fixed for the host failing during a reset of the BMC when a host to BMC message had a time out. This problem is rare as the host normally stays up and running when the BMC is reset.

A problem was fixed for multi-rotor failures in a fan not causing a system shutdown, making it possible for the system to fail from an overheat condition that could be destructive to other system FRUs. This problem is rare as it requires that more than one rotor fail in a system fan at the same time.

A problem was fixed for a change or enablement of the NTP time server not forcing a network time synchronization, potentially leaving the BMC local time different from the network time. This problem can be circumvented by a reset of the BMC.

A problem was fixed for a BMC reset causing the On-Chip Controller (OCC) to fail and the system going into Safe mode. This is an infrequent problem that is triggered if a BMC reset and an OCC reset happen at the same time such that the BMC is unable to respond to OCC messages, forcing the OCC into a failed state.

A problem was fixed for error log "BC8A2502 - IPMI::RC_INVALID_SENDRECV" occurring during the system IPL. The On-Chip Controller (OCC) error is automatically recovered, so the error log does not impact the system.

A problem was fixed for error log "BC8A2507 - IPMI::RC_SENSOR_NOT_PRESENT" that can occur on a system power on if the BMC was reset at system runtime previously. When the BC8A2507 error occurs, the host uses the default value for the sensor data. The problem will persist for the IPL until the BMC is reset.

A problem was fixed for a power supply FQPSPPW0034M error persisting with enclosure fault LEDs lit even after the power supply problem has been corrected. The fault can be triggered by a momentary loss of AC or by unplugging and plugging AC into the power supply.

A problem was fixed for Coherent Accelerator Processor Proxy (CAPP) mode for the PCI Host Bridge (PHB) to improve DMA write performance by enabling channel tag streaming for the PHB. With this enabled, the DMA write does not have to wait for a response before sending a new write command on the bus.

A problem was fixed for the Open-Power Flash tool "pflash" failing with a blocklevel_smart_erase error during a pflash. This problem is infrequent and is triggered if pflash detects a smart erase fits entirely within one erase block.

A problem was fixed in the Petitboot user interface to handle cursor mode arrow keys for the VT100 'application' cursor to prevent mis-interpreting an arrow key as an escape key in some situations. For more information on the VT100 cursor keys, see http://www.tldp.org/HOWTO/Keyboard-and-Console-HOWTO-21.html.

A problem was fixed in the Petitboot user interface to cancel the autoboot if the user has exited the Petitboot user interface. This prevents the user dropping to the shell and then having the machine boot on them instead of waiting until the user is ready for the boot.

A problem was fixed in the Petitboot parsing of manually-specified configuration files that caused the parser to create file paths relative to the downloaded file's path, not the original remote path.

A problem was fixed for a failure to IPL with SRC BC8A0506 logged for a Phase Lock Loop (PLL) error in the PCIe Host Bridge (PHB). This problem is very infrequent. The fix does the correct call out of the failed FRU, allowing the IPL to continue.

A problem was fixed for a system IPL hang that shows in the log as the host going to a quiesce state with the OS inactive. This is a rare problem that may be recovered by a power off and re-IPL of the system. This problem is triggered by a higher than normal level of interrupts from the Power Supply Unit (PSU).

A problem was fixed for the VPD serial number not being updated on the replacement of a planar. The VPD update failed with the following message: "ERROR: (ECMD): ecmd - 'putvpdkeyword' returned with error code 0x20300001 (ERROR OPENING DECODE FILE). ERROR: A problem occurred updating the serial number(OSYS:SS). Please see previous output for reason ".

A problem was fixed for the CAS latency calculation for memory to improve its accuracy to reduce the potential for DIMM failures due to memory timing errors. Column Access Strobe (CAS) latency is the delay time between the moment a memory controller tells the memory module to access a particular memory column on a RAM module, and the moment the data from the given array location is available on the module's output pins.

A problem was fixed for clearing DIMM guard records when there was a repair marked in the VPD and that prevented the DIMM from being unguarded. With the fix, the VPD mark will be cleared if the guard record is cleared for the FRU, allowing it to be enabled on the next IPL.

A problem was fixed for the Self Boot Engine (SBE) error identification on failure. The SRR0/SRR1/LR/Local FI2C registers are now extracted to allow the following SBE errors to be identified:

100 - Program interrupt, promoted

101 - Instruction storage interrupt, promoted

110 - Alignment interrupt, promoted

111 - Data storage interrupt, promoted

A problem was fixed for read margins to improve the margins on DIMMs, reducing the number of DIMM failure occurrences.

A problem was fixed for the Hostboot reset to enable error recovery of Hostboot through the reset path. Without the fix, the Self Boot Engine (SBE) fails to reboot on the Hostboot reset, preventing error recovery for the Hostboot failures.

A problem was fixed for a memory training error that could cause DIMMs to be marked as bad or memory ports to be deconfigured. This problem is rare and triggered by an incorrect internal voltage level.

A problem was fixed for a Phase Lock Loop (PLL) error causing a checkstop but not calling out and guarding the failed hardware. There is then a chance the failure will recur on the next IPL of the system.

A problem was fixed to reduce memory latency in memory blocks where bad memory bits have been marked.

A problem was fixed for time and date fields being zero in Hostboot error log entries for the time/date of the error occurrence.

A problem was fixed for an extraneous "MCBISTFIR[3]: broadcast out of sync" error during memory diagnostics if a Register Clock Driver (RCD) parity error occurs. The "broadcast out of sync" error should be ignored when isolating the RCD fault. This problem is triggered if the RCD parity error occurs while the DDR4 memory is in broadcast mode.

A problem was fixed for BC8A2AC5 and BC8A2AC4 errors that prevented the reading of On-Chip Controller (OCC) thermal readings from the Analog Power Subsystem Sweep (APSS) bus. This is a very rare problem.

A problem was fixed for a processor fault that caused the master processor core to guard and prevented an IPL of the system with SRC BC13E540. With the fix, the system will IPL up on the available processor cores. This error only occurs if the master core is faulted. Faults on the other cores are handled correctly and do not stop an IPL.

A problem was fixed for a flood of OPAL error messages that can occur for a processor fault. The message "CPU ATTEMPT TO RE-ENTER FIRMWARE" appears as a large group of messages that precede the relevant error messages for the processor fault. A reboot of the system is needed to recover from this error.

OP9_v1.19_1.111 / OP910.10

01/18/2018

Impact: Availability Severity: SPE

This Service Pack includes updates in response to Recent Security Vulnerabilities, New Features & Functions and System Firmware Updates. Details of each are below:

Response for Recent Security Vulnerabilities

In response to recently reported security vulnerabilities, this firmware update is being released to address Common Vulnerabilities and Exposures issue numbers CVE-2017-5715, CVE-2017-5753 and CVE-2017-5754. Operating System updates are required in conjunction with this FW level for CVE-2017-5753 and CVE-2017-5754.

New features and functions (not related to above CVEs)

Support was added for increasing the number of BMC error logs from 100 to 200 and changing the error log to roll over old entries when full instead of stopping the logging of errors. Without this feature, the error log would get full at 100 entries and error logging would be stopped until some of the error logs were purged to make room for new entries.

Support was added to enable power supply redundancy.

Support was added to enable air-cooled fan control to optimize fan speeds for the temperature conditions and to improve fan speed control to minimize fan speed oscillation.

Support was added for advanced power supply fault monitoring to improve fault isolation, error detection, and reliability.

Support was added for forcing a new dump type, Checkstop, if the host has a checkstop. Without this new dump, critical debug information is missing because the /var/lib/obmc_console.log does not

System firmware updates (not related to above CVEs)

A problem was fixed for intermittent processor core hangs that caused checkstops with code "NCU no response to snooped TLBIE".

A problem was fixed for fans being reported as "Nonfunctional". This error occurred during peak loads on the BMC that tripped a watchdog process, causing the fans to speed up to the maximum speed. An error in the fan recovery to normal speed resulted in the "Nonfunctional" status.

A problem was fixed for a processor replacement that caused extra cores to be reported as present that do not exist. This happens if the new processor has fewer cores than the processor that is being replaced. This problem can be recovered by doing a factory reset on the BMC.

A problem was fixed for GPU temperatures not being reported on systems that have maximum DIMM configurations for the memory. Without the fix, the only workaround is to reduce the number of installed DIMMs, which frees On-Chip Controller (OCC) slots so that the missing GPU temperatures can be reported.

A problem was fixed for the host time inadvertently changing when a BMC time change is requested in NTP mode with Split ownership. The problem can be recovered by IPLing the host, after which the NTP server corrects the host time.

A problem was fixed for the host time skewing ahead in time after time ownership is split and the clock has been set from the BMC. The problem can be recovered by setting the correct time from the host.

A problem was fixed to reject the use of the path /org/openbmc on the REST API URIs. This affects the API /org/openbmc/sensors/host/PowerSupplyRedundancy which is no longer valid.

A problem was fixed for the BMC REST server going into a retry hang with the BMC becoming unresponsive when given a REST command with a bad data format. Without the fix, the REST server will repeatedly retry the bad command, causing a denial of service for all other users of the BMC.

A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=11 during an IPL. This intermittent problem was caused by an overflow of the total system power value from the OCC. The system can be recovered by retrying the IPL.

A problem was fixed for the ECC error recovery. Error recovery was not working and the ECC errors would prevent the boot.

A problem was fixed for an intermittent power on failure with message "Error in mapper call to get service name". To recover from this problem, power cycle the BMC and try the boot again.

A problem was fixed for an On-Chip Controller (OCC) read failure with ERRNO=19 during a power off of the system. This intermittent problem produces an extraneous error log that can be ignored, as the power off is successful.

A problem was fixed for an intermittent error message when activating firmware during a firmware update. This extraneous error message, an internal server error (500) returned on the REST enumerate request, occurred with moderate frequency. The error message can be ignored because there is no problem with the firmware activation.

A problem was fixed for the power button LED not blinking when in the standby state (not powered on). Without the fix, the power button always has a solid green LED, regardless of power on or power off state.

A problem was fixed for intermittent host checkstops caused by NCU and PCI time-out mismatches. PCI timeouts that are longer than NCU timeouts may cause checkstops on the host.

OP9_v1.19_1.94 / OP910.00

12/22/2017

Impact: New Severity: New

New features and functions for MTM 8335-GTG:

GA Level

4.0 Operating System Information

OS levels supported by the LC 8335 servers:

- Minimum level is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9), also known as RHEL 7.5-ALT LE, with third Z-stream or later (https://access.redhat.com/errata/RHBA-2018:2467 has the needed kernel "kernel-alt-4.14.0-49.10.1.el7a.src.rpm" )

- NVIDIA Tesla CUDA recommended driver level 396.44 or later, or minimum driver level 396.26 from the CUDA 9.2 toolkit

- Broadcom Ethernet driver level for the BCM5719 I/O adapter of 5719-v1.43 NCSI v1.4.22.0 or later.
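The CUDA driver levels in the list above can be compared numerically. The following is a minimal sketch that classifies an installed driver level against the recommended (396.44) and minimum (396.26) levels. The function names are hypothetical, and the installed level string is assumed to be supplied by the caller in the same dotted form.

```python
def parse_level(text):
    """Turn a dotted level such as '396.44' into a tuple of ints for comparison."""
    return tuple(int(part) for part in text.strip().split("."))

MINIMUM = parse_level("396.26")
RECOMMENDED = parse_level("396.44")

def classify_cuda_driver(installed):
    """Classify an installed driver level against the levels listed above."""
    level = parse_level(installed)
    if level >= RECOMMENDED:
        return "recommended"
    if level >= MINIMUM:
        return "minimum"
    return "too old"

if __name__ == "__main__":
    for sample in ("396.44", "396.26", "390.46"):   # sample levels only
        print(sample, classify_cuda_driver(sample))
```

Comparing tuples of integers avoids the pitfalls of comparing the levels as plain strings or as floating-point numbers.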

IBM Power LC 8335 servers support Linux, which provides a UNIX-like implementation across many computer architectures. Linux supports almost all of the Power System I/O and the configurator verifies support on order. For more information about the software that is available on IBM Power Systems, see the Linux on IBM Power Systems website:

http://www.ibm.com/systems/power/software/linux/index.html

4.1 Linux Operating System

The Linux operating system is an open source, cross-platform OS. It is supported on every Power Systems server IBM sells. Linux on Power Systems is the only Linux infrastructure that offers both scale-out and scale-up choices.

A supported version of Linux on the Power LC 8335 is Red Hat Enterprise Linux 7.5 for IBM Power LE (POWER9) (RHEL 7.5-ALT LE).

For additional questions about the availability of this release and supported Power servers, consult the Red Hat Hardware Catalog at

https://access.redhat.com/products/red-hat-enterprise-linux/#addl-arch.

For more information about Linux on Power, see the Linux on Power developer center at https://developer.ibm.com/linuxonpower/

For information about the features and external devices that are supported by Linux, see this website:

http://www.ibm.com/systems/power/software/linux/index.html

4.2 How to Determine the Level of a Linux Operating System

Use one of the following commands at the Linux command prompt to determine the current Linux level:

•cat /proc/version
•uname -a

The output string from the command will provide the Linux version level.
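To compare the running kernel against the minimum kernel named in section 4.0 (kernel-alt-4.14.0-49.10.1.el7a), a sketch such as the following could be used. It compares only the leading numeric fields of the release string, which is a simplification of full RPM version comparison. The function names are hypothetical.

```python
import platform
import re

MINIMUM_KERNEL = "4.14.0-49.10.1.el7a"   # kernel level named in section 4.0

def kernel_key(release):
    """Extract the leading numeric fields of a kernel release string as ints."""
    match = re.match(r"[\d.\-]+", release)
    fields = re.findall(r"\d+", match.group(0)) if match else []
    return tuple(int(field) for field in fields)

def meets_minimum(release, minimum=MINIMUM_KERNEL):
    """Return True if the release is at or above the minimum level."""
    return kernel_key(release) >= kernel_key(minimum)

if __name__ == "__main__":
    release = platform.release()   # the same string that "uname -r" prints
    print(release, "OK" if meets_minimum(release) else "below minimum")
```

The distribution suffix and architecture (for example ".el7a.ppc64le") are ignored, so this check says nothing about whether the kernel is actually a RHEL 7.5-ALT kernel.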

4.3 How to Determine if the opal-prd (Processor Recovery Diagnostics) package is installed

The opal-prd package on the Linux system collects the OPAL Processor Recovery Diagnostics messages to log file /var/log/syslog. It is recommended that this package be installed if it is not already present as it will help with maintaining the system processors by alerting the users to processor maintenance when needed.

On Red Hat Linux, run the command "rpm -qa | grep -i opal-prd". If the rpm for opal-prd is found and displayed in the command output, the package is installed on your system. This package provides a daemon to load and run the OpenPower firmware's Processor Recovery Diagnostics binary, which is responsible for run-time maintenance of Power hardware. If the package is not installed on your system, the following command can be run on Red Hat to install it:

sudo yum install opal-prd
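The package check can also be scripted. The sketch below mirrors the case-insensitive match of "rpm -qa | grep -i opal-prd" on the text output of "rpm -qa". The function names are hypothetical, and the package version strings in the usage note are made-up samples.

```python
import subprocess

def package_installed(rpm_qa_output, name="opal-prd"):
    """Return True if any line of 'rpm -qa' output contains the package name."""
    needle = name.lower()
    return any(needle in line.lower() for line in rpm_qa_output.splitlines())

def query_installed_packages():
    """Run 'rpm -qa' and return its text output (requires an RPM-based system)."""
    result = subprocess.run(["rpm", "-qa"], capture_output=True,
                            text=True, check=True)
    return result.stdout

if __name__ == "__main__":
    if package_installed(query_installed_packages()):
        print("opal-prd is installed")
    else:
        print("opal-prd is not installed")
```

For example, package_installed("opal-prd-5.10.1-2.el7a.ppc64le\n") returns True for that sample line, while output with no matching line returns False.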

5.0 How to Determine The Currently Installed Firmware Level

To display the PNOR level, use the following BMC command: "cat /var/lib/phosphor-software-manager/pnor/ro/VERSION"

To display the BMC level, use the following BMC command: "cat /etc/os-release"

Note: The "cat" commands are run after using ssh to log in to the BMC as root. The default password is 0penBmc (where 0 is the zero character).
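If the output of "cat /etc/os-release" is captured on a workstation, the level can be extracted with a small parser for the KEY=value format that os-release files use. This sketch assumes the BMC level appears in the VERSION_ID field. The sample text is a made-up illustration built from the BMC level named in section 1.0, not captured output.

```python
def parse_os_release(text):
    """Parse KEY=value lines in os-release format into a dict, dropping quotes."""
    fields = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        fields[key.strip()] = value.strip().strip('"').strip("'")
    return fields

# Hypothetical sample content; real BMC output contains more fields.
SAMPLE = 'ID="openbmc-phosphor"\nVERSION_ID="ibm-v2.0-0-r46"\n'

if __name__ == "__main__":
    print(parse_os_release(SAMPLE).get("VERSION_ID", "unknown"))
```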

6.0 Downloading the Firmware Package

Follow the instructions on Fix Central. You must read and agree to the license agreement to obtain the firmware packages.

7.0 Installing the Firmware

Note: Before updating to the OP910.24 or later firmware level, ensure that the Linux OS is at RHEL 7.5-ALT LE with the third Z-stream and the NVIDIA CUDA driver for the NVIDIA Tesla GPUs on the system is at the recommended driver level of 396.44 or later, or the minimum level 396.26. See "1.4 Required NVIDIA CUDA driver level for the Tesla V100 GPU" for more information. After the firmware update, ensure that the BCM5719 Ethernet driver is updated to level 5719-v1.43 NCSI v1.4.22.0. See "1.5 Required Broadcom Ethernet driver level for the BCM5719". The complete set of update instructions covering the OS, CUDA driver, firmware, and Ethernet driver can be found in the readme guide on Fix Central called "WSP_CUDA_BCM5719_FWUPG_GUIDE.txt".

7.1 IBM Power Systems Firmware maintenance

The updating and upgrading of system firmware depends on several factors, such as the current firmware that is installed and which operating system is running on the system.

These scenarios and the associated installation instructions are comprehensively outlined in the firmware section of Fix Central, found at the following website:

http://www.ibm.com/support/fixcentral/

Any hardware failures should be resolved before proceeding with the firmware updates to help ensure that the system will not be running degraded after the updates.

7.2 OpenBMC System Firmware Update using openbmctool

The process of updating firmware on the OpenBMC managed servers is documented below.

The sequence of events that must happen is the following:

•Power off the Host

•Update and Activate BMC

•Update and Activate PNOR

•Reboot the BMC (applies new BMC image)

•Power on the Host (applies new PNOR image)
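Because the order of these steps matters, an update script might track which steps are done and refuse to continue out of order. The following is a minimal sketch of such a check. The step names simply restate the list above, the function name is hypothetical, and no openbmctool commands are issued.

```python
REQUIRED_ORDER = [
    "power off host",
    "update and activate BMC",
    "update and activate PNOR",
    "reboot BMC",
    "power on host",
]

def next_step(completed):
    """Return the next required step, or None when the sequence is finished.

    'completed' is the list of step names already performed, in order.
    Raises ValueError if those steps do not follow the required order.
    """
    completed = list(completed)
    if completed != REQUIRED_ORDER[:len(completed)]:
        raise ValueError("steps were performed out of order: %r" % (completed,))
    if len(completed) == len(REQUIRED_ORDER):
        return None
    return REQUIRED_ORDER[len(completed)]

if __name__ == "__main__":
    done = []
    while True:
        step = next_step(done)
        if step is None:
            break
        print("next:", step)
        done.append(step)   # a real script would perform the step here
```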

The OpenBMC firmware updates (BMC and PNOR) for the LC 8335 servers can be managed via the command line with the openbmctool.

The openbmctool is obtained using the IBM Support Portal.

1.Go to the IBM Support Portal.
2.In the search field, enter your machine type and model. Then click the correct product support entry for your system.
3.From the Downloads list, click the openbmctool for your machine type and model.
4.Follow the instructions to install and run the openbmctool. You will need to provide the file locations of the BMC firmware image tar and PNOR firmware image tar that must be downloaded from Fix Central for the update level needed.

Information on the openbmctool and the firmware update process can be found in the IBM Knowledge Center:

https://www.ibm.com/support/knowledgecenter/POWER9/p9ei8/p9ei8_update_firmware_openbmctool.htm .

8.0 System Management and Virtualization

The service processor, or baseboard management controller (BMC), provides a hypervisor and operating system-independent layer that uses the robust error detection and self-healing functions that are built into the POWER processor and memory buffer modules. Open Power Abstraction Layer (OPAL) is the system firmware in the stack of POWER processor-based Linux-only servers.

8.1 BMC Service Processor IPMI

The service processor, or baseboard management controller (BMC), is the primary control for autonomous sensor monitoring and event logging features on the LC server.

The BMC supports the Intelligent Platform Management Interface (IPMI) for system monitoring and management. The BMC monitors the operation of the firmware during the boot process and also monitors the OPAL hypervisor for termination.

8.2 Open Power Abstraction Layer (OPAL)

The Open Power Abstraction Layer (OPAL) provides hardware abstraction and run time services to the running host Operating System.

For the 8335 servers, only the OPAL bare-metal installs can be used.

Find out more about OPAL skiboot here:

https://github.com/open-power/skiboot

8.3 Intelligent Platform Management Interface (IPMI)

The Intelligent Platform Management Interface (IPMI) is an open standard for monitoring, logging, recovery, inventory, and control of hardware that is implemented independent of the main CPU, BIOS, and OS. The LC 8335 servers provide one 10M/100M baseT IPMI port.

The ipmitool is a utility for managing and configuring devices that support IPMI. It provides a simple command-line interface to the service processor. You can install the ipmitool from the Linux distribution packages or from sourceforge.net, either on your workstation or on another server (preferably on the same network as the installed server).

For installing ipmitool from sourceforge, please see section 1.1 "Minimum ipmitool Code Level".

There are several good references for ipmitool commands:

The man page

The built-in command line help provides a list of IPMItool commands:
# ipmitool help

You can also get help for many specific IPMItool commands by adding the word help after the command:
# ipmitool channel help

For a list of common ipmitool commands and help on each, you may use the following link:
www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabpcommonipmi.htm

To connect to your host system with IPMI, you need to know the IP address of the server and have a valid password. To power on the server with the ipmitool, follow these steps:

1. Open a terminal program.

2. Power on your server with the ipmitool:

ipmitool -I lanplus -H bmc_ip_address -P ipmi_password power on

3. Activate your IPMI console:

ipmitool -I lanplus -H bmc_ip_address -P ipmi_password sol activate
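When these calls are scripted, building the argument list in one place keeps the connection options consistent. The sketch below assembles exactly the options used in the steps above (-I lanplus, -H, -P). The function names are hypothetical, and the address and password in the usage note are placeholders.

```python
import subprocess

def ipmitool_args(bmc_ip_address, ipmi_password, *command):
    """Build the argument list for an ipmitool call over the lanplus interface."""
    return ["ipmitool", "-I", "lanplus", "-H", bmc_ip_address,
            "-P", ipmi_password, *command]

def run_ipmitool(bmc_ip_address, ipmi_password, *command):
    """Run ipmitool and return its exit status (ipmitool must be installed)."""
    return subprocess.call(ipmitool_args(bmc_ip_address, ipmi_password, *command))
```

For example, run_ipmitool("192.0.2.10", "ipmi_password", "power", "on") corresponds to step 2, and the same call with "sol", "activate" corresponds to step 3. Passing the arguments as a list avoids shell quoting problems with the password.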

8.4 Petitboot bootloader

Petitboot is a kexec-based bootloader used by IBM POWER9 systems for doing the bare-metal installs on the 8335 servers.

After the POWER9 system powers on, the petitboot bootloader scans local boot devices and network interfaces and returns a list of the boot options that are available to the system. If you are using a static IP or if you did not provide boot arguments in your network boot server, you must provide the details to petitboot. You can configure petitboot to find your boot with the following instructions:

https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootadvanced.htm

You can edit petitboot configuration options, change the amount of time before Petitboot automatically boots, etc. with these instructions:

https://www.ibm.com/support/knowledgecenter/linuxonibm/liabp/liabppetitbootconfig.htm

After you select to boot the ISO media for the Linux distribution of your choice, the installer wizard for that Linux distribution walks you through the steps to set up disk options, your root password, time zones, and so on.

You can read more about the petitboot bootloader program here:

https://www.kernel.org/pub/linux/kernel/people/geoff/petitboot/petitboot.html

9.0 Quick Start Guide for Installing Linux on the LC 8335 server

This guide helps you install Linux on a Power Systems server.

Overview

Use the information found in http://www.ibm.com/support/knowledgecenter/linuxonibm/liabw/liabwkickoff.htm to install Linux on a non-virtualized (bare metal) IBM Power LC server.

10.0 Change History

Date	Description
08/14/2018	Updated for OP910.24
06/22/2018	Updated for OP910.22
05/18/2018	Updated for OP910.21, added driver level for Tesla CUDA driver
04/18/2018	Updated for OP910.20
03/22/2018	Corrections for OP910.10
01/18/2018	Updated for AC922 only for OP910.10
12/22/2017	New for LC server OP910.00 release