IBM Spectrum MPI 10.1.0 Fix Pack 4 Readme File

Description: Readme documentation for IBM Spectrum MPI 10.1.0 Fix Pack 4, including installation instructions, prerequisites, and a list of fixes.

Readme file for: IBM Spectrum MPI

Product/Component Release: 10.1.0

Update Name: Fix pack 4

Fix ID: Spectrum_MPI_10.01.00.04

Publication date: 29 August 2017

Last modified date: 29 August 2017

1. Product website

View the IBM Spectrum MPI 10.1.0 Fix Pack 4 website (http://www.ibm.com/systems/spectrum-computing/products/mpi/index.html).

2. Products or components affected

IBM Spectrum MPI

3. System requirements

3.1 IBM Spectrum MPI for x86_64 Linux

3.2 IBM Spectrum MPI for Power 8 Linux

4. Installation and configuration

4.1 Before installation

None.

4.2 Installation steps

IBM Spectrum MPI must be installed on all machines in the same directory or be accessible through the same shared network path. The following describes the process of installing the product using the RPM toolset.

Step 1: Obtain software packages.

Step 2: You must have root authority to install the package.

Step 3: Install RPMs and accept the license packages.

  1. The IBM Spectrum MPI license must be installed and accepted on each node of a cluster in order to successfully install IBM Spectrum MPI. The license must be installed either before you install the IBM Spectrum MPI component base RPM, or at the same time. See the following instructions for installing the packages concurrently:
    1. Select the appropriate RPMs for the platform you are using.
    2. Log in as root.
    3. Determine how the IBM Spectrum MPI license will be processed:
      • For an unattended installation, set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes so that the license is accepted automatically when the license RPM is installed.
      • To review the license terms and accept the license manually, set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=no.
    4. Use the rpm -i command to install the IBM Spectrum MPI product and license RPMs. By default, both the product and license files are installed to the /opt/ibm/spectrum_mpi directory. This can be changed by providing an alternate installation directory to the rpm -i command with the --prefix flag. If --prefix is used, the same alternate location must be given for both the product and license RPMs.
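
For example, an unattended installation of both packages into the default location might look like the following. The RPM file names shown here are illustrative only; use the actual file names delivered with the fix pack:

export IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes
rpm -i ibm_smpi_lic_s-10.1.0.4-*.rpm ibm_smpi-10.1.0.4-*.rpm

To install into an alternate location, give the same --prefix value for both RPMs:

rpm -i --prefix /opt/shared/spectrum_mpi ibm_smpi_lic_s-10.1.0.4-*.rpm ibm_smpi-10.1.0.4-*.rpm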

4.3 After installation

If you did not set the environment variable IBM_SPECTRUM_MPI_LICENSE_ACCEPT=yes, the license RPM installation will print the location of a license acceptance script that must be run to review and accept the license.

4.4 Uninstalling

None.

5. List of changes

5.1 Changes in IBM Spectrum MPI 10.1 Fix Pack 4

Spectrum MPI one-sided operations are not thread safe. Multi-threaded applications that request MPI_THREAD_MULTIPLE and use MPI one-sided operations may experience silent data corruption.

Spectrum MPI 10.1.0.4 therefore disables all MPI one-sided communication for jobs that initialize the MPI execution environment with MPI_THREAD_MULTIPLE. This is done in the MPI_Init_thread() API, when MPI_THREAD_MULTIPLE is requested.

 

Users of Spectrum MPI 10.1.0.4 who use MPI_Init_thread() with MPI_THREAD_MULTIPLE will see the following error at the first call to MPI_Win_create():

[<host>,<pid>] *** An error occurred in MPI_Win_create
[<host>,<pid>] *** reported by process [3497459713,0]
[<host>,<pid>] *** on communicator MPI_COMM_WORLD
[<host>,<pid>] *** MPI_ERR_WIN: invalid window
[<host>,<pid>] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[<host>,<pid>] ***    and potentially your MPI job)

There are no known issues using MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, or MPI_THREAD_SERIALIZED.
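
The following minimal C sketch (illustrative only; error handling is omitted) shows the call sequence described above. With MPI_THREAD_MULTIPLE requested, the MPI_Win_create() call reports the error shown; with any of the other thread levels it succeeds:

#include <mpi.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    int provided;
    MPI_Win win;
    int *buf;

    /* Request full thread support; Spectrum MPI 10.1.0.4 disables
       one-sided communication when MPI_THREAD_MULTIPLE is requested. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

    buf = malloc(1024 * sizeof(int));

    /* Under MPI_THREAD_MULTIPLE this call aborts with MPI_ERR_WIN as
       shown above; under MPI_THREAD_SINGLE, MPI_THREAD_FUNNELED, or
       MPI_THREAD_SERIALIZED it succeeds. */
    MPI_Win_create(buf, 1024 * sizeof(int), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    MPI_Win_free(&win);
    free(buf);
    MPI_Finalize();
    return 0;
}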

The MPI standard defines several APIs to allow for synchronization between the different MPI processes involved in a one-sided operation. In some cases, when Spectrum MPI is used with the PAMI interconnect protocol, these one-sided synchronization calls may return early, before the one-sided remote operation is complete. The early return is intermittent and not reliably reproducible in all environments.

 

The following MPI APIs may experience an early return, before the remote operation is complete (an illustrative fragment follows the list):

 

o   MPI_Win_fence()

o   MPI_Win_unlock()

o   MPI_Win_complete()

o   MPI_Win_wait()
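
As an illustration of the affected pattern (a minimal C sketch, assuming origin_buf, count, target_rank, and win have already been set up by the application), the origin process below relies on MPI_Win_unlock() to guarantee that the MPI_Put() data has reached the target before it continues:

/* Illustrative passive-target fragment: MPI_Win_unlock() is expected to
   guarantee remote completion of the preceding MPI_Put(). With the
   early-return problem described above, that guarantee may not hold. */
MPI_Win_lock(MPI_LOCK_EXCLUSIVE, target_rank, 0, win);
MPI_Put(origin_buf, count, MPI_DOUBLE,
        target_rank, 0, count, MPI_DOUBLE, win);
MPI_Win_unlock(target_rank, win);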

 

Silent data corruption will occur with IBM Spectrum MPI when using MPI_Reduce_scatter(), MPI_Ireduce_scatter(), MPI_Reduce_scatter_block(), and/or MPI_Ireduce_scatter_block() for a message buffer size of 8 KB or larger, when the datatype size and the datatype extent are different.

The datatype size is the amount of data that will be transferred by the MPI collective operation: the number of bytes of actual data contained in one instance of the datatype.

The datatype extent is the amount of memory spanned from the first byte to the last byte occupied by one instance of this datatype. The extent may include padding space between successive elements to allow for better memory alignment of the datatype.

 

User-defined non-contiguous datatypes are the most common case in which the datatype size and the datatype extent do not match.
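
For example (an illustrative C fragment, not taken from the product documentation), a strided vector datatype has a size that is smaller than its extent:

/* 4 blocks of 1 MPI_DOUBLE with a stride of 2 doubles:
   size   = 4 * 8 bytes                 = 32 bytes of actual data
   extent = ((4 - 1) * 2 + 1) * 8 bytes = 56 bytes from first to last byte */
MPI_Datatype strided;
int      size;
MPI_Aint lb, extent;

MPI_Type_vector(4, 1, 2, MPI_DOUBLE, &strided);
MPI_Type_commit(&strided);

MPI_Type_size(strided, &size);              /* size   = 32 */
MPI_Type_get_extent(strided, &lb, &extent); /* extent = 56 */

A buffer of such a datatype, transferred by one of the reduce-scatter collectives above with a total message size of 8 KB or more, falls into the affected case.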

 

As a workaround with previous versions of IBM Spectrum MPI 10.1.0.x, the issue can be avoided by disabling the specific collective algorithm. Add the following options to the mpirun command line; they avoid the issue for both the blocking and non-blocking MPI APIs:

 

-mca coll_ibm_skip_reduce_scatter true -mca coll_ibm_skip_ireduce_scatter true

Silent data corruption may occur when using the MPI_Ireduce non-blocking collective with 4 or fewer ranks participating in the collective, a message size of 64 KB or larger, and MPI_IN_PLACE.

Customers will experience silent data corruption using the MPI_Ireduce collective with Spectrum MPI when all of the following conditions are met (a minimal sketch of this pattern follows the list):

 

o   MPI_Ireduce collective

o   Using the MPI_IN_PLACE directive

o   Rank count in the collective is less than or equal to 4

o   The message size is 64 KB or larger

o   The collective is selected from the libnbc collective library
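
The fragment below (illustrative only; rank is the caller's rank, buffer setup and error handling are omitted, and the job is assumed to run on 4 or fewer ranks) shows a call pattern that meets all of these conditions:

/* In-place non-blocking reduce of 16384 doubles (128 KB) to rank 0. */
int count = 16384;
double *buf = malloc(count * sizeof(double));
MPI_Request req;

if (rank == 0)
    MPI_Ireduce(MPI_IN_PLACE, buf, count, MPI_DOUBLE,
                MPI_SUM, 0, MPI_COMM_WORLD, &req);
else
    MPI_Ireduce(buf, NULL, count, MPI_DOUBLE,
                MPI_SUM, 0, MPI_COMM_WORLD, &req);

MPI_Wait(&req, MPI_STATUS_IGNORE);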

 

Silent data corruption may occur for Spectrum MPI applications that use the one-sided operations MPI_Put() or MPI_Accumulate() with user-defined non-contiguous datatypes on Power systems. 

This error only affects applications running on Power (ppc64le) systems.  Applications running on x86_64 systems are not affected.

Spectrum MPI uses the PAMI interconnect protocol (-pami) by default. The Mellanox fabric collectives must be explicitly requested using the mpirun command line option -HCOLL/-hcoll or -FCA/-fca. When these options are used, some MPI_Reduce() operations can result in silent data corruption.

 

As a workaround with previous versions of IBM Spectrum MPI 10.1.0.x, users can add "-x HCOLL_ML_DISABLE_REDUCE=1" to the mpirun command line. This disables the HCOLL MPI_Reduce() algorithm and causes another MPI_Reduce() algorithm to be selected to complete the operation.
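
For example (an illustrative command line; the rank count and application name are placeholders), the workaround can be combined with the option that requests the Mellanox collectives:

mpirun -np 16 -hcoll -x HCOLL_ML_DISABLE_REDUCE=1 ./my_app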

MPI_Gatherv(), MPI_Igatherv(), MPI_Allgatherv(), MPI_Iallgatherv(), MPI_Alltoallv(), and/or MPI_Ialltoallv() all pack data before transmission and unpack data after transmission. User-defined datatypes with a true lower bound that is not zero are not unpacked correctly. This may cause silent data corruption in the receive-side buffer, or may result in segmentation violations (segfaults).

 

As a workaround with previous versions of IBM Spectrum MPI 10.1.0.x, the specific IBM collective algorithms may be disabled to avoid the issue. To disable all coll/ibm collectives with a single mpirun command line option:

-mca coll ^ibm

 

Alternatively, disable just those specific collectives (blocking and non-blocking) affected by the issue. This is the preferred method (an illustrative fragment follows the list of options):

-mca coll_ibm_skip_gatherv t[rue]
-mca coll_ibm_skip_allgatherv t[rue]
-mca coll_ibm_skip_alltoallv t[rue]

-mca coll_ibm_skip_igatherv t[rue]
-mca coll_ibm_skip_iallgatherv t[rue]
-mca coll_ibm_skip_ialltoallv t[rue]
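
As an illustration of the affected case (a minimal C fragment, not from the product documentation, assuming sendbuf, recvbuf, recvcounts, and displs are already set up), a user-defined datatype whose data begins at a non-zero byte displacement has a non-zero true lower bound:

/* One MPI_DOUBLE located at byte offset 8: the true lower bound is 8. */
MPI_Datatype shifted;
MPI_Aint disp = 8;
int blocklen = 1;

MPI_Type_create_hindexed(1, &blocklen, &disp, MPI_DOUBLE, &shifted);
MPI_Type_commit(&shifted);

/* Receiving with such a type exercises the pack/unpack path above. */
MPI_Gatherv(sendbuf, 1, MPI_DOUBLE,
            recvbuf, recvcounts, displs, shifted,
            0, MPI_COMM_WORLD);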

 

When MPI_Type_create_darray() is used with a distribs[] argument that contains both MPI_DISTRIBUTE_BLOCK and MPI_DISTRIBUTE_CYCLIC, the resulting datatype may have incorrect offsets and thus produce silent data corruption of data transferred with MPI calls that use the affected datatype.

The datatype layout is incorrect, causing data to be placed at the wrong memory offsets.
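
For example (an illustrative C fragment with assumed global sizes, a 2 x 2 process grid, and rank being the caller's rank in a four-process job), a two-dimensional distributed array that is block-distributed in the first dimension and cyclically distributed in the second uses such a mixed distribs[] list:

/* 1024 x 1024 global array: block in dimension 0, cyclic in dimension 1. */
MPI_Datatype darray;
int gsizes[2]   = {1024, 1024};
int distribs[2] = {MPI_DISTRIBUTE_BLOCK, MPI_DISTRIBUTE_CYCLIC};
int dargs[2]    = {MPI_DISTRIBUTE_DFLT_DARG, MPI_DISTRIBUTE_DFLT_DARG};
int psizes[2]   = {2, 2};

MPI_Type_create_darray(4, rank, 2, gsizes, distribs, dargs, psizes,
                       MPI_ORDER_C, MPI_DOUBLE, &darray);
MPI_Type_commit(&darray);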

MPI_Accumulate() is used to combine data received from a remote process with data that already exists in a local process. One typical programming model is to allow multiple remote ranks to MPI_Accumulate() data into a single target buffer.

 

In some cases, silent data corruption can occur when multiple ranks use MPI_Accumulate() to combine data in a single target buffer. This silent data corruption occurs only when the RDMA one-sided component (osc=rdma) is used.

 

This issue applies only to supported configurations on x86_64 systems.

 

The issue is not present in the PAMI (osc=pami) or point-to-point (osc=pt2pt) components.

 

There is no workaround for this issue when using the osc=rdma component.
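
The affected pattern looks roughly like the following fragment (illustrative only; the window win and the buffer local_buf are assumed to have been created earlier), in which every rank other than 0 accumulates into the same target buffer on rank 0:

MPI_Win_fence(0, win);

if (rank != 0) {
    /* All non-zero ranks sum their contribution into rank 0's buffer. */
    MPI_Accumulate(local_buf, count, MPI_DOUBLE,
                   0, 0, count, MPI_DOUBLE, MPI_SUM, win);
}

MPI_Win_fence(0, win);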

Silent data corruption may occur when using MPI_Recv, MPI_Irecv, MPI_Sendrecv, MPI_Recv_init, MPI_Mrecv, and/or MPI_Imrecv to receive data into a user-defined non-contiguous receive buffer that is allocated on the GPU. The gaps in the receive data buffer may be overwritten during the data transfer.
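
As an illustration (a minimal C fragment, assuming a CUDA-aware build and that src_rank and tag are chosen by the application), the affected case is a receive whose datatype leaves gaps in a GPU-resident buffer:

/* Strided (non-contiguous) receive datatype: every other double. */
MPI_Datatype strided;
double *d_buf;

MPI_Type_vector(128, 1, 2, MPI_DOUBLE, &strided);
MPI_Type_commit(&strided);

cudaMalloc((void **)&d_buf, 256 * sizeof(double));   /* GPU-resident buffer */

/* The gaps between the received elements in d_buf are the bytes that
   may be overwritten by the problem described above. */
MPI_Recv(d_buf, 1, strided, src_rank, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);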

 

5.2 Changes in IBM Spectrum MPI 10.1 Fix Pack 3

Silent data corruption may occur when using the MPI_Ireduce non-blocking collective with 4 or fewer ranks participating in the collective, a message size of 64 KB or larger, and MPI_IN_PLACE.

 

Customers will experience silent data corruption using the MPI_Ireduce collective with Spectrum MPI when the following conditions are met:

 

o   MPI_Ireduce collective

o   Using the MPI_IN_PLACE directive

o   Rank count in the collective is less than or equal to 4

o   The message size is 64 KB or larger

o   The collective is selected from the libnbc collective library

5.3 Changes in IBM Spectrum MPI 10.1 Fix Pack 2

·      Update to Open MPI 2.0.1

·      Add support for STAT Debugger

·      Add support for PGI Compilers

·      Add support for Mellanox HCOLL on RH 7.3 with MOFED 3.4 (Power only)

·      Add support for usNIC (x86_64 only)

·      Add support for PSM2 on RH 7 and SLES 12 (x86_64 only)

·      Add "-pami_noib" option to allow PAMI shmem use on a single node without InfiniBand (Power only)

·      Add check for license acceptance in $MPI_ROOT and default install location (/opt/ibm/spectrum_mpi)

·      Add support for Dynamic Connect Transport

·      Add support for non-blocking collectives with CUDA Aware

·      Fix issue with serial CUDA jobs

·      Fix Epoll ADD failure

·      Fix "mpirun --debug" option

·      Fix "-stdio file" option

·      Fix LD_PRELOAD to honor user settings

·      Fix shmem object leaks in PAMI

·      General improvement for IBM libcoll

·      General improvements for RMA (1sided) operations

 

5.4 Changes in IBM Spectrum MPI 10.1 Fix Pack 1

In RMA operations, when a communication epoch is closed with one of the following synchronization APIs, the application could experience silent data corruption of the receive-side memory buffers:

6. Copyright and trademark information

Copyright IBM Corporation 2017. U.S. Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.

IBM, the IBM logo and ibm.com are trademarks of International Business Machines Corp., registered in many jurisdictions worldwide. Other product and service names might be trademarks of IBM or other companies. A current list of IBM trademarks is available on the Web at "Copyright and trademark information" at www.ibm.com/legal/copytrade.shtml.