Total time spent on program execution, including the idle times of CPUs
reserved for slave threads during OpenMP sequential execution. This pattern
assumes that every thread of a process is allocated a separate CPU during the
entire runtime of the process.
Time spent on program execution but without the idle times of slave threads
during OpenMP sequential execution. For pure MPI applications, this pattern
is equal to Time.
Time spent performing major tasks related to trace generation, such as time
synchronization or dumping the trace-buffer contents to a file. Note that
the normal per-event overhead is not included.
This pattern covers the time spent waiting in front of an MPI barrier, that
is, the time inside the barrier call until the last process has reached the
barrier. A large amount of waiting time spent in front of barriers can be an
indication of load imbalance.
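For illustration, a minimal sketch of such a situation (assuming two or more
MPI ranks and using an artificial sleep() call to model an imbalanced
workload) could look like this:

    #include <mpi.h>
    #include <unistd.h>

    /* Rank 0 computes longer than the others, so every other rank
     * accumulates waiting time inside MPI_Barrier(). */
    int main(int argc, char **argv)
    {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0)
            sleep(2);                    /* imbalanced workload on rank 0 */

        MPI_Barrier(MPI_COMM_WORLD);     /* all other ranks wait here */

        MPI_Finalize();
        return 0;
    }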
Refers to the waiting time caused by a blocking receive operation
(e.g., MPI_Recv() or MPI_Wait()) that is posted earlier than the
corresponding send operation.
If the receiving process is waiting for multiple messages to arrive (e.g.,
in a call to MPI_Waitall()), the maximum waiting time is accounted for, i.e.,
the waiting time due to the latest sender.
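A minimal sketch of a Late Sender situation (assuming two MPI ranks and an
artificial sleep() call in place of real computation):

    #include <mpi.h>
    #include <unistd.h>

    /* Rank 0 posts MPI_Recv() immediately, while rank 1 computes before
     * sending.  The time rank 0 spends blocked in MPI_Recv() is the
     * waiting time accounted by this pattern. */
    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Recv(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);        /* waits for the late sender */
        } else if (rank == 1) {
            sleep(2);                           /* send is posted late */
            MPI_Send(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }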
A Late Sender situation may be the result of messages that are
received in the wrong order. If a process expects messages from one or more
processes in a certain order while those processes send them in a different
order, the receiver may need to wait because it tries to receive early a
message that was sent late. This situation can often be avoided by receiving
the messages in the order in which they were sent. This pattern refers to the
time spent in a wait state as a result of this situation.
This pattern comes in two different flavors:
The messages involved were sent from the same source location
The messages involved were sent from different source locations
See the description of the corresponding specializations for more details.
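A minimal sketch of the same-source flavor (assuming two MPI ranks; the
sleep() call merely models computation between the two sends):

    #include <mpi.h>
    #include <unistd.h>

    /* Rank 1 sends two messages, tag 0 first and tag 1 second, with some
     * computation in between.  Rank 0, however, tries to receive the tag-1
     * message first and therefore waits, although the tag-0 message is
     * already available and could be processed in the meantime. */
    int main(int argc, char **argv)
    {
        int rank, a = 0, b = 1;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 1) {
            MPI_Send(&a, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);   /* sent first  */
            sleep(2);
            MPI_Send(&b, 1, MPI_INT, 0, 1, MPI_COMM_WORLD);   /* sent second */
        } else if (rank == 0) {
            MPI_Recv(&b, 1, MPI_INT, 1, 1, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);   /* expects the second message first */
            MPI_Recv(&a, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }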
A send operation may be blocked until the corresponding receive operation
is called. This can happen for several reasons: either the MPI
implementation is working in synchronous mode by default, or the size of the
message to be sent exceeds the available MPI-internal buffer space, so that
the operation blocks until the data can be transferred to the receiver. The
pattern refers to the time spent waiting as a result of this situation.
Note that this pattern does not currently apply to nonblocking sends
waiting in the corresponding completion call, e.g., MPI_Wait().
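A minimal sketch of such a blocked send (assuming two MPI ranks; MPI_Ssend()
is used here to force synchronous completion, and sleep() models the
receiver's computation):

    #include <mpi.h>
    #include <unistd.h>

    /* The synchronous send on rank 0 cannot complete before the matching
     * receive has started, so rank 0 blocks while rank 1 is still
     * computing.  A standard MPI_Send() may behave the same way if the
     * message exceeds the MPI-internal buffer space. */
    int main(int argc, char **argv)
    {
        int rank, value = 42;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            MPI_Ssend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);  /* blocks */
        } else if (rank == 1) {
            sleep(2);                              /* receive posted late */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
        }

        MPI_Finalize();
        return 0;
    }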
Collective communication operations that send data from all processes to
one destination process (i.e., n-to-1) may suffer from waiting times if the
destination process enters the operation earlier than its sending
counterparts, that is, before any data could have been sent. The pattern
refers to the time lost as a result of this situation. It applies to the
MPI calls MPI_Reduce(), MPI_Gather() and MPI_Gatherv().
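A minimal sketch of this n-to-1 wait situation (the artificial sleep() call
stands in for computation on the non-root ranks):

    #include <mpi.h>
    #include <unistd.h>

    /* The root (rank 0) enters MPI_Reduce() immediately, while the other
     * ranks compute first.  The time the root spends waiting before any
     * data can arrive is reported under this pattern. */
    int main(int argc, char **argv)
    {
        int rank, in, out;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        in = rank;

        if (rank != 0)
            sleep(2);             /* senders enter the operation late */

        MPI_Reduce(&in, &out, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }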
MPI_Scan operations may suffer from waiting times if the process with rank
n enters the operation earlier than its sending counterparts (i.e.,
ranks 0..n-1). The pattern refers to the time lost as a result of
this situation.
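A minimal sketch for MPI_Scan() (again using an artificial sleep() call to
delay the lower ranks):

    #include <mpi.h>
    #include <unistd.h>

    /* The highest rank enters MPI_Scan() immediately, but it cannot
     * complete before the lower ranks (its sending counterparts) have
     * entered the operation as well. */
    int main(int argc, char **argv)
    {
        int rank, size, in, out;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        in = rank;

        if (rank < size - 1)
            sleep(2);             /* ranks 0..n-1 enter the operation late */

        MPI_Scan(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }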
Collective communication operations that send data from one source process
to all processes (i.e., 1-to-n) may suffer from waiting times if
destination processes enter the operation earlier than the source process,
that is, before any data could have been sent. The pattern refers to the
time lost as a result of this situation. It applies to the MPI calls
MPI_Bcast(), MPI_Scatter() and MPI_Scatterv().
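A minimal sketch of this 1-to-n wait situation (the sleep() call models
computation on the root before it enters the operation):

    #include <mpi.h>
    #include <unistd.h>

    /* The non-root ranks enter MPI_Bcast() immediately, but no data can
     * arrive before the root (rank 0) has entered the operation. */
    int main(int argc, char **argv)
    {
        int rank, value = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            sleep(2);             /* the source process enters late */
            value = 42;
        }

        MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }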
Collective communication operations that send data from all processes to
all processes (i.e., n-to-n) exhibit an inherent synchronization among all
participants, that is, no process can finish the operation until the last
process has started it. This pattern covers the time spent in n-to-n
operations until all processes have reached it. It applies to the MPI calls
MPI_Reduce_scatter(), MPI_Allgather(), MPI_Allgatherv(), MPI_Allreduce(),
MPI_Alltoall(), MPI_Alltoallv().
Note that the time reported by this pattern is not necessarily completely
waiting time since some processes could -- at least theoretically
-- already communicate with each other while others have not yet entered
the operation.
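A minimal sketch of this n-to-n wait situation (the sleep() call delays one
rank so that all others accumulate time inside the operation):

    #include <mpi.h>
    #include <unistd.h>

    /* Rank 0 enters MPI_Allreduce() late, so the remaining ranks spend
     * time inside the operation before the last process has started it. */
    int main(int argc, char **argv)
    {
        int rank, in, out;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        in = rank;

        if (rank == 0)
            sleep(2);             /* last process to reach the operation */

        MPI_Allreduce(&in, &out, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD);

        MPI_Finalize();
        return 0;
    }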
This pattern refers to the time spent in MPI n-to-n collectives after the
first process has left the operation.
Note that the time reported by this pattern is not necessarily completely
waiting time since some processes could -- at least theoretically -- still
communicate with each other while others have already finished
communicating and exited the operation.
Idle time on CPUs that may be reserved for teams of threads when the
process is executing sequentially before and after OpenMP parallel regions,
or with less than the full team within OpenMP parallel regions.
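A minimal sketch illustrating where such idle time arises (do_serial_setup()
is a hypothetical placeholder for sequential work):

    #include <omp.h>
    #include <stdio.h>

    /* While do_serial_setup() runs on the master thread only, the CPUs
     * reserved for the other threads of the team are idle; this idle time
     * is attributed to the pattern described above. */
    static void do_serial_setup(void) { /* sequential work on the master */ }

    int main(void)
    {
        do_serial_setup();            /* slave CPUs are idle here */

        #pragma omp parallel
        {
            printf("thread %d working\n", omp_get_thread_num());
        }

        do_serial_setup();            /* ...and here again */
        return 0;
    }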
Time spent in implicit (compiler-generated)
or explicit (user-specified) OpenMP barrier synchronization. Note that
during measurement implicit barriers are treated similarly to explicit
ones. The instrumentation procedure replaces an implicit barrier with an
explicit barrier enclosed by the parallel construct. This is done by
adding a nowait clause and a barrier directive as the last statement of
the parallel construct. In cases where the implicit barrier cannot be
removed (i.e., at the end of a parallel region), the explicit barrier is
executed in front of the implicit one, which is then negligible because the
team is already synchronized when reaching it. The synthetic
explicit barrier appears as a special implicit barrier construct.
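A minimal sketch of the described transformation, hand-written here for a
simple worksharing loop (the actual rewriting is performed automatically by
the instrumentation):

    /* Original code: the worksharing loop ends with an implicit barrier. */
    void original(double *a, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for          /* implicit barrier at the end of the loop */
            for (int i = 0; i < n; ++i)
                a[i] *= 2.0;
        }                            /* implicit barrier at the end of the region */
    }

    /* Instrumented form: the implicit loop barrier is disabled with nowait
     * and replaced by a measurable explicit barrier as the last statement
     * of the parallel construct. */
    void instrumented(double *a, int n)
    {
        #pragma omp parallel
        {
            #pragma omp for nowait   /* implicit loop barrier removed */
            for (int i = 0; i < n; ++i)
                a[i] *= 2.0;
            #pragma omp barrier      /* synthetic explicit barrier */
        }                            /* remaining implicit barrier is negligible:
                                        the team is already synchronized */
    }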
This metric provides the total number of MPI synchronization operations
that were executed. This includes not only barrier calls, but also
communication operations that transfer no data (i.e., zero-sized messages,
which are considered to be used for synchronization).
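A minimal sketch of a zero-sized message transfer that would be counted as a
synchronization rather than a communication (assuming two MPI ranks):

    #include <mpi.h>

    int main(int argc, char **argv)
    {
        int rank, dummy = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* A message with count 0 transfers no data. */
        if (rank == 0)
            MPI_Send(&dummy, 0, MPI_INT, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1)
            MPI_Recv(&dummy, 0, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);

        MPI_Barrier(MPI_COMM_WORLD);   /* also counted as a synchronization */

        MPI_Finalize();
        return 0;
    }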
Provides the number of MPI collective synchronization operations. This
includes not only barrier calls, but also calls to collective communication
operations that neither send nor receive any data.
Provides the total number of Late Sender instances found in
communication operations where messages were sent in the wrong order (see
also Messages in Wrong Order).
Provides the total number of Late Sender instances found in
synchronization operations (i.e., zero-sized message transfers) where
messages are received in the wrong order (see also Messages in Wrong Order).
This simple heuristic helps to identify computational load imbalances and
is calculated for each (call path, process/thread) pair. Its value
represents the absolute difference from the average exclusive execution
time. This average value is the aggregated exclusive time spent by all
processes/threads in this call path, divided by the number of
processes/threads visiting it.
Note:
A high value for a collapsed call tree node does not necessarily mean that
there is a load imbalance in this particular node; the imbalance may also be
somewhere in the subtree underneath.
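As a sketch of the computation (the function name and parameter layout are
hypothetical), the value for one participant of a given call path could be
computed as follows:

    #include <math.h>

    /* Given the exclusive times t[0..n-1] of the n processes/threads
     * visiting a call path, the imbalance value of participant p is the
     * absolute difference from the average exclusive time. */
    double imbalance(const double *t, int n, int p)
    {
        double sum = 0.0;
        for (int i = 0; i < n; ++i)
            sum += t[i];
        return fabs(t[p] - sum / n);
    }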
This metric is provided as a convenience to identify processes/threads
where the exclusive execution time spent for a particular call tree node
was below the average value.
This metric is provided as a convenience to identify processes/threads
where the exclusive execution time spent for a particular call tree node
was above the average value.