# srun -N1 -n4 benchmarks/matrix_multiply.py --shm 2048 --device-mem 16000 --grid 48.48.48.128 --mpi 4.1.1.1 --accelerator-threads 8 --Ls 12
OPENMPI detected
AcceleratorCudaInit: using default device 
AcceleratorCudaInit: assume user either uses a) IBM jsrun, or 
AcceleratorCudaInit: b) invokes through a wrapping script to set CUDA_VISIBLE_DEVICES, UCX_NET_DEVICES, and numa binding 
AcceleratorCudaInit: Configure options --enable-summit, --enable-select-gpu=no 
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 4
SharedMemoryMpi:  Node  communicator of size 4
0SharedMemoryMpi:  SharedMemoryMPI.cc acceleratorAllocDevice 2147483648bytes at 0x148120000000 for comms buffers 

__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|_ |  |  |  |  |  |  |  |  |  |  |  | _|__
__|_                                    _|__
__|_   GGGG    RRRR    III    DDDD      _|__
__|_  G        R   R    I     D   D     _|__
__|_  G        R   R    I     D    D    _|__
__|_  G  GG    RRRR     I     D    D    _|__
__|_  G   G    R  R     I     D   D     _|__
__|_   GGGG    R   R   III    DDDD      _|__
__|_                                    _|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  


Copyright (C) 2015 Peter Boyle, Azusa Yamaguchi, Guido Cossu, Antonin Portelli and other authors

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
Current Grid git commit hash=9c9566b9c9686a63755f39bb8910cf5325ef7177: (HEAD -> feature/gpt, origin/feature/gpt, origin/HEAD) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 2147483648 byte stencil comms buffers 
Grid : Message : MemoryManager Cache 16777216000 bytes 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Non unified: Caching accelerator data in dedicated memory
Grid : Message : MemoryManager::Init() Using cudaMalloc

=============================================
              Initialized GPT                
     Copyright (C) 2020 Christoph Lehner     
=============================================
GPT :       1.589357 s : 
                       : Matrix Multiply Benchmark with
                       :     fdimensions  : [48, 48, 48, 128]
                       :     precision    : single
                       : 
GPT :      10.985099 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_color(3)
                       :     Time to complete            : 0.0058 s
                       :     Effective memory bandwidth  : 5271.36 GB/s
                       : 
GPT :      16.689329 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_spin(4)
                       :     Time to complete            : 0.01 s
                       :     Effective memory bandwidth  : 5333.21 GB/s
                       : 
GPT :      62.092583 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_spin_color(4,3)
                       :     Time to complete            : 0.097 s
                       :     Effective memory bandwidth  : 5057.37 GB/s
                       : 
GPT :      62.262581 s : 
                       : Matrix Multiply Benchmark with
                       :     fdimensions  : [48, 48, 48, 128]
                       :     precision    : double
                       : 
GPT :      72.003471 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_color(3)
                       :     Time to complete            : 0.012 s
                       :     Effective memory bandwidth  : 5264.01 GB/s
                       : 
GPT :      78.174681 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_spin(4)
                       :     Time to complete            : 0.02 s
                       :     Effective memory bandwidth  : 5439.91 GB/s
                       : 
GPT :     128.232979 s : 10 matrix_multiply
                       :     Object type                 : ot_matrix_spin_color(4,3)
                       :     Time to complete            : 0.22 s
                       :     Effective memory bandwidth  : 4416.45 GB/s
                       : 
=============================================
               Finalized GPT                 
=============================================