Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
OPENMPI detected
AcceleratorCudaInit: IBM Summit or similar - use default device
AcceleratorCudaInit: ================================================
SharedMemoryMpi: World communicator of size 6
SharedMemoryMpi: Node communicator of size 6
SharedMemoryMpi: SharedMemoryMPI.cc cudaMalloc 536870912bytes at 0x2000e0000000 for comms buffers
OPENMPI detected
OPENMPI detected
OPENMPI detected
OPENMPI detected

__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|_ |  |  |  |  |  |  |  |  |  |  |  | _|__
__|_                                    _|__
__|_   GGGG    RRRR    III    DDDD      _|__
__|_  G        R   R    I     D   D     _|__
__|_  G        R   R    I     D    D    _|__
__|_  G  GG    RRRR     I     D    D    _|__
__|_  G   G    R  R     I     D   D     _|__
__|_   GGGG    R   R   III    DDDD      _|__
__|_                                    _|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
  |  |  |  |  |  |  |  |  |  |  |  |  |  |

Copyright (C) 2015 Peter Boyle, Azusa Yamaguchi, Guido Cossu, Antonin Portelli and other authors

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, feature/gpt) clean

Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 536870912 byte stencil comms buffers
Grid : Message : MemoryManager Cache 4194304000 bytes
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 16
Grid : Message : MemoryManager::Init() Non unified: Caching accelerator data in dedicated memory
Grid : Message : MemoryManager::Init() Using cudaMalloc
Grid : Message : 5.177200 s : Grid Default Decomposition patterns
Grid : Message : 5.178600 s : OpenMP threads : 6
Grid : Message : 5.179600 s : MPI tasks : 6 1 1 1
Grid : Message : 5.181500 s : vRealF : 512bits ; 2 2 2 2
Grid : Message : 5.182900 s : vRealD : 512bits ; 1 2 2 2
Grid : Message : 5.184200 s : vComplexF : 512bits ; 1 2 2 2
Grid : Message : 5.185600 s : vComplexD : 512bits ; 1 1 2 2

=============================================
              Initialized GPT
     Copyright (C) 2020 Christoph Lehner
=============================================
OPENMPI detected
GPT :      56.903302 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [24, 12, 12, 12]
                       :     precision           : single
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 1
                       :
GPT :      58.464358 s : 1000 applications of block_project
                       :     Time to complete            : 1.15 s
                       :     Total performance           : 2174.60 GFlops/s
                       :     Effective memory bandwidth  : 2287.96 GB/s
                       :
GPT :      60.763032 s : 1000 applications of block_promote
                       :     Time to complete            : 2.29 s
                       :     Total performance           : 1105.64 GFlops/s
                       :     Effective memory bandwidth  : 1146.20 GB/s
                       :
GPT :      60.763208 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [24, 12, 12, 12]
                       :     precision           : single
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 4
                       :
GPT :      64.629973 s : 1000 applications of block_project
                       :     Time to complete            : 2.78 s
                       :     Total performance           : 3589.44 GFlops/s
                       :     Effective memory bandwidth  : 3776.55 GB/s
                       :
GPT :      69.946770 s : 1000 applications of block_promote
                       :     Time to complete            : 5.29 s
                       :     Total performance           : 1914.60 GFlops/s
                       :     Effective memory bandwidth  : 1984.84 GB/s
                       :
GPT :     120.521270 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [24, 12, 12, 12]
                       :     precision           : double
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 1
                       :
GPT :     123.825127 s : 1000 applications of block_project
                       :     Time to complete            : 2.92 s
                       :     Total performance           : 854.27 GFlops/s
                       :     Effective memory bandwidth  : 1797.59 GB/s
                       :
GPT :     129.277554 s : 1000 applications of block_promote
                       :     Time to complete            : 5.43 s
                       :     Total performance           : 466.74 GFlops/s
                       :     Effective memory bandwidth  : 967.73 GB/s
                       :
GPT :     129.277627 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [24, 12, 12, 12]
                       :     precision           : double
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 4
                       :
GPT :     135.707373 s : 1000 applications of block_project
                       :     Time to complete            : 5.13 s
                       :     Total performance           : 1947.24 GFlops/s
                       :     Effective memory bandwidth  : 4097.49 GB/s
                       :
GPT :     151.025009 s : 1000 applications of block_promote
                       :     Time to complete            : 15.24 s
                       :     Total performance           : 664.55 GFlops/s
                       :     Effective memory bandwidth  : 1377.86 GB/s
                       :
=============================================
               Finalized GPT
=============================================

Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
Warning: OMP_NUM_THREADS=6 is greater than available PU's
OPENMPI detected
AcceleratorCudaInit: IBM Summit or similar - use default device
AcceleratorCudaInit: ================================================
SharedMemoryMpi: World communicator of size 6
SharedMemoryMpi: Node communicator of size 6
SharedMemoryMpi: SharedMemoryMPI.cc cudaMalloc 536870912bytes at 0x2000e0000000 for comms buffers
OPENMPI detected
OPENMPI detected
OPENMPI detected
Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, feature/gpt) clean

Grid : Message : ================================================
Grid : Message : MPI is initialised and logging filters activated
Grid : Message : ================================================
Grid : Message : Requested 536870912 byte stencil comms buffers
Grid : Message : MemoryManager Cache 4194304000 bytes
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 16
Grid : Message : MemoryManager::Init() Non unified: Caching accelerator data in dedicated memory
Grid : Message : MemoryManager::Init() Using cudaMalloc
Grid : Message : 4.979664 s : Grid Default Decomposition patterns
Grid : Message : 4.979672 s : OpenMP threads : 6
Grid : Message : 4.979682 s : MPI tasks : 6 1 1 1
Grid : Message : 4.979700 s : vRealF : 512bits ; 2 2 2 2
Grid : Message : 4.979713 s : vRealD : 512bits ; 1 2 2 2
Grid : Message : 4.979726 s : vComplexF : 512bits ; 1 2 2 2
Grid : Message : 4.979739 s : vComplexD : 512bits ; 1 1 2 2

=============================================
              Initialized GPT
     Copyright (C) 2020 Christoph Lehner
=============================================
OPENMPI detected
OPENMPI detected
GPT :      56.855742 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [12, 6, 6, 6]
                       :     precision           : single
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 1
                       :
GPT :      60.958615 s : 1000 applications of block_project
                       :     Time to complete            : 4.01 s
                       :     Total performance           : 621.65 GFlops/s
                       :     Effective memory bandwidth  : 650.95 GB/s
                       :
GPT :      63.279851 s : 1000 applications of block_promote
                       :     Time to complete            : 2.31 s
                       :     Total performance           : 1096.59 GFlops/s
                       :     Effective memory bandwidth  : 1131.43 GB/s
                       :
GPT :      63.280025 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [12, 6, 6, 6]
                       :     precision           : single
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 4
                       :
GPT :      67.588090 s : 1000 applications of block_project
                       :     Time to complete            : 4.19 s
                       :     Total performance           : 2384.49 GFlops/s
                       :     Effective memory bandwidth  : 2496.90 GB/s
                       :
GPT :      72.630921 s : 1000 applications of block_promote
                       :     Time to complete            : 5.02 s
                       :     Total performance           : 2018.48 GFlops/s
                       :     Effective memory bandwidth  : 2082.62 GB/s
                       :
GPT :     121.630652 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [12, 6, 6, 6]
                       :     precision           : double
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 1
                       :
GPT :     126.780396 s : 1000 applications of block_project
                       :     Time to complete            : 5.02 s
                       :     Total performance           : 497.48 GFlops/s
                       :     Effective memory bandwidth  : 1041.87 GB/s
                       :
GPT :     132.272468 s : 1000 applications of block_promote
                       :     Time to complete            : 5.46 s
                       :     Total performance           : 463.37 GFlops/s
                       :     Effective memory bandwidth  : 956.18 GB/s
                       :
GPT :     132.272586 s :
                       : Lookup Table Benchmark with
                       :     fine fdimensions    : [48, 24, 24, 24]
                       :     coarse fdimensions  : [12, 6, 6, 6]
                       :     precision           : double
                       :     nbasis              : 40
                       :     basis_n_block       : 8
                       :     nvec                : 4
                       :
GPT :     138.361205 s : 1000 applications of block_project
                       :     Time to complete            : 5.57 s
                       :     Total performance           : 1791.29 GFlops/s
                       :     Effective memory bandwidth  : 3751.48 GB/s
                       :
GPT :     151.964885 s : 1000 applications of block_promote
                       :     Time to complete            : 13.54 s
                       :     Total performance           : 748.27 GFlops/s
                       :     Effective memory bandwidth  : 1544.10 GB/s
                       :
=============================================
               Finalized GPT
=============================================