AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|_ |  |  |  |  |  |  |  |  |  |  |  | _|__
__|_                                    _|__
__|_   GGGG    RRRR    III    DDDD      _|__
__|_  G        R   R    I     D   D     _|__
__|_  G        R   R    I     D    D    _|__
__|_  G  GG    RRRR     I     D    D    _|__
__|_  G   G    R  R     I     D   D     _|__
__|_   GGGG    R   R   III    DDDD      _|__
__|_                                    _|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
__|__|__|__|__|__|__|__|__|__|__|__|__|__|__
  |  |  |  |  |  |  |  |  |  |  |  |  |  |  


Copyright (C) 2015 Peter Boyle, Azusa Yamaguchi, Guido Cossu, Antonin Portelli and other authors

This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; either version 2 of the License, or
(at your option) any later version.

This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
GNU General Public License for more details.
Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 2.134501 s : Grid Default Decomposition patterns
Grid : Message : 2.134505 s : 	OpenMP threads : 160
Grid : Message : 2.134516 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 2.134530 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 2.134535 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 2.134539 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 2.134544 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 2.206906 s : Lookup Table Benchmark with
Grid : Message : 2.206916 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 2.206924 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 2.206932 s :     precision           : single
Grid : Message : 2.206938 s :     nbasis              : 10

Grid : Message : 2.264911 s : Recalculation of coarsening lookup table finished
Grid : Message : 4.163553 s : 1000 applications of vectorizableBlockProject
Grid : Message : 4.163561 s :     Time to complete            : 1.59647 s
Grid : Message : 4.163629 s :     Total performance           : 2.41173 GFlops/s
Grid : Message : 4.163634 s :     Effective memory bandwidth  : 2.72217 GB/s

Grid : Message : 4.163643 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 4.163646 s :     Time to complete            : 0.123358 s
Grid : Message : 4.163650 s :     Total performance           : 31.2119 GFlops/s
Grid : Message : 4.163656 s :     Effective memory bandwidth  : 35.2296 GB/s

Grid : Message : 4.163665 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 4.163668 s :     Time to complete            : 0.166449 s
Grid : Message : 4.163672 s :     Total performance           : 23.1316 GFlops/s
Grid : Message : 4.163676 s :     Effective memory bandwidth  : 26.1092 GB/s

Grid : Message : 4.165261 s : Lookup Table Benchmark with
Grid : Message : 4.165266 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 4.165270 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 4.165274 s :     precision           : double
Grid : Message : 4.165277 s :     nbasis              : 10

Grid : Message : 4.183408 s : Recalculation of coarsening lookup table finished
Grid : Message : 6.144626 s : 1000 applications of vectorizableBlockProject
Grid : Message : 6.144634 s :     Time to complete            : 1.51112 s
Grid : Message : 6.144645 s :     Total performance           : 2.54793 GFlops/s
Grid : Message : 6.144649 s :     Effective memory bandwidth  : 5.75182 GB/s

Grid : Message : 6.144659 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 6.144662 s :     Time to complete            : 0.135462 s
Grid : Message : 6.144666 s :     Total performance           : 28.423 GFlops/s
Grid : Message : 6.144670 s :     Effective memory bandwidth  : 64.1635 GB/s

Grid : Message : 6.144679 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 6.144682 s :     Time to complete            : 0.302332 s
Grid : Message : 6.144686 s :     Total performance           : 12.7351 GFlops/s
Grid : Message : 6.144690 s :     Effective memory bandwidth  : 28.7489 GB/s

Grid : Message : 0.549846 s : Grid Default Decomposition patterns
Grid : Message : 0.549850 s : 	OpenMP threads : 160
Grid : Message : 0.549861 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.549875 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.549880 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.549885 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.549890 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.555747 s : Lookup Table Benchmark with
Grid : Message : 0.555753 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 0.555758 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.555762 s :     precision           : single
Grid : Message : 0.555765 s :     nbasis              : 20

Grid : Message : 0.578807 s : Recalculation of coarsening lookup table finished
Grid : Message : 4.110849 s : 1000 applications of vectorizableBlockProject
Grid : Message : 4.110857 s :     Time to complete            : 3.11977 s
Grid : Message : 4.110899 s :     Total performance           : 2.46829 GFlops/s
Grid : Message : 4.110903 s :     Effective memory bandwidth  : 2.65997 GB/s

Grid : Message : 4.110912 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 4.110915 s :     Time to complete            : 0.167337 s
Grid : Message : 4.110919 s :     Total performance           : 46.0178 GFlops/s
Grid : Message : 4.110925 s :     Effective memory bandwidth  : 49.5915 GB/s

Grid : Message : 4.110934 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 4.110937 s :     Time to complete            : 0.223631 s
Grid : Message : 4.110941 s :     Total performance           : 34.4339 GFlops/s
Grid : Message : 4.110945 s :     Effective memory bandwidth  : 37.108 GB/s

Grid : Message : 4.112810 s : Lookup Table Benchmark with
Grid : Message : 4.112815 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 4.112819 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 4.112823 s :     precision           : double
Grid : Message : 4.112826 s :     nbasis              : 20

Grid : Message : 4.139416 s : Recalculation of coarsening lookup table finished
Grid : Message : 7.756987 s : 1000 applications of vectorizableBlockProject
Grid : Message : 7.756996 s :     Time to complete            : 3.01101 s
Grid : Message : 7.757008 s :     Total performance           : 2.55744 GFlops/s
Grid : Message : 7.757012 s :     Effective memory bandwidth  : 5.51209 GB/s

Grid : Message : 7.757022 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 7.757025 s :     Time to complete            : 0.254674 s
Grid : Message : 7.757029 s :     Total performance           : 30.2366 GFlops/s
Grid : Message : 7.757033 s :     Effective memory bandwidth  : 65.1696 GB/s

Grid : Message : 7.757042 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 7.757045 s :     Time to complete            : 0.331229 s
Grid : Message : 7.757049 s :     Total performance           : 23.2482 GFlops/s
Grid : Message : 7.757053 s :     Effective memory bandwidth  : 50.1073 GB/s

Grid : Message : 0.527087 s : Grid Default Decomposition patterns
Grid : Message : 0.527091 s : 	OpenMP threads : 160
Grid : Message : 0.527102 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.527116 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.527121 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.527126 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.527130 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.532970 s : Lookup Table Benchmark with
Grid : Message : 0.532977 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 0.532981 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.532985 s :     precision           : single
Grid : Message : 0.532988 s :     nbasis              : 30

Grid : Message : 0.561198 s : Recalculation of coarsening lookup table finished
Grid : Message : 5.682362 s : 1000 applications of vectorizableBlockProject
Grid : Message : 5.682370 s :     Time to complete            : 4.61071 s
Grid : Message : 5.682413 s :     Total performance           : 2.50519 GFlops/s
Grid : Message : 5.682417 s :     Effective memory bandwidth  : 2.6571 GB/s

Grid : Message : 5.682426 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 5.682429 s :     Time to complete            : 0.116899 s
Grid : Message : 5.682433 s :     Total performance           : 98.8094 GFlops/s
Grid : Message : 5.682439 s :     Effective memory bandwidth  : 104.801 GB/s

Grid : Message : 5.682448 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 5.682451 s :     Time to complete            : 0.365688 s
Grid : Message : 5.682455 s :     Total performance           : 31.5863 GFlops/s
Grid : Message : 5.682459 s :     Effective memory bandwidth  : 33.5016 GB/s

Grid : Message : 5.684562 s : Lookup Table Benchmark with
Grid : Message : 5.684567 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 5.684571 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 5.684575 s :     precision           : double
Grid : Message : 5.684578 s :     nbasis              : 30

Grid : Message : 5.720337 s : Recalculation of coarsening lookup table finished
Grid : Message : 10.970638 s : 1000 applications of vectorizableBlockProject
Grid : Message : 10.970646 s :     Time to complete            : 4.4878 s
Grid : Message : 10.970659 s :     Total performance           : 2.57381 GFlops/s
Grid : Message : 10.970663 s :     Effective memory bandwidth  : 5.45975 GB/s

Grid : Message : 10.970673 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 10.970676 s :     Time to complete            : 0.14566 s
Grid : Message : 10.970681 s :     Total performance           : 79.2992 GFlops/s
Grid : Message : 10.970685 s :     Effective memory bandwidth  : 168.216 GB/s

Grid : Message : 10.970694 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 10.970697 s :     Time to complete            : 0.585972 s
Grid : Message : 10.970701 s :     Total performance           : 19.7121 GFlops/s
Grid : Message : 10.970706 s :     Effective memory bandwidth  : 41.8147 GB/s

Grid : Message : 0.528020 s : Grid Default Decomposition patterns
Grid : Message : 0.528024 s : 	OpenMP threads : 160
Grid : Message : 0.528036 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528049 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528054 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528059 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528063 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534033 s : Lookup Table Benchmark with
Grid : Message : 0.534040 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 0.534044 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.534048 s :     precision           : single
Grid : Message : 0.534051 s :     nbasis              : 40

Grid : Message : 0.570276 s : Recalculation of coarsening lookup table finished
Grid : Message : 7.112360 s : 1000 applications of vectorizableBlockProject
Grid : Message : 7.112369 s :     Time to complete            : 6.15402 s
Grid : Message : 7.112414 s :     Total performance           : 2.50258 GFlops/s
Grid : Message : 7.112419 s :     Effective memory bandwidth  : 2.63304 GB/s

Grid : Message : 7.112429 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 7.112432 s :     Time to complete            : 0.110743 s
Grid : Message : 7.112437 s :     Total performance           : 139.069 GFlops/s
Grid : Message : 7.112443 s :     Effective memory bandwidth  : 146.319 GB/s

Grid : Message : 7.112453 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 7.112456 s :     Time to complete            : 0.241412 s
Grid : Message : 7.112460 s :     Total performance           : 63.7953 GFlops/s
Grid : Message : 7.112464 s :     Effective memory bandwidth  : 67.1208 GB/s

Grid : Message : 7.115076 s : Lookup Table Benchmark with
Grid : Message : 7.115081 s :     fine fdimensions    : [8 8 8 8 ]
Grid : Message : 7.115086 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 7.115090 s :     precision           : double
Grid : Message : 7.115093 s :     nbasis              : 40

Grid : Message : 7.162208 s : Recalculation of coarsening lookup table finished
Grid : Message : 13.717407 s : 1000 applications of vectorizableBlockProject
Grid : Message : 13.717417 s :     Time to complete            : 5.99698 s
Grid : Message : 13.717429 s :     Total performance           : 2.56812 GFlops/s
Grid : Message : 13.717433 s :     Effective memory bandwidth  : 5.40398 GB/s

Grid : Message : 13.717444 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 13.717447 s :     Time to complete            : 0.132424 s
Grid : Message : 13.717451 s :     Total performance           : 116.3 GFlops/s
Grid : Message : 13.717455 s :     Effective memory bandwidth  : 244.726 GB/s

Grid : Message : 13.717464 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 13.717467 s :     Time to complete            : 0.343933 s
Grid : Message : 13.717471 s :     Total performance           : 44.779 GFlops/s
Grid : Message : 13.717475 s :     Effective memory bandwidth  : 94.2264 GB/s

Grid : Message : 0.528236 s : Grid Default Decomposition patterns
Grid : Message : 0.528241 s : 	OpenMP threads : 160
Grid : Message : 0.528252 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528266 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528271 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528276 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528281 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534071 s : Lookup Table Benchmark with
Grid : Message : 0.534077 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 0.534081 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.534085 s :     precision           : single
Grid : Message : 0.534088 s :     nbasis              : 10

Grid : Message : 0.572611 s : Recalculation of coarsening lookup table finished
Grid : Message : 4.381671 s : 1000 applications of vectorizableBlockProject
Grid : Message : 4.381681 s :     Time to complete            : 2.92585 s
Grid : Message : 4.381729 s :     Total performance           : 6.66193 GFlops/s
Grid : Message : 4.381733 s :     Effective memory bandwidth  : 7.49104 GB/s

Grid : Message : 4.381744 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 4.381747 s :     Time to complete            : 0.33272 s
Grid : Message : 4.381751 s :     Total performance           : 58.5833 GFlops/s
Grid : Message : 4.381757 s :     Effective memory bandwidth  : 65.8743 GB/s

Grid : Message : 4.381766 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 4.381769 s :     Time to complete            : 0.528615 s
Grid : Message : 4.381773 s :     Total performance           : 36.8734 GFlops/s
Grid : Message : 4.381777 s :     Effective memory bandwidth  : 41.4625 GB/s

Grid : Message : 4.385745 s : Lookup Table Benchmark with
Grid : Message : 4.385750 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 4.385754 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 4.385758 s :     precision           : double
Grid : Message : 4.385761 s :     nbasis              : 10

Grid : Message : 4.420859 s : Recalculation of coarsening lookup table finished
Grid : Message : 8.158726 s : 1000 applications of vectorizableBlockProject
Grid : Message : 8.158735 s :     Time to complete            : 2.77043 s
Grid : Message : 8.158747 s :     Total performance           : 7.03568 GFlops/s
Grid : Message : 8.158752 s :     Effective memory bandwidth  : 15.8226 GB/s

Grid : Message : 8.158762 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 8.158765 s :     Time to complete            : 0.363687 s
Grid : Message : 8.158769 s :     Total performance           : 53.5951 GFlops/s
Grid : Message : 8.158773 s :     Effective memory bandwidth  : 120.531 GB/s

Grid : Message : 8.158782 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 8.158785 s :     Time to complete            : 0.580393 s
Grid : Message : 8.158789 s :     Total performance           : 33.5839 GFlops/s
Grid : Message : 8.158793 s :     Effective memory bandwidth  : 75.5271 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.528416 s : Grid Default Decomposition patterns
Grid : Message : 0.528420 s : 	OpenMP threads : 160
Grid : Message : 0.528431 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528450 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528455 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528460 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528465 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534298 s : Lookup Table Benchmark with
Grid : Message : 0.534304 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 0.534308 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.534312 s :     precision           : single
Grid : Message : 0.534315 s :     nbasis              : 20

Grid : Message : 0.594369 s : Recalculation of coarsening lookup table finished
Grid : Message : 7.360992 s : 1000 applications of vectorizableBlockProject
Grid : Message : 7.361002 s :     Time to complete            : 5.76043 s
Grid : Message : 7.361050 s :     Total performance           : 6.7675 GFlops/s
Grid : Message : 7.361055 s :     Effective memory bandwidth  : 7.26417 GB/s

Grid : Message : 7.361066 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 7.361069 s :     Time to complete            : 0.383309 s
Grid : Message : 7.361073 s :     Total performance           : 101.703 GFlops/s
Grid : Message : 7.361077 s :     Effective memory bandwidth  : 109.167 GB/s

Grid : Message : 7.361086 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 7.361089 s :     Time to complete            : 0.584442 s
Grid : Message : 7.361093 s :     Total performance           : 66.7024 GFlops/s
Grid : Message : 7.361097 s :     Effective memory bandwidth  : 71.5978 GB/s

Grid : Message : 7.365545 s : Lookup Table Benchmark with
Grid : Message : 7.365550 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 7.365554 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 7.365558 s :     precision           : double
Grid : Message : 7.365561 s :     nbasis              : 20

Grid : Message : 7.425598 s : Recalculation of coarsening lookup table finished
Grid : Message : 14.514070 s : 1000 applications of vectorizableBlockProject
Grid : Message : 14.514270 s :     Time to complete            : 5.49693 s
Grid : Message : 14.514410 s :     Total performance           : 7.0919 GFlops/s
Grid : Message : 14.514450 s :     Effective memory bandwidth  : 15.2248 GB/s

Grid : Message : 14.514550 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 14.514580 s :     Time to complete            : 0.416631 s
Grid : Message : 14.514620 s :     Total performance           : 93.5688 GFlops/s
Grid : Message : 14.514660 s :     Effective memory bandwidth  : 200.872 GB/s

Grid : Message : 14.514750 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 14.514780 s :     Time to complete            : 0.66994 s
Grid : Message : 14.514820 s :     Total performance           : 58.1898 GFlops/s
Grid : Message : 14.514860 s :     Effective memory bandwidth  : 124.921 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.527620 s : Grid Default Decomposition patterns
Grid : Message : 0.527624 s : 	OpenMP threads : 160
Grid : Message : 0.527635 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.527649 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.527654 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.527658 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.527663 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.533426 s : Lookup Table Benchmark with
Grid : Message : 0.533432 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 0.533436 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.533440 s :     precision           : single
Grid : Message : 0.533443 s :     nbasis              : 30

Grid : Message : 0.613796 s : Recalculation of coarsening lookup table finished
Grid : Message : 10.368039 s : 1000 applications of vectorizableBlockProject
Grid : Message : 10.368048 s :     Time to complete            : 8.69091 s
Grid : Message : 10.368098 s :     Total performance           : 6.72835 GFlops/s
Grid : Message : 10.368103 s :     Effective memory bandwidth  : 7.10763 GB/s

Grid : Message : 10.368112 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 10.368115 s :     Time to complete            : 0.408735 s
Grid : Message : 10.368119 s :     Total performance           : 143.065 GFlops/s
Grid : Message : 10.368123 s :     Effective memory bandwidth  : 151.129 GB/s

Grid : Message : 10.368132 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 10.368135 s :     Time to complete            : 0.599906 s
Grid : Message : 10.368139 s :     Total performance           : 97.4745 GFlops/s
Grid : Message : 10.368143 s :     Effective memory bandwidth  : 102.969 GB/s

Grid : Message : 10.373064 s : Lookup Table Benchmark with
Grid : Message : 10.373069 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 10.373073 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 10.373077 s :     precision           : double
Grid : Message : 10.373080 s :     nbasis              : 30

Grid : Message : 10.453943 s : Recalculation of coarsening lookup table finished
Grid : Message : 19.972811 s : 1000 applications of vectorizableBlockProject
Grid : Message : 19.972821 s :     Time to complete            : 8.20295 s
Grid : Message : 19.972834 s :     Total performance           : 7.12859 GFlops/s
Grid : Message : 19.972839 s :     Effective memory bandwidth  : 15.0609 GB/s

Grid : Message : 19.972849 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 19.972852 s :     Time to complete            : 0.46821 s
Grid : Message : 19.972856 s :     Total performance           : 124.892 GFlops/s
Grid : Message : 19.972860 s :     Effective memory bandwidth  : 263.864 GB/s

Grid : Message : 19.972869 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 19.972872 s :     Time to complete            : 0.789822 s
Grid : Message : 19.972876 s :     Total performance           : 74.0363 GFlops/s
Grid : Message : 19.972880 s :     Effective memory bandwidth  : 156.419 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.532334 s : Grid Default Decomposition patterns
Grid : Message : 0.532338 s : 	OpenMP threads : 160
Grid : Message : 0.532349 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.532363 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.532369 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.532374 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.532379 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.538245 s : Lookup Table Benchmark with
Grid : Message : 0.538251 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 0.538255 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.538259 s :     precision           : single
Grid : Message : 0.538262 s :     nbasis              : 40

Grid : Message : 0.644155 s : Recalculation of coarsening lookup table finished
Grid : Message : 13.405136 s : 1000 applications of vectorizableBlockProject
Grid : Message : 13.405147 s :     Time to complete            : 11.622 s
Grid : Message : 13.405197 s :     Total performance           : 6.70858 GFlops/s
Grid : Message : 13.405202 s :     Effective memory bandwidth  : 7.02965 GB/s

Grid : Message : 13.405211 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 13.405214 s :     Time to complete            : 0.429711 s
Grid : Message : 13.405218 s :     Total performance           : 181.441 GFlops/s
Grid : Message : 13.405222 s :     Effective memory bandwidth  : 190.125 GB/s

Grid : Message : 13.405232 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 13.405235 s :     Time to complete            : 0.638133 s
Grid : Message : 13.405239 s :     Total performance           : 122.18 GFlops/s
Grid : Message : 13.405243 s :     Effective memory bandwidth  : 128.028 GB/s

Grid : Message : 13.410621 s : Lookup Table Benchmark with
Grid : Message : 13.410626 s :     fine fdimensions    : [12 12 12 12 ]
Grid : Message : 13.410630 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 13.410634 s :     precision           : double
Grid : Message : 13.410637 s :     nbasis              : 40

Grid : Message : 13.521578 s : Recalculation of coarsening lookup table finished
Grid : Message : 26.763960 s : 1000 applications of vectorizableBlockProject
Grid : Message : 26.764160 s :     Time to complete            : 10.9709 s
Grid : Message : 26.764300 s :     Total performance           : 7.10672 GFlops/s
Grid : Message : 26.764340 s :     Effective memory bandwidth  : 14.8937 GB/s

Grid : Message : 26.764430 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 26.764460 s :     Time to complete            : 0.551347 s
Grid : Message : 26.764500 s :     Total performance           : 141.413 GFlops/s
Grid : Message : 26.764540 s :     Effective memory bandwidth  : 296.361 GB/s

Grid : Message : 26.764630 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 26.764660 s :     Time to complete            : 0.953862 s
Grid : Message : 26.764700 s :     Total performance           : 81.7386 GFlops/s
Grid : Message : 26.764740 s :     Effective memory bandwidth  : 171.301 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.550751 s : Grid Default Decomposition patterns
Grid : Message : 0.550755 s : 	OpenMP threads : 160
Grid : Message : 0.550766 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.550779 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.550784 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.550789 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.550794 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.556660 s : Lookup Table Benchmark with
Grid : Message : 0.556666 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 0.556670 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.556674 s :     precision           : single
Grid : Message : 0.556677 s :     nbasis              : 10

Grid : Message : 0.621101 s : Recalculation of coarsening lookup table finished
Grid : Message : 10.321362 s : 1000 applications of vectorizableBlockProject
Grid : Message : 10.321372 s :     Time to complete            : 7.34201 s
Grid : Message : 10.321419 s :     Total performance           : 8.3906 GFlops/s
Grid : Message : 10.321426 s :     Effective memory bandwidth  : 9.42883 GB/s

Grid : Message : 10.321435 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 10.321438 s :     Time to complete            : 0.860354 s
Grid : Message : 10.321442 s :     Total performance           : 71.6029 GFlops/s
Grid : Message : 10.321446 s :     Effective memory bandwidth  : 80.4628 GB/s

Grid : Message : 10.321455 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 10.321458 s :     Time to complete            : 1.44298 s
Grid : Message : 10.321462 s :     Total performance           : 42.6922 GFlops/s
Grid : Message : 10.321466 s :     Effective memory bandwidth  : 47.9747 GB/s

Grid : Message : 10.331815 s : Lookup Table Benchmark with
Grid : Message : 10.331822 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 10.331826 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 10.331830 s :     precision           : double
Grid : Message : 10.331833 s :     nbasis              : 10

Grid : Message : 10.402176 s : Recalculation of coarsening lookup table finished
Grid : Message : 19.914505 s : 1000 applications of vectorizableBlockProject
Grid : Message : 19.914515 s :     Time to complete            : 6.82731 s
Grid : Message : 19.914527 s :     Total performance           : 9.02315 GFlops/s
Grid : Message : 19.914532 s :     Effective memory bandwidth  : 20.2793 GB/s

Grid : Message : 19.914542 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 19.914545 s :     Time to complete            : 1.01398 s
Grid : Message : 19.914549 s :     Total performance           : 60.7546 GFlops/s
Grid : Message : 19.914553 s :     Effective memory bandwidth  : 136.544 GB/s

Grid : Message : 19.914562 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 19.914565 s :     Time to complete            : 1.61102 s
Grid : Message : 19.914569 s :     Total performance           : 38.2391 GFlops/s
Grid : Message : 19.914573 s :     Effective memory bandwidth  : 85.9415 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.528891 s : Grid Default Decomposition patterns
Grid : Message : 0.528895 s : 	OpenMP threads : 160
Grid : Message : 0.528906 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528920 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528925 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528930 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528935 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534717 s : Lookup Table Benchmark with
Grid : Message : 0.534723 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 0.534727 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.534731 s :     precision           : single
Grid : Message : 0.534734 s :     nbasis              : 20

Grid : Message : 0.641652 s : Recalculation of coarsening lookup table finished
Grid : Message : 17.439411 s : 1000 applications of vectorizableBlockProject
Grid : Message : 17.439421 s :     Time to complete            : 14.2399 s
Grid : Message : 17.439470 s :     Total performance           : 8.65226 GFlops/s
Grid : Message : 17.439474 s :     Effective memory bandwidth  : 9.28105 GB/s

Grid : Message : 17.439483 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 17.439486 s :     Time to complete            : 0.936529 s
Grid : Message : 17.439490 s :     Total performance           : 131.558 GFlops/s
Grid : Message : 17.439494 s :     Effective memory bandwidth  : 141.118 GB/s

Grid : Message : 17.439503 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 17.439506 s :     Time to complete            : 1.52502 s
Grid : Message : 17.439509 s :     Total performance           : 80.7907 GFlops/s
Grid : Message : 17.439513 s :     Effective memory bandwidth  : 86.662 GB/s

Grid : Message : 17.451099 s : Lookup Table Benchmark with
Grid : Message : 17.451107 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 17.451112 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 17.451116 s :     precision           : double
Grid : Message : 17.451119 s :     nbasis              : 20

Grid : Message : 17.567519 s : Recalculation of coarsening lookup table finished
Grid : Message : 33.841252 s : 1000 applications of vectorizableBlockProject
Grid : Message : 33.841262 s :     Time to complete            : 13.2175 s
Grid : Message : 33.841277 s :     Total performance           : 9.32155 GFlops/s
Grid : Message : 33.841281 s :     Effective memory bandwidth  : 19.9979 GB/s

Grid : Message : 33.841291 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 33.841294 s :     Time to complete            : 1.10978 s
Grid : Message : 33.841298 s :     Total performance           : 111.02 GFlops/s
Grid : Message : 33.841302 s :     Effective memory bandwidth  : 238.176 GB/s

Grid : Message : 33.841311 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 33.841314 s :     Time to complete            : 1.84249 s
Grid : Message : 33.841317 s :     Total performance           : 66.8701 GFlops/s
Grid : Message : 33.841321 s :     Effective memory bandwidth  : 143.459 GB/s

Grid : Message : 0.530969 s : Grid Default Decomposition patterns
Grid : Message : 0.530973 s : 	OpenMP threads : 160
Grid : Message : 0.530984 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.530998 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.531003 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.531008 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.531013 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.536822 s : Lookup Table Benchmark with
Grid : Message : 0.536828 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 0.536832 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.536836 s :     precision           : single
Grid : Message : 0.536839 s :     nbasis              : 30

Grid : Message : 0.683678 s : Recalculation of coarsening lookup table finished
Grid : Message : 24.622057 s : 1000 applications of vectorizableBlockProject
Grid : Message : 24.622067 s :     Time to complete            : 21.1702 s
Grid : Message : 24.622119 s :     Total performance           : 8.72979 GFlops/s
Grid : Message : 24.622124 s :     Effective memory bandwidth  : 9.21561 GB/s

Grid : Message : 24.622133 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 24.622136 s :     Time to complete            : 1.00913 s
Grid : Message : 24.622140 s :     Total performance           : 183.139 GFlops/s
Grid : Message : 24.622144 s :     Effective memory bandwidth  : 193.331 GB/s

Grid : Message : 24.622153 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 24.622156 s :     Time to complete            : 1.62104 s
Grid : Message : 24.622160 s :     Total performance           : 114.008 GFlops/s
Grid : Message : 24.622164 s :     Effective memory bandwidth  : 120.353 GB/s

Grid : Message : 24.635431 s : Lookup Table Benchmark with
Grid : Message : 24.635439 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 24.635443 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 24.635447 s :     precision           : double
Grid : Message : 24.635450 s :     nbasis              : 30

Grid : Message : 24.796667 s : Recalculation of coarsening lookup table finished
Grid : Message : 48.341591 s : 1000 applications of vectorizableBlockProject
Grid : Message : 48.341601 s :     Time to complete            : 19.7944 s
Grid : Message : 48.341615 s :     Total performance           : 9.33657 GFlops/s
Grid : Message : 48.341619 s :     Effective memory bandwidth  : 19.7123 GB/s

Grid : Message : 48.341628 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 48.341631 s :     Time to complete            : 1.29304 s
Grid : Message : 48.341635 s :     Total performance           : 142.928 GFlops/s
Grid : Message : 48.341639 s :     Effective memory bandwidth  : 301.764 GB/s

Grid : Message : 48.341648 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 48.341651 s :     Time to complete            : 2.30659 s
Grid : Message : 48.341655 s :     Total performance           : 80.1234 GFlops/s
Grid : Message : 48.341659 s :     Effective memory bandwidth  : 169.165 GB/s

Grid : Message : 0.530638 s : Grid Default Decomposition patterns
Grid : Message : 0.530642 s : 	OpenMP threads : 160
Grid : Message : 0.530653 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.530668 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.530673 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.530678 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.530683 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.536563 s : Lookup Table Benchmark with
Grid : Message : 0.536570 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 0.536574 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 0.536578 s :     precision           : single
Grid : Message : 0.536581 s :     nbasis              : 40

Grid : Message : 0.720998 s : Recalculation of coarsening lookup table finished
Grid : Message : 32.380810 s : 1000 applications of vectorizableBlockProject
Grid : Message : 32.381020 s :     Time to complete            : 28.3322 s
Grid : Message : 32.381460 s :     Total performance           : 8.69736 GFlops/s
Grid : Message : 32.381510 s :     Effective memory bandwidth  : 9.10737 GB/s

Grid : Message : 32.381600 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 32.381630 s :     Time to complete            : 1.08206 s
Grid : Message : 32.381670 s :     Total performance           : 227.728 GFlops/s
Grid : Message : 32.381710 s :     Effective memory bandwidth  : 238.464 GB/s

Grid : Message : 32.381800 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 32.381830 s :     Time to complete            : 1.72274 s
Grid : Message : 32.381870 s :     Total performance           : 143.037 GFlops/s
Grid : Message : 32.381910 s :     Effective memory bandwidth  : 149.779 GB/s

Grid : Message : 32.525900 s : Lookup Table Benchmark with
Grid : Message : 32.525990 s :     fine fdimensions    : [16 16 16 16 ]
Grid : Message : 32.526040 s :     coarse fdimensions  : [4 4 4 4 ]
Grid : Message : 32.526080 s :     precision           : double
Grid : Message : 32.526110 s :     nbasis              : 40

Grid : Message : 32.261754 s : Recalculation of coarsening lookup table finished
Grid : Message : 63.289160 s : 1000 applications of vectorizableBlockProject
Grid : Message : 63.289280 s :     Time to complete            : 26.3558 s
Grid : Message : 63.289410 s :     Total performance           : 9.34958 GFlops/s
Grid : Message : 63.289460 s :     Effective memory bandwidth  : 19.5807 GB/s

Grid : Message : 63.289550 s : 1000 applications of vectorizableBlockProjectUsingLut
Grid : Message : 63.289580 s :     Time to complete            : 1.47054 s
Grid : Message : 63.289620 s :     Total performance           : 167.567 GFlops/s
Grid : Message : 63.289660 s :     Effective memory bandwidth  : 350.933 GB/s

Grid : Message : 63.289750 s : 1000 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 63.289780 s :     Time to complete            : 2.74166 s
Grid : Message : 63.289810 s :     Total performance           : 89.8782 GFlops/s
Grid : Message : 63.289850 s :     Effective memory bandwidth  : 188.23 GB/s

Grid : Message : 0.529055 s : Grid Default Decomposition patterns
Grid : Message : 0.529059 s : 	OpenMP threads : 160
Grid : Message : 0.529071 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.529085 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.529090 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.529095 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.529099 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534923 s : Lookup Table Benchmark with
Grid : Message : 0.534929 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 0.534933 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 0.534937 s :     precision           : single
Grid : Message : 0.534940 s :     nbasis              : 10

Grid : Message : 0.772430 s : Recalculation of coarsening lookup table finished
Grid : Message : 6.665208 s : 500 applications of vectorizableBlockProject
Grid : Message : 6.665219 s :     Time to complete            : 4.18064 s
Grid : Message : 6.665273 s :     Total performance           : 37.2993 GFlops/s
Grid : Message : 6.665280 s :     Effective memory bandwidth  : 41.9146 GB/s

Grid : Message : 6.665290 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 6.665293 s :     Time to complete            : 0.603953 s
Grid : Message : 6.665297 s :     Total performance           : 258.19 GFlops/s
Grid : Message : 6.665301 s :     Effective memory bandwidth  : 290.138 GB/s

Grid : Message : 6.665310 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 6.665313 s :     Time to complete            : 1.01728 s
Grid : Message : 6.665317 s :     Total performance           : 153.285 GFlops/s
Grid : Message : 6.665321 s :     Effective memory bandwidth  : 172.252 GB/s

Grid : Message : 6.717101 s : Lookup Table Benchmark with
Grid : Message : 6.717110 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 6.717115 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 6.717119 s :     precision           : double
Grid : Message : 6.717122 s :     nbasis              : 10

Grid : Message : 6.990692 s : Recalculation of coarsening lookup table finished
Grid : Message : 14.248431 s : 500 applications of vectorizableBlockProject
Grid : Message : 14.248442 s :     Time to complete            : 4.408 s
Grid : Message : 14.248456 s :     Total performance           : 35.3754 GFlops/s
Grid : Message : 14.248462 s :     Effective memory bandwidth  : 79.5052 GB/s

Grid : Message : 14.248471 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 14.248474 s :     Time to complete            : 0.944183 s
Grid : Message : 14.248478 s :     Total performance           : 165.153 GFlops/s
Grid : Message : 14.248482 s :     Effective memory bandwidth  : 371.177 GB/s

Grid : Message : 14.248491 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 14.248494 s :     Time to complete            : 1.76746 s
Grid : Message : 14.248498 s :     Total performance           : 88.2253 GFlops/s
Grid : Message : 14.248502 s :     Effective memory bandwidth  : 198.284 GB/s

Grid : Message : 0.529207 s : Grid Default Decomposition patterns
Grid : Message : 0.529211 s : 	OpenMP threads : 160
Grid : Message : 0.529222 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.529237 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.529242 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.529247 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.529252 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.535105 s : Lookup Table Benchmark with
Grid : Message : 0.535111 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 0.535116 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 0.535120 s :     precision           : single
Grid : Message : 0.535123 s :     nbasis              : 20

Grid : Message : 0.924128 s : Recalculation of coarsening lookup table finished
Grid : Message : 12.809460 s : 500 applications of vectorizableBlockProject
Grid : Message : 12.809670 s :     Time to complete            : 8.28297 s
Grid : Message : 12.810110 s :     Total performance           : 37.6519 GFlops/s
Grid : Message : 12.810160 s :     Effective memory bandwidth  : 40.3882 GB/s

Grid : Message : 12.810250 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 12.810280 s :     Time to complete            : 0.976487 s
Grid : Message : 12.810320 s :     Total performance           : 319.379 GFlops/s
Grid : Message : 12.810360 s :     Effective memory bandwidth  : 342.589 GB/s

Grid : Message : 12.810450 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 12.810480 s :     Time to complete            : 1.72642 s
Grid : Message : 12.810520 s :     Total performance           : 180.645 GFlops/s
Grid : Message : 12.810560 s :     Effective memory bandwidth  : 193.773 GB/s

Grid : Message : 12.142996 s : Lookup Table Benchmark with
Grid : Message : 12.143007 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 12.143012 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 12.143016 s :     precision           : double
Grid : Message : 12.143019 s :     nbasis              : 20

Grid : Message : 12.621330 s : Recalculation of coarsening lookup table finished
Grid : Message : 26.677100 s : 500 applications of vectorizableBlockProject
Grid : Message : 26.677110 s :     Time to complete            : 8.66789 s
Grid : Message : 26.677124 s :     Total performance           : 35.9799 GFlops/s
Grid : Message : 26.677129 s :     Effective memory bandwidth  : 77.1892 GB/s

Grid : Message : 26.677139 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 26.677142 s :     Time to complete            : 1.78944 s
Grid : Message : 26.677146 s :     Total performance           : 174.283 GFlops/s
Grid : Message : 26.677150 s :     Effective memory bandwidth  : 373.897 GB/s

Grid : Message : 26.677159 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 26.677162 s :     Time to complete            : 3.33704 s
Grid : Message : 26.677166 s :     Total performance           : 93.457 GFlops/s
Grid : Message : 26.677170 s :     Effective memory bandwidth  : 200.497 GB/s

AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.528135 s : Grid Default Decomposition patterns
Grid : Message : 0.528139 s : 	OpenMP threads : 160
Grid : Message : 0.528151 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528165 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528170 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528175 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528180 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534012 s : Lookup Table Benchmark with
Grid : Message : 0.534019 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 0.534023 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 0.534027 s :     precision           : single
Grid : Message : 0.534030 s :     nbasis              : 30

Grid : Message : 1.813340 s : Recalculation of coarsening lookup table finished
Grid : Message : 18.570490 s : 500 applications of vectorizableBlockProject
Grid : Message : 18.570600 s :     Time to complete            : 12.3767 s
Grid : Message : 18.571120 s :     Total performance           : 37.7973 GFlops/s
Grid : Message : 18.571170 s :     Effective memory bandwidth  : 39.9008 GB/s

Grid : Message : 18.571260 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 18.571290 s :     Time to complete            : 1.56822 s
Grid : Message : 18.571330 s :     Total performance           : 298.302 GFlops/s
Grid : Message : 18.571370 s :     Effective memory bandwidth  : 314.903 GB/s

Grid : Message : 18.571460 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 18.571490 s :     Time to complete            : 2.77427 s
Grid : Message : 18.571530 s :     Total performance           : 168.622 GFlops/s
Grid : Message : 18.571570 s :     Effective memory bandwidth  : 178.006 GB/s

Grid : Message : 18.121822 s : Lookup Table Benchmark with
Grid : Message : 18.121831 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 18.121836 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 18.121840 s :     precision           : double
Grid : Message : 18.121843 s :     nbasis              : 30

Grid : Message : 18.789223 s : Recalculation of coarsening lookup table finished
Grid : Message : 39.569991 s : 500 applications of vectorizableBlockProject
Grid : Message : 39.570001 s :     Time to complete            : 12.9197 s
Grid : Message : 39.570014 s :     Total performance           : 36.2085 GFlops/s
Grid : Message : 39.570019 s :     Effective memory bandwidth  : 76.447 GB/s

Grid : Message : 39.570028 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 39.570031 s :     Time to complete            : 2.61156 s
Grid : Message : 39.570035 s :     Total performance           : 179.128 GFlops/s
Grid : Message : 39.570039 s :     Effective memory bandwidth  : 378.193 GB/s

Grid : Message : 39.570048 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 39.570051 s :     Time to complete            : 4.88104 s
Grid : Message : 39.570055 s :     Total performance           : 95.8411 GFlops/s
Grid : Message : 39.570059 s :     Effective memory bandwidth  : 202.35 GB/s

AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.647942 s : Grid Default Decomposition patterns
Grid : Message : 0.647946 s : 	OpenMP threads : 160
Grid : Message : 0.647957 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.647971 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.647976 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.647981 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.647986 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.653825 s : Lookup Table Benchmark with
Grid : Message : 0.653831 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 0.653836 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 0.653840 s :     precision           : single
Grid : Message : 0.653843 s :     nbasis              : 40

Grid : Message : 1.354759 s : Recalculation of coarsening lookup table finished
Grid : Message : 23.344689 s : 500 applications of vectorizableBlockProject
Grid : Message : 23.344702 s :     Time to complete            : 16.4677 s
Grid : Message : 23.344755 s :     Total performance           : 37.8765 GFlops/s
Grid : Message : 23.344760 s :     Effective memory bandwidth  : 39.662 GB/s

Grid : Message : 23.344769 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 23.344772 s :     Time to complete            : 1.83499 s
Grid : Message : 23.344776 s :     Total performance           : 339.913 GFlops/s
Grid : Message : 23.344780 s :     Effective memory bandwidth  : 355.937 GB/s

Grid : Message : 23.344789 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 23.344792 s :     Time to complete            : 3.35439 s
Grid : Message : 23.344796 s :     Total performance           : 185.947 GFlops/s
Grid : Message : 23.344800 s :     Effective memory bandwidth  : 194.713 GB/s

Grid : Message : 23.416994 s : Lookup Table Benchmark with
Grid : Message : 23.417003 s :     fine fdimensions    : [24 24 24 24 ]
Grid : Message : 23.417008 s :     coarse fdimensions  : [6 6 6 6 ]
Grid : Message : 23.417012 s :     precision           : double
Grid : Message : 23.417015 s :     nbasis              : 40

Grid : Message : 24.279380 s : Recalculation of coarsening lookup table finished
Grid : Message : 51.837660 s : 500 applications of vectorizableBlockProject
Grid : Message : 51.837670 s :     Time to complete            : 17.1938 s
Grid : Message : 51.837683 s :     Total performance           : 36.277 GFlops/s
Grid : Message : 51.837688 s :     Effective memory bandwidth  : 75.9743 GB/s

Grid : Message : 51.837697 s : 500 applications of vectorizableBlockProjectUsingLut
Grid : Message : 51.837700 s :     Time to complete            : 3.46308 s
Grid : Message : 51.837704 s :     Total performance           : 180.111 GFlops/s
Grid : Message : 51.837708 s :     Effective memory bandwidth  : 377.204 GB/s

Grid : Message : 51.837717 s : 500 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 51.837720 s :     Time to complete            : 6.39736 s
Grid : Message : 51.837724 s :     Total performance           : 97.4994 GFlops/s
Grid : Message : 51.837728 s :     Effective memory bandwidth  : 204.191 GB/s

AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.529453 s : Grid Default Decomposition patterns
Grid : Message : 0.529457 s : 	OpenMP threads : 160
Grid : Message : 0.529468 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.529483 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.529488 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.529493 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.529498 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.535391 s : Lookup Table Benchmark with
Grid : Message : 0.535397 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 0.535401 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 0.535405 s :     precision           : single
Grid : Message : 0.535408 s :     nbasis              : 10

Grid : Message : 1.197067 s : Recalculation of coarsening lookup table finished
Grid : Message : 6.789372 s : 250 applications of vectorizableBlockProject
Grid : Message : 6.789382 s :     Time to complete            : 3.00698 s
Grid : Message : 6.789436 s :     Total performance           : 81.9477 GFlops/s
Grid : Message : 6.789443 s :     Effective memory bandwidth  : 92.0876 GB/s

Grid : Message : 6.789453 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 6.789456 s :     Time to complete            : 0.873524 s
Grid : Message : 6.789460 s :     Total performance           : 282.093 GFlops/s
Grid : Message : 6.789465 s :     Effective memory bandwidth  : 316.999 GB/s

Grid : Message : 6.789474 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 6.789477 s :     Time to complete            : 1.50526 s
Grid : Message : 6.789480 s :     Total performance           : 163.703 GFlops/s
Grid : Message : 6.789484 s :     Effective memory bandwidth  : 183.96 GB/s

Grid : Message : 6.951165 s : Lookup Table Benchmark with
Grid : Message : 6.951178 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 6.951183 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 6.951187 s :     precision           : double
Grid : Message : 6.951190 s :     nbasis              : 10

Grid : Message : 7.714357 s : Recalculation of coarsening lookup table finished
Grid : Message : 15.956845 s : 250 applications of vectorizableBlockProject
Grid : Message : 15.956856 s :     Time to complete            : 3.80823 s
Grid : Message : 15.956870 s :     Total performance           : 64.706 GFlops/s
Grid : Message : 15.956875 s :     Effective memory bandwidth  : 145.425 GB/s

Grid : Message : 15.956885 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 15.956888 s :     Time to complete            : 1.40102 s
Grid : Message : 15.956892 s :     Total performance           : 175.883 GFlops/s
Grid : Message : 15.956896 s :     Effective memory bandwidth  : 395.292 GB/s

Grid : Message : 15.956905 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 15.956908 s :     Time to complete            : 2.65974 s
Grid : Message : 15.956911 s :     Total performance           : 92.6465 GFlops/s
Grid : Message : 15.956915 s :     Effective memory bandwidth  : 208.221 GB/s

AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.528550 s : Grid Default Decomposition patterns
Grid : Message : 0.528554 s : 	OpenMP threads : 160
Grid : Message : 0.528565 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.528579 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.528585 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.528590 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.528595 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.534365 s : Lookup Table Benchmark with
Grid : Message : 0.534370 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 0.534374 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 0.534378 s :     precision           : single
Grid : Message : 0.534381 s :     nbasis              : 20

Grid : Message : 1.656560 s : Recalculation of coarsening lookup table finished
Grid : Message : 12.103406 s : 250 applications of vectorizableBlockProject
Grid : Message : 12.103417 s :     Time to complete            : 5.919 s
Grid : Message : 12.103466 s :     Total performance           : 83.2625 GFlops/s
Grid : Message : 12.103472 s :     Effective memory bandwidth  : 89.3135 GB/s

Grid : Message : 12.103483 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 12.103486 s :     Time to complete            : 1.46656 s
Grid : Message : 12.103490 s :     Total performance           : 336.046 GFlops/s
Grid : Message : 12.103494 s :     Effective memory bandwidth  : 360.467 GB/s

Grid : Message : 12.103503 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 12.103506 s :     Time to complete            : 2.67417 s
Grid : Message : 12.103511 s :     Total performance           : 184.293 GFlops/s
Grid : Message : 12.103515 s :     Effective memory bandwidth  : 197.686 GB/s

Grid : Message : 12.290963 s : Lookup Table Benchmark with
Grid : Message : 12.290974 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 12.290980 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 12.290984 s :     precision           : double
Grid : Message : 12.290987 s :     nbasis              : 20

Grid : Message : 13.610466 s : Recalculation of coarsening lookup table finished
Grid : Message : 29.433577 s : 250 applications of vectorizableBlockProject
Grid : Message : 29.433588 s :     Time to complete            : 7.42844 s
Grid : Message : 29.433601 s :     Total performance           : 66.3438 GFlops/s
Grid : Message : 29.433606 s :     Effective memory bandwidth  : 142.33 GB/s

Grid : Message : 29.433616 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 29.433619 s :     Time to complete            : 2.66304 s
Grid : Message : 29.433623 s :     Total performance           : 185.063 GFlops/s
Grid : Message : 29.433627 s :     Effective memory bandwidth  : 397.024 GB/s

Grid : Message : 29.433636 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 29.433639 s :     Time to complete            : 5.02374 s
Grid : Message : 29.433643 s :     Total performance           : 98.1003 GFlops/s
Grid : Message : 29.433647 s :     Effective memory bandwidth  : 210.459 GB/s

AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 0
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 1
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 2
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device Number    : 3
AcceleratorCudaInit: ========================
AcceleratorCudaInit: Device identifier: Tesla P100-SXM2-16GB
AcceleratorCudaInit:   totalGlobalMem: 17071734784 
AcceleratorCudaInit:   managedMemory: 1 
AcceleratorCudaInit:   isMultiGpuBoard: 0 
AcceleratorCudaInit:   warpSize: 32 
AcceleratorCudaInit: setting device to node rank
AcceleratorCudaInit: ================================================
SharedMemoryMpi:  World communicator of size 1
SharedMemoryMpi:  Node  communicator of size 1
SharedMemoryMpi:  SharedMemoryMPI.cc cudaMalloc 1073741824bytes at 0x110020000000 for comms buffers 

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.527723 s : Grid Default Decomposition patterns
Grid : Message : 0.527727 s : 	OpenMP threads : 160
Grid : Message : 0.527738 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.527752 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.527757 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.527762 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.527767 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.533675 s : Lookup Table Benchmark with
Grid : Message : 0.533682 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 0.533686 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 0.533690 s :     precision           : single
Grid : Message : 0.533693 s :     nbasis              : 30

Grid : Message : 2.115605 s : Recalculation of coarsening lookup table finished
Grid : Message : 17.764881 s : 250 applications of vectorizableBlockProject
Grid : Message : 17.764891 s :     Time to complete            : 8.82673 s
Grid : Message : 17.764943 s :     Total performance           : 83.7509 GFlops/s
Grid : Message : 17.764948 s :     Effective memory bandwidth  : 88.4117 GB/s

Grid : Message : 17.764959 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 17.764962 s :     Time to complete            : 2.22032 s
Grid : Message : 17.764966 s :     Total performance           : 332.945 GFlops/s
Grid : Message : 17.764970 s :     Effective memory bandwidth  : 351.474 GB/s

Grid : Message : 17.764979 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 17.764982 s :     Time to complete            : 4.02574 s
Grid : Message : 17.764986 s :     Total performance           : 183.63 GFlops/s
Grid : Message : 17.764990 s :     Effective memory bandwidth  : 193.849 GB/s

Grid : Message : 17.973322 s : Lookup Table Benchmark with
Grid : Message : 17.973336 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 17.973341 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 17.973345 s :     precision           : double
Grid : Message : 17.973348 s :     nbasis              : 30

Grid : Message : 19.884276 s : Recalculation of coarsening lookup table finished
Grid : Message : 43.225205 s : 250 applications of vectorizableBlockProject
Grid : Message : 43.225216 s :     Time to complete            : 11.049 s
Grid : Message : 43.225229 s :     Total performance           : 66.9062 GFlops/s
Grid : Message : 43.225234 s :     Effective memory bandwidth  : 141.259 GB/s

Grid : Message : 43.225243 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 43.225246 s :     Time to complete            : 3.90586 s
Grid : Message : 43.225250 s :     Total performance           : 189.266 GFlops/s
Grid : Message : 43.225254 s :     Effective memory bandwidth  : 399.598 GB/s

Grid : Message : 43.225263 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 43.225266 s :     Time to complete            : 7.35057 s
Grid : Message : 43.225270 s :     Total performance           : 100.57 GFlops/s
Grid : Message : 43.225274 s :     Effective memory bandwidth  : 212.334 GB/s

Current Grid git commit hash=63b0a19f370f643aa5b97f37bd1a18ea33a209f8: (HEAD, origin/feature/gpt, origin/HEAD, feature/gpt) clean

Grid : Message : ================================================ 
Grid : Message : MPI is initialised and logging filters activated 
Grid : Message : ================================================ 
Grid : Message : Requested 1073741824 byte stencil comms buffers 
Grid : Message : MemoryManager::Init() setting up
Grid : Message : MemoryManager::Init() cache pool for recent allocations: SMALL 32 LARGE 8
Grid : Message : MemoryManager::Init() Unified memory space
Grid : Message : MemoryManager::Init() Using cudaMallocManaged
Grid : Message : 0.527943 s : Grid Default Decomposition patterns
Grid : Message : 0.527947 s : 	OpenMP threads : 160
Grid : Message : 0.527958 s : 	MPI tasks      : 1 1 1 1 
Grid : Message : 0.527972 s : 	vRealF         : 512bits ; 2 2 2 2 
Grid : Message : 0.527977 s : 	vRealD         : 512bits ; 1 2 2 2 
Grid : Message : 0.527982 s : 	vComplexF      : 512bits ; 1 2 2 2 
Grid : Message : 0.527987 s : 	vComplexD      : 512bits ; 1 1 2 2 
Grid : Message : 0.533758 s : Lookup Table Benchmark with
Grid : Message : 0.533764 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 0.533768 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 0.533772 s :     precision           : single
Grid : Message : 0.533775 s :     nbasis              : 40

Grid : Message : 2.568811 s : Recalculation of coarsening lookup table finished
Grid : Message : 22.966113 s : 250 applications of vectorizableBlockProject
Grid : Message : 22.966123 s :     Time to complete            : 11.7333 s
Grid : Message : 22.966176 s :     Total performance           : 84.0056 GFlops/s
Grid : Message : 22.966181 s :     Effective memory bandwidth  : 87.9657 GB/s

Grid : Message : 22.966190 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 22.966193 s :     Time to complete            : 2.78534 s
Grid : Message : 22.966197 s :     Total performance           : 353.875 GFlops/s
Grid : Message : 22.966201 s :     Effective memory bandwidth  : 370.557 GB/s

Grid : Message : 22.966210 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 22.966213 s :     Time to complete            : 5.11949 s
Grid : Message : 22.966217 s :     Total performance           : 192.531 GFlops/s
Grid : Message : 22.966221 s :     Effective memory bandwidth  : 201.607 GB/s

Grid : Message : 23.194057 s : Lookup Table Benchmark with
Grid : Message : 23.194069 s :     fine fdimensions    : [32 32 32 32 ]
Grid : Message : 23.194074 s :     coarse fdimensions  : [8 8 8 8 ]
Grid : Message : 23.194078 s :     precision           : double
Grid : Message : 23.194081 s :     nbasis              : 40

Grid : Message : 25.683086 s : Recalculation of coarsening lookup table finished
Grid : Message : 56.755173 s : 250 applications of vectorizableBlockProject
Grid : Message : 56.755184 s :     Time to complete            : 14.6865 s
Grid : Message : 56.755199 s :     Total performance           : 67.1134 GFlops/s
Grid : Message : 56.755204 s :     Effective memory bandwidth  : 140.554 GB/s

Grid : Message : 56.755213 s : 250 applications of vectorizableBlockProjectUsingLut
Grid : Message : 56.755216 s :     Time to complete            : 5.22948 s
Grid : Message : 56.755220 s :     Total performance           : 188.482 GFlops/s
Grid : Message : 56.755224 s :     Effective memory bandwidth  : 394.734 GB/s

Grid : Message : 56.755233 s : 250 applications of vectorizableBlockProjectUsingNoLut
Grid : Message : 56.755236 s :     Time to complete            : 9.77977 s
Grid : Message : 56.755240 s :     Total performance           : 100.786 GFlops/s
Grid : Message : 56.755244 s :     Effective memory bandwidth  : 211.074 GB/s