{
  "links": {
    "bibtex": "https://inspirehep.net/api/literature/1511682?format=bibtex",
    "latex-eu": "https://inspirehep.net/api/literature/1511682?format=latex-eu",
    "latex-us": "https://inspirehep.net/api/literature/1511682?format=latex-us",
    "json": "https://inspirehep.net/api/literature/1511682?format=json",
    "json-expanded": "https://inspirehep.net/api/literature/1511682?format=json-expanded",
    "cv": "https://inspirehep.net/api/literature/1511682?format=cv",
    "citations": "https://inspirehep.net/api/literature/?q=refersto%3Arecid%3A1511682"
  },
  "updated": "2025-08-04T17:24:25.323056+00:00",
  "revision_id": 108,
  "id": "1511682",
  "uuid": "3e87db88-9b61-4d88-bd60-3193346fece2",
  "metadata": {
    "citation_count_without_self_citations": 9,
    "citation_count": 10,
    "publication_info": [
      {
        "cnum": "C16-07-24",
        "year": 2017,
        "artid": "013",
        "page_start": "013",
        "journal_title": "PoS",
        "parent_record": {
          "$ref": "https://inspirehep.net/api/literature/1391579"
        },
        "journal_record": {
          "$ref": "https://inspirehep.net/api/journals/1213080"
        },
        "journal_volume": "LATTICE2016",
        "conference_record": {
          "$ref": "https://inspirehep.net/api/conferences/1391578"
        }
      }
    ],
    "core": true,
    "dois": [
      {
        "value": "10.22323/1.256.0013"
      }
    ],
    "titles": [
      {
        "title": "Machines and Algorithms",
        "source": "arXiv"
      }
    ],
    "$schema": "https://inspirehep.net/schemas/records/hep.json",
    "authors": [
      {
        "ids": [
          {
            "value": "INSPIRE-00068790",
            "schema": "INSPIRE ID"
          }
        ],
        "uuid": "bdca04f3-0917-49da-a076-c1ceba36efb1",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/1015525"
        },
        "full_name": "Boyle, Peter A",
        "affiliations": [
          {
            "value": "Edinburgh U.",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902787"
            }
          }
        ],
        "signature_block": "BYLp",
        "curated_relation": true
      }
    ],
    "curated": true,
    "figures": [
      {
        "key": "d7c343c739af0028e7b44caebcdaa2f8",
        "url": "https://inspirehep.net/files/d7c343c739af0028e7b44caebcdaa2f8",
        "source": "arxiv",
        "caption": "The continued Moore's law increase in transistor density has ceased to be accompanied by growth in CPU frequency\\cite{KNLbook,borkar}: the need to avoid transistor leakage has limited voltage reductions and gate delay has stalled; in addition a wire delay floor has been exposed which affects the degree to which a chip can be internally connected. As a consequence the transistor budgets have increasingly been spent growing on-chip parallelism (cores, threads and vectors).",
        "filename": "Fig01-04.png"
      },
      {
        "key": "b7d52f2ee9622b918ba3389616949242",
        "url": "https://inspirehep.net/files/b7d52f2ee9622b918ba3389616949242",
        "source": "arxiv",
        "caption": "The continued Moore's law increase in transistor density has ceased to be accompanied by growth in CPU frequency\\cite{KNLbook,borkar}: the need to avoid transistor leakage has limited voltage reductions and gate delay has stalled; in addition a wire delay floor has been exposed which affects the degree to which a chip can be internally connected. As a consequence the transistor budgets have increasingly been spent growing on-chip parallelism (cores, threads and vectors).",
        "filename": "Fig01-01.png"
      },
      {
        "key": "eb8f2112ab7823b63a4fa7be80966009",
        "url": "https://inspirehep.net/files/eb8f2112ab7823b63a4fa7be80966009",
        "source": "arxiv",
        "caption": "The continued Moore's law increase in transistor density has ceased to be accompanied by growth in CPU frequency\\cite{KNLbook,borkar}: the need to avoid transistor leakage has limited voltage reductions and gate delay has stalled; in addition a wire delay floor has been exposed which affects the degree to which a chip can be internally connected. As a consequence the transistor budgets have increasingly been spent growing on-chip parallelism (cores, threads and vectors).",
        "filename": "Fig01-02.png"
      },
      {
        "key": "2e90d3ed0dc9b062c67103e89aab0908",
        "url": "https://inspirehep.net/files/2e90d3ed0dc9b062c67103e89aab0908",
        "source": "arxiv",
        "caption": "Roadmap documented floating point peak per system (left), and per node (middle) versus planned year of introduction. Increased computer performance largely comes from increased parallelism. \\label{fig:netbw} (right) Network bandwidth per node versus planned year of introduction. Compared to the BlueGene/Q system, which was highly scalable for QCD simulations, a factor of 400 increase in per node single precision performance is anticipated; however, this is accompanied by very little improvement in interconnect bandwidth. It is clear that algorithmic changes will be required to make best use of such computers.",
        "filename": "flopssys.png"
      },
      {
        "key": "df63c9eb6b4b3abd94feb4442e605789",
        "url": "https://inspirehep.net/files/df63c9eb6b4b3abd94feb4442e605789",
        "source": "arxiv",
        "caption": "Roadmap documented floating point peak per system (left), and per node (middle) versus planned year of introduction. Increased computer performance largely comes from increased parallelism. \\label{fig:netbw} (right) Network bandwidth per node versus planned year of introduction. Compared to the BlueGene/Q system, which was highly scalable for QCD simulations, a factor of 400 increase in per node single precision performance is anticipated; however, this is accompanied by very little improvement in interconnect bandwidth. It is clear that algorithmic changes will be required to make best use of such computers.",
        "filename": "flopspn.png"
      },
      {
        "key": "af95140a01f6e00bc345ee41fcf2b534",
        "url": "https://inspirehep.net/files/af95140a01f6e00bc345ee41fcf2b534",
        "source": "arxiv",
        "caption": "Roadmap documented floating point peak per system (left), and per node (middle) versus planned year of introduction. Increased computer performance largely comes from increased parallelism. \\label{fig:netbw} (right) Network bandwidth per node versus planned year of introduction. Compared to the BlueGene/Q system, which was highly scalable for QCD simulations, a factor of 400 increase in per node single precision performance is anticipated; however, this is accompanied by very little improvement in interconnect bandwidth. It is clear that algorithmic changes will be required to make best use of such computers.",
        "filename": "netbw.png"
      },
      {
        "key": "94e9663c65fba342a51bc0411224657a",
        "url": "https://inspirehep.net/files/94e9663c65fba342a51bc0411224657a",
        "source": "arxiv",
        "caption": "(Left) We display a micrograph of a through silicon via bus structure giving thousands of bit lanes connecting memory chips using a stacked configuration to give wiring geometries favourable from an energy and delay perspective. Examples are displayed in use in recent 2.5D computational devices from Nvidia Pascal GP100 (HBM with silicon interposer)\\cite{gp100}, and Intel Knights Landing (using Intel's own memory approach) \\cite{KNLbook}.",
        "filename": "1-TSV-in-memory-x-800.png"
      },
      {
        "key": "3c2eb5850be35f9688357fd516c0e9b2",
        "url": "https://inspirehep.net/files/3c2eb5850be35f9688357fd516c0e9b2",
        "source": "arxiv",
        "caption": "(Left) We display a micrograph of a through silicon via bus structure giving thousands of bit lanes connecting memory chips using a stacked configuration to give wiring geometries favourable from an energy and delay perspective. Examples are displayed in use in recent 2.5D computational devices from Nvidia Pascal GP100 (HBM with silicon interposer)\\cite{gp100}, and Intel Knights Landing (using Intel's own memory approach) \\cite{KNLbook}.",
        "filename": "PASCAL.png"
      },
      {
        "key": "634504df26bcce9a19f89b2b0a2f2974",
        "url": "https://inspirehep.net/files/634504df26bcce9a19f89b2b0a2f2974",
        "source": "arxiv",
        "caption": "(Left) We display a micrograph of a through silicon via bus structure giving thousands of bit lanes connecting memory chips using a stacked configuration to give wiring geometries favourable from an energy and delay perspective. Examples are displayed in use in recent 2.5D computational devices from Nvidia Pascal GP100 (HBM with silicon interposer)\\cite{gp100}, and Intel Knights Landing (using Intel's own memory approach) \\cite{KNLbook}.",
        "filename": "HMC.png"
      },
      {
        "key": "fbea9573d04bce302267a00fd41abfc1",
        "url": "https://inspirehep.net/files/fbea9573d04bce302267a00fd41abfc1",
        "source": "arxiv",
        "caption": "Left: an active optical cable carrying just four bit lanes in each direction and costing around \\$1000 USD. The cost is dominated by the transceiver electronics on either end. Right: a passive optical cable carrying 64 bit lanes. There is little prospect for any further improvement in density; the pitch is set by the need for a single grain of dust to not block the light path of any bit. Indeed, optical lenses broaden the beam inside these connectors to enhance the blockage tolerance.",
        "filename": "FCBN425QB1C_SPL.png"
      },
      {
        "key": "535f15b20ebaea41d74bce4cc072aeae",
        "url": "https://inspirehep.net/files/535f15b20ebaea41d74bce4cc072aeae",
        "source": "arxiv",
        "caption": "Left: an active optical cable carrying just four bit lanes in each direction and costing around \\$1000 USD. The cost is dominated by the transceiver electronics on either end. Right: a passive optical cable carrying 64 bit lanes. There is little prospect for any further improvement in density; the pitch is set by the need for a single grain of dust to not block the light path of any bit. Indeed, optical lenses broaden the beam inside these connectors to enhance the blockage tolerance.",
        "filename": "intel-corning-fiber-640x424.png"
      },
      {
        "key": "0874dfc05d5f48cf12968972804bba80",
        "url": "https://inspirehep.net/files/0874dfc05d5f48cf12968972804bba80",
        "source": "arxiv",
        "caption": "The Intel Knights Landing processor die and a schematic diagram of the on-chip tiled arrangement and memory system mesh. High bandwidth memory on package augments the 6 channel DDR4 interface giving good memory bandwidth. 36 lanes of PCIe Gen 3.0 interface are included, in principle giving a good I/O bandwidth (over 64GB/s bidirectional) to whatever interconnect one can afford to combine this node with. The Tile consists of a pair of processor cores and 1MB local L2 cache.",
        "filename": "intel-knl-dieshot-rs.png"
      },
      {
        "key": "47bceeb8264abbbccdac903131c303e4",
        "url": "https://inspirehep.net/files/47bceeb8264abbbccdac903131c303e4",
        "source": "arxiv",
        "caption": "The Intel Knights Landing processor die and a schematic diagram of the on-chip tiled arrangement and memory system mesh. High bandwidth memory on package augments the 6 channel DDR4 interface giving good memory bandwidth. 36 lanes of PCIe Gen 3.0 interface are included, in principle giving a good I/O bandwidth (over 64GB/s bidirectional) to whatever interconnect one can afford to combine this node with. The Tile consists of a pair of processor cores and 1MB local L2 cache.",
        "filename": "KNLmesh.png"
      },
      {
        "key": "2e598a1b5b0a6c99bc7dd4edda3f94d1",
        "url": "https://inspirehep.net/files/2e598a1b5b0a6c99bc7dd4edda3f94d1",
        "source": "arxiv",
        "caption": "The Intel Knights Landing processor die and a schematic diagram of the on-chip tiled arrangement and memory system mesh. High bandwidth memory on package augments the 6 channel DDR4 interface giving good memory bandwidth. 36 lanes of PCIe Gen 3.0 interface are included, in principle giving a good I/O bandwidth (over 64GB/s bidirectional) to whatever interconnect one can afford to combine this node with. The Tile consists of a pair of processor cores and 1MB local L2 cache.",
        "filename": "KNL-F.png"
      },
      {
        "key": "7378dd8404efac6255d924c9af90c6be",
        "url": "https://inspirehep.net/files/7378dd8404efac6255d924c9af90c6be",
        "source": "arxiv",
        "caption": "Left: The Nvidia PASCAL GPU consists of a number of ``streaming multiprocessors''. Each of these has a single instruction fetch engine which can operate in a data parallel fashion on a number of SIMD lanes. These SIMD registers contain both floating point and integer ``programme state''. Nvidia refers to each lane as a CUDA core; however, the CUDA cores do not fetch instructions independently. If different threads branch in a data dependent manner the shared instruction fetch must sequence both directions serially. Given the difference in the nature of the core between architectures, it is most reasonable to compare the number of floating point SIMD lanes between CPUs and GPUs as a like for like comparison. Right: The new NVLink interface demonstrating internal node connectivity for multi-GPU nodes. In some IBM Power8 products NVLink can replace PCI express for the purpose of bulk transfers to the host memory and/or interconnection network, helping to address bottlenecks in scaling multi-GPU simulations on previous generations.",
        "filename": "SM.png"
      },
      {
        "key": "a7f2e96bd12c6eaf13c5e69b65ca4e17",
        "url": "https://inspirehep.net/files/a7f2e96bd12c6eaf13c5e69b65ca4e17",
        "source": "arxiv",
        "caption": "Left: The Nvidia PASCAL GPU consists of a number of ``streaming multiprocessors''. Each of these has a single instruction fetch engine which can operate in a data parallel fashion on a number of SIMD lanes. These SIMD registers contain both floating point and integer ``programme state''. Nvidia refers to each lane as a CUDA core; however, the CUDA cores do not fetch instructions independently. If different threads branch in a data dependent manner the shared instruction fetch must sequence both directions serially. Given the difference in the nature of the core between architectures, it is most reasonable to compare the number of floating point SIMD lanes between CPUs and GPUs as a like for like comparison. Right: The new NVLink interface demonstrating internal node connectivity for multi-GPU nodes. In some IBM Power8 products NVLink can replace PCI express for the purpose of bulk transfers to the host memory and/or interconnection network, helping to address bottlenecks in scaling multi-GPU simulations on previous generations.",
        "filename": "NVLINK.png"
      },
      {
        "key": "bea67a7cc96b030c462e3d90603cb1f0",
        "url": "https://inspirehep.net/files/bea67a7cc96b030c462e3d90603cb1f0",
        "source": "arxiv",
        "caption": "\\label{fig:qphix1} Left: Grid performance for $L_s=16$ applications of the Wilson operator on a single KNL 7250 node. The Haswell benchmarks are from Cori Phase-1 32 core dual Haswell Cray XC40 nodes. The memory system performance of the KNL at large memory footprints demonstrates the efficacy of the memory integration, and the performance difference is marked. Right: The performance of the QPhiX Wilson dslash on KNL for a variety of partial vector layout transformations, hyperthreads per core, and running from both the 6 channel DDR memory and the on package MCDRAM. Again, the benefits of high performance 3D memory integration are clear. In contrast to Grid assembly code, the best performance is obtained with more than one hyperthread per core since the latency tolerance to stack evictions is greater.",
        "filename": "GridKNL.png"
      },
      {
        "key": "8d510942784036b04e0ed82de2beadc1",
        "url": "https://inspirehep.net/files/8d510942784036b04e0ed82de2beadc1",
        "source": "arxiv",
        "caption": "\\label{fig:qphix1} Left: Grid performance for $L_s=16$ applications of the Wilson operator on a single KNL 7250 node. The Haswell benchmarks are from Cori Phase-1 32 core dual Haswell Cray XC40 nodes. The memory system performance of the KNL at large memory footprints demonstrates the efficacy of the memory integration, and the performance difference is marked. Right: The performance of the QPhiX Wilson dslash on KNL for a variety of partial vector layout transformations, hyperthreads per core, and running from both the 6 channel DDR memory and the on package MCDRAM. Again, the benefits of high performance 3D memory integration are clear. In contrast to Grid assembly code, the best performance is obtained with more than one hyperthread per core since the latency tolerance to stack evictions is greater.",
        "filename": "wilson_dslash.png"
      },
      {
        "key": "b783cca909617f45ad2074791f5f7f75",
        "url": "https://inspirehep.net/files/b783cca909617f45ad2074791f5f7f75",
        "source": "arxiv",
        "caption": "Multinode scaling of key QCD routines in KNL over a single rail Omnipath 1.0 network. The Wilson and Clover Dslash routines give around 250GF/s per node in single precision in multinode code.",
        "filename": "scaling_knl.png"
      },
      {
        "key": "a9f0792e941d64543c262c12f22bcaa4",
        "url": "https://inspirehep.net/files/a9f0792e941d64543c262c12f22bcaa4",
        "source": "arxiv",
        "caption": "Left: MILC code double precision single node multimass HISQ conjugate gradient performance in flat memory mode. The MILC+QPhiX code substantially beats the scalar MILC code, emphasizing the need to use compiler intrinsics or other vectorisation schemes. The code is heavily linear algebra dominated (it is a multimass solver) and this limits performance. Around 80\\% of available memory bandwidth is consumed. Right: Multinode KNL performance of the multimass solver. Since linear algebra dominates the communication, the performance scales with node count very well, at least over the limited test range.",
        "filename": "cg_knl_flat.png"
      },
      {
        "key": "560cbbbd131b2045b6a39add7fd4f5ff",
        "url": "https://inspirehep.net/files/560cbbbd131b2045b6a39add7fd4f5ff",
        "source": "arxiv",
        "caption": "Left: MILC code double precision single node multimass HISQ conjugate gradient performance in flat memory mode. The MILC+QPhiX code substantially beats the scalar MILC code, emphasizing the need to use compiler intrinsics or other vectorisation schemes. The code is heavily linear algebra dominated (it is a multimass solver) and this limits performance. Around 80\\% of available memory bandwidth is consumed. Right: Multinode KNL performance of the multimass solver. Since linear algebra dominates the communication, the performance scales with node count very well, at least over the limited test range.",
        "filename": "mpi_flops_knl.png"
      },
      {
        "key": "93b1e0d686583a545542d9ddf7e13946",
        "url": "https://inspirehep.net/files/93b1e0d686583a545542d9ddf7e13946",
        "source": "arxiv",
        "caption": "Left: MILC code double precision single node multimass HISQ conjugate gradient performance in flat memory mode. The MILC+QPhiX code substantially beats the scalar MILC code, emphasizing the need to use compiler intrinsics or other vectorisation schemes. The code is heavily linear algebra dominated (it is a multimass solver) and this limits performance. Around 80\\% of available memory bandwidth is consumed. Right: Multinode KNL performance of the multimass solver. Since linear algebra dominates the communication, the performance scales with node count very well, at least over the limited test range.",
        "filename": "mpi_flops_bdwb0.png"
      },
      {
        "key": "62bec965bc3c4fcbb89760171676ebbe",
        "url": "https://inspirehep.net/files/62bec965bc3c4fcbb89760171676ebbe",
        "source": "arxiv",
        "caption": "Left: Multinode Broadwell dual node performance of the multimass solver. Since linear algebra dominates the communication bandwidth that is common to both the Broadwell nodes and the KNL nodes is irrelevant in this benchmark and the memory bandwidth of the KNL system leaves one KNL chip substantially faster than two Broadwell chips in a dual socket node. Right: The single node HISQ Dslash developed by Patrick Steinbrecher (Bielefeld \\& BNL)\\cite{steinbrecher} for multiple right hand sides saturates at around 1100 GF/s in single precision on a KNL 7250 part. When combined with the algorithmic efficiency of block solvers\\cite{Wagner}, this approach has multiplied attractions of giving greater numerical and algorithmic performance and is compelling. Right: A reminder of how far scaling can be pursued (to 1.6M cores) on BlueGene/Q where the integration of strong interconnect bandwidth in a way that scales well with the floating point performance enables a cheap route to massive scalability. In contrast, if powerful nodes are coupled to weak networks most of the available floating point would lie idle if even moderate node counts were used on a tractable lattice volume. There is no substitute for around 1 GB/s per sustained Gflop/s.",
        "filename": "knl_peter.png"
      },
      {
        "key": "edb320d7c5d879d3733f7423fd5aa720",
        "url": "https://inspirehep.net/files/edb320d7c5d879d3733f7423fd5aa720",
        "source": "arxiv",
        "caption": "Left: Multinode Broadwell dual node performance of the multimass solver. Since linear algebra dominates the communication bandwidth that is common to both the Broadwell nodes and the KNL nodes is irrelevant in this benchmark and the memory bandwidth of the KNL system leaves one KNL chip substantially faster than two Broadwell chips in a dual socket node. Right: The single node HISQ Dslash developed by Patrick Steinbrecher (Bielefeld \\& BNL)\\cite{steinbrecher} for multiple right hand sides saturates at around 1100 GF/s in single precision on a KNL 7250 part. When combined with the algorithmic efficiency of block solvers\\cite{Wagner}, this approach has multiplied attractions of giving greater numerical and algorithmic performance and is compelling. Right: A reminder of how far scaling can be pursued (to 1.6M cores) on BlueGene/Q where the integration of strong interconnect bandwidth in a way that scales well with the floating point performance enables a cheap route to massive scalability. In contrast, if powerful nodes are coupled to weak networks most of the available floating point would lie idle if even moderate node counts were used on a tractable lattice volume. There is no substitute for around 1 GB/s per sustained Gflop/s.",
        "filename": "Weak_Scale_update_v6_in_nodes.png"
      },
      {
        "key": "f094ef33776a8ccc55a88f11af4de1eb",
        "url": "https://inspirehep.net/files/f094ef33776a8ccc55a88f11af4de1eb",
        "source": "arxiv",
        "caption": "Left: Clover term + $D_W$ relevant to the solution of standard Clover-Wilson fermions on the Pascal GP100 GPU in single precision. Right: 16 RHS $D_W$ relevant to multiple Wilson RHS solvers and to various 5d Chiral fermion approaches such as DWF. Results are in single precision on a Pascal GP100.",
        "filename": "clov.png"
      },
      {
        "key": "d0d4fbb8c304ccb5cf211144c7979c21",
        "url": "https://inspirehep.net/files/d0d4fbb8c304ccb5cf211144c7979c21",
        "source": "arxiv",
        "caption": "Left: Clover term + $D_W$ relevant to the solution of standard Clover-Wilson fermions on the Pascal GP100 GPU in single precision. Right: 16 RHS $D_W$ relevant to multiple Wilson RHS solvers and to various 5d Chiral fermion approaches such as DWF. Results are in single precision on a Pascal GP100.",
        "filename": "dslash4.png"
      },
      {
        "key": "2c4fda3e8ad45bd6922734eeeed68d48",
        "url": "https://inspirehep.net/files/2c4fda3e8ad45bd6922734eeeed68d48",
        "source": "arxiv",
        "caption": "Left: Multi RHS Staggered matrix multiply on the GP100 part. Again roughly 2x performance per node is obtained compared to KNL in single node code. Right: Scaling across DGX-1 8 gpu system; for sufficiently large volumes (e.g. $48^4$) the scaling is linear to 8 nodes, and around 10TFlop/s.",
        "filename": "stagrhs.png"
      },
      {
        "key": "6a17ed6e64c65998a04d2317edd0ae4c",
        "url": "https://inspirehep.net/files/6a17ed6e64c65998a04d2317edd0ae4c",
        "source": "arxiv",
        "caption": "Left: Multi RHS Staggered matrix multiply on the GP100 part. Again roughly 2x performance per node is obtained compared to KNL in single node code. Right: Scaling across DGX-1 8 gpu system; for sufficiently large volumes (e.g. $48^4$) the scaling is linear to 8 nodes, and around 10TFlop/s.",
        "filename": "dgx1.png"
      },
      {
        "key": "8a6963b387801be57c35d460bfeab31b",
        "url": "https://inspirehep.net/files/8a6963b387801be57c35d460bfeab31b",
        "source": "arxiv",
        "caption": "We display results (left) from SGI having developed a perfect mapping scheme between logical and physical interconnect topologies (right). No weak scaling slow down is seen even for the largest system sizes. Results were produced on the Cheyenne system at NCAR. Compared to the default MPI Cartesian communicator a 4x speedup is seen on the largest partition sizes, demonstrating the effectiveness of topology aware job placement.",
        "filename": "NCAR.png"
      },
      {
        "key": "24ce8f455a9f095beb513f71998107f9",
        "url": "https://inspirehep.net/files/24ce8f455a9f095beb513f71998107f9",
        "source": "arxiv",
        "caption": "We display results (left) from SGI having developed a perfect mapping scheme between logical and physical interconnect topologies (right). No weak scaling slow down is seen even for the largest system sizes. Results were produced on the Cheyenne system at NCAR. Compared to the default MPI Cartesian communicator a 4x speedup is seen on the largest partition sizes, demonstrating the effectiveness of topology aware job placement.",
        "filename": "2D_wormhole.png"
      },
      {
        "key": "af7cafe4223ed3f63be0ac4a51197d4d",
        "url": "https://inspirehep.net/files/af7cafe4223ed3f63be0ac4a51197d4d",
        "source": "arxiv",
        "caption": "Left: 2D staggered multigrid elimination of critical slowing down. Right: Improvement in a disconnected correlation function from multi-level integration.",
        "filename": "fdslash_l128t128b100n4_peter.png"
      },
      {
        "key": "495c8b0a64ab375b3d551286a9b64408",
        "url": "https://inspirehep.net/files/495c8b0a64ab375b3d551286a9b64408",
        "source": "arxiv",
        "caption": "Left: 2D staggered multigrid elimination of critical slowing down. Right: Improvement in a disconnected correlation function from multi-level integration.",
        "filename": "disco.png"
      }
    ],
    "license": [
      {
        "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "imposing": "arXiv"
      },
      {
        "license": "CC-BY-NC-SA",
        "imposing": "SISSA"
      }
    ],
    "texkeys": [
      "Boyle:2017wul",
      "Boyle:2017vhi"
    ],
    "citeable": true,
    "imprints": [
      {
        "date": "2017-02-01",
        "publisher": "SISSA"
      }
    ],
    "keywords": [
      {
        "value": "quantum chromodynamics",
        "schema": "INSPIRE"
      },
      {
        "value": "numerical calculations",
        "schema": "INSPIRE"
      },
      {
        "value": "computer: performance",
        "schema": "INSPIRE"
      },
      {
        "value": "engineering",
        "schema": "INSPIRE"
      },
      {
        "value": "computer: communications",
        "schema": "INSPIRE"
      },
      {
        "value": "numerical methods: efficiency",
        "schema": "INSPIRE"
      }
    ],
    "abstracts": [
      {
        "value": "I discuss the evolution of computer architectures with a focus on QCD and with reference to the interplay between architecture, engineering, data motion and algorithms. New architectures are discussed and recent performance results are displayed. I also review recent progress in multilevel solver and integration algorithms.",
        "source": "arXiv"
      }
    ],
    "references": [
      {
        "reference": {
          "misc": [
            "Intel Xeon Phi(TM) Processor High Performance Programming, Knights Landing Edition, by Jim Jeffers, James Reinders, and Avinash Sodani"
          ],
          "label": "1"
        }
      },
      {
        "reference": {
          "urls": [
            {
              "value": "https://images.nvidia.com/content/pdf/tesla/whitepaper/pascal-architecture-whitepaper.pdf"
            }
          ],
          "label": "2"
        }
      },
      {
        "reference": {
          "misc": [
            "Shekhar Borkar Communications of the ACM, Vol. 54 No. 5, Pages 67-77"
          ],
          "label": "3",
          "authors": [
            {
              "full_name": "Chien, Andrew A."
            }
          ]
        }
      },
      {
        "reference": {
          "urls": [
            {
              "value": "https://en.wikipedia.org/wiki/3D_XPoint"
            }
          ],
          "label": "4"
        }
      },
      {
        "reference": {
          "misc": [
            "Wm SIGARCH Comput. Archit. News 23, 1 (March), 20-24"
          ],
          "label": "5",
          "authors": [
            {
              "full_name": "Wulf, A."
            },
            {
              "full_name": "McKee, Sally A."
            }
          ],
          "publication_info": {
            "year": 1995
          }
        }
      },
      {
        "reference": {
          "misc": [
            "PP65-76"
          ],
          "label": "6",
          "authors": [
            {
              "full_name": "Williams, S."
            },
            {
              "full_name": "Waterman, A."
            },
            {
              "full_name": "Patterson, D."
            }
          ],
          "publication_info": {
            "year": 2009,
            "artid": "4",
            "page_start": "4",
            "journal_title": "Commun.ACM",
            "journal_volume": "52"
          }
        }
      },
      {
        "reference": {
          "misc": [
            "PoS Lattice"
          ],
          "label": "7",
          "authors": [
            {
              "full_name": "Wagner, M."
            },
            {
              "full_name": "Clark, K."
            }
          ],
          "publication_info": {
            "year": 2016
          }
        }
      },
      {
        "reference": {
          "misc": [
            "Computer Architecture: Hennessy and Patterson, Morgan Kaufmann"
          ],
          "label": "8",
          "authors": [
            {
              "full_name": "Approach, A. Quantitative"
            }
          ]
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1312379"
        },
        "reference": {
          "dois": [
            "10.1109/IPDPS.2014.112"
          ],
          "misc": [
            "and B. Joó"
          ],
          "label": "9",
          "authors": [
            {
              "full_name": "Winter, F.T."
            },
            {
              "full_name": "Clark, M.A."
            },
            {
              "full_name": "Edwards, R.G."
            }
          ],
          "arxiv_eprint": "1408.5925"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1297647"
        },
        "reference": {
          "label": "9",
          "authors": [
            {
              "full_name": "Winter, F.T."
            }
          ],
          "publication_info": {
            "year": 2014,
            "artid": "042",
            "page_start": "042",
            "journal_title": "PoS",
            "journal_volume": "LATTICE2013"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/954998"
        },
        "reference": {
          "label": "9",
          "authors": [
            {
              "full_name": "Winter, F."
            }
          ],
          "arxiv_eprint": "1111.5596"
        },
        "curated_relation": false
      },
      {
        "reference": {
          "misc": [
            "IBM Journal of Research and Development (Volume: 49, Issue: 2.3, March)"
          ],
          "label": "10",
          "title": {
            "title": "Overview of the QCDSP and QCDOC computers"
          },
          "publication_info": {
            "year": 2005
          }
        }
      },
      {
        "reference": {
          "misc": [
            "IBM Journal of Research and Development, Volume 57 Issue 1, January"
          ],
          "label": "11",
          "publication_info": {
            "year": 2013
          }
        }
      },
      {
        "reference": {
          "misc": [
            "www.hotchips.org/wp-content/uploads/hc_archives/hc27/HC27.25-Tuesday-Epub/HC27.25.70Processors-Epub/HC27.25.710-Knights-Landing-Sodani-Intel.pdf"
          ],
          "label": "12"
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1409303"
        },
        "reference": {
          "label": "13",
          "authors": [
            {
              "full_name": "Boyle, P.A."
            },
            {
              "full_name": "Cossu, G."
            },
            {
              "full_name": "Yamaguchi, A."
            },
            {
              "full_name": "Portelli, A."
            }
          ],
          "publication_info": {
            "year": 2016,
            "artid": "023",
            "page_start": "023",
            "journal_title": "PoS",
            "journal_volume": "LATTICE2015"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/875156"
        },
        "reference": {
          "label": "14",
          "authors": [
            {
              "full_name": "Babich, R."
            },
            {
              "full_name": "Clark, M.A."
            },
            {
              "full_name": "Joo, B."
            }
          ],
          "arxiv_eprint": "1011.0024"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1495555"
        },
        "reference": {
          "label": "15",
          "authors": [
            {
              "full_name": "DeTar, C."
            },
            {
              "full_name": "Doerfler, D."
            },
            {
              "full_name": "Gottlieb, S."
            },
            {
              "full_name": "Jha, A."
            },
            {
              "full_name": "Kalamkar, D."
            },
            {
              "full_name": "Li, R."
            },
            {
              "full_name": "Toussaint, D."
            }
          ],
          "arxiv_eprint": "1611.00728"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1315118"
        },
        "reference": {
          "label": "16",
          "authors": [
            {
              "full_name": "Mukherjee, S."
            },
            {
              "full_name": "Kaczmarek, O."
            },
            {
              "full_name": "Schmidt, C."
            },
            {
              "full_name": "Steinbrecher, P."
            },
            {
              "full_name": "Wagner, M."
            }
          ],
          "arxiv_eprint": "1409.1510",
          "publication_info": {
            "year": 2015,
            "artid": "044",
            "page_start": "044",
            "journal_title": "PoS",
            "journal_volume": "LATTICE2014"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1328508"
        },
        "reference": {
          "dois": [
            "10.3204/DESY-PROC-2014-05/28"
          ],
          "label": "16",
          "authors": [
            {
              "full_name": "Kaczmarek, O."
            },
            {
              "full_name": "Schmidt, C."
            },
            {
              "full_name": "Steinbrecher, P."
            },
            {
              "full_name": "Wagner, M."
            }
          ],
          "arxiv_eprint": "1411.4439"
        },
        "curated_relation": false
      },
      {
        "reference": {
          "misc": [
            "Lattice QCD on Intel Xeon Phi coprocessors. Balint Joo Karthikeyan Vaidyanathan, Mikhail Smelyanskiy, Kiran Pamnany, Victor W Lee Pradeep Dubey, and William Watson III. In Proceedings of the 2013 International Supercomputing Conference, June 2013. Joo, Balint, et al International Conference on High Performance Computing International Publishing"
          ],
          "label": "17",
          "title": {
            "title": "Optimizing Wilson-Dirac Operator and Linear Solvers for Intel KNL."
          },
          "authors": [
            {
              "full_name": "Kalamkar, Dhiraj D."
            }
          ],
          "imprint": {
            "publisher": "Springer"
          },
          "publication_info": {
            "year": 2016
          }
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1502349"
        },
        "reference": {
          "label": "18",
          "authors": [
            {
              "full_name": "Jin, X.Y."
            },
            {
              "full_name": "Osborn, J.C."
            }
          ],
          "arxiv_eprint": "1612.02750",
          "publication_info": {
            "year": 2016,
            "artid": "187",
            "page_start": "187",
            "journal_title": "PoS",
            "journal_volume": "ICHEP2016"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1505182"
        },
        "reference": {
          "misc": [
            "M. D'Mello, M. Troute and R. Vemuri PoS LATTICE261"
          ],
          "label": "19",
          "title": {
            "title": "A performance evaluation of CCS QCD Benchmark on the COMA (Intel(R) Xeon Phi, KNC) system"
          },
          "authors": [
            {
              "full_name": "Boku, T."
            },
            {
              "full_name": "Ishikawa, K.I."
            },
            {
              "full_name": "Kuramashi, Y."
            },
            {
              "full_name": "Meadows, L."
            }
          ],
          "arxiv_eprint": "1612.06556",
          "publication_info": {
            "year": 2016
          }
        },
        "curated_relation": false
      },
      {
        "reference": {
          "urls": [
            {
              "value": "https://www.nextplatform.com/2016/04/06/dgx-1-nvidias-deep-learning-system-newbies/"
            }
          ],
          "label": "20"
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1216422"
        },
        "reference": {
          "misc": [
            "Edinburgh U"
          ],
          "label": "21",
          "title": {
            "title": "The BlueGene/Q supercomputer"
          },
          "authors": [
            {
              "full_name": "Boyle, P.A."
            }
          ],
          "publication_info": {
            "year": 2012,
            "artid": "020",
            "page_start": "020",
            "journal_title": "PoS",
            "journal_volume": "LATTICE2012"
          }
        },
        "curated_relation": false
      },
      {
        "reference": {
          "misc": [
            "PoS Lattice"
          ],
          "label": "22",
          "title": {
            "title": "Staggered Multigrid"
          },
          "authors": [
            {
              "full_name": "Weinberg, E."
            },
            {
              "full_name": "Brower, R."
            },
            {
              "full_name": "Clark, K."
            },
            {
              "full_name": "Strelchenko, A."
            }
          ],
          "publication_info": {
            "year": 2016
          }
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1499497"
        },
        "reference": {
          "label": "23",
          "title": {
            "title": "Hierarchically deflated conjugate residual"
          },
          "authors": [
            {
              "full_name": "Yamaguchi, A."
            },
            {
              "full_name": "Boyle, P."
            }
          ],
          "arxiv_eprint": "1611.06944"
        },
        "curated_relation": false
      },
      {
        "reference": {
          "label": "24",
          "title": {
            "title": "Tofu: A 6D Mesh/Torus Inter-connect for Exascale Computers"
          },
          "authors": [
            {
              "full_name": "Ajima, Y."
            },
            {
              "full_name": "Sumimoto, S."
            },
            {
              "full_name": "Shimizu, T."
            }
          ],
          "imprint": {
            "publisher": "IEEE"
          },
          "publication_info": {
            "year": 2009,
            "page_end": "40",
            "page_start": "36",
            "journal_title": "Computer",
            "journal_volume": "42"
          }
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1390978"
        },
        "reference": {
          "label": "25",
          "title": {
            "title": "Metadynamics surfing on topology barriers: the CPN1 case"
          },
          "authors": [
            {
              "full_name": "Laio, A."
            },
            {
              "full_name": "Martinelli, G."
            },
            {
              "full_name": "Sanfilippo, F."
            }
          ],
          "publication_info": {
            "year": 2016,
            "artid": "089",
            "page_start": "089",
            "journal_title": "JHEP",
            "journal_volume": "07"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1495254"
        },
        "reference": {
          "label": "26",
          "title": {
            "title": "Density-of-states"
          },
          "authors": [
            {
              "full_name": "Langfeld, K."
            }
          ],
          "arxiv_eprint": "1610.09856"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1495028"
        },
        "reference": {
          "label": "27",
          "title": {
            "title": "Applications of Jarzynski’s relation in lattice gauge theories"
          },
          "authors": [
            {
              "full_name": "Nada, A."
            },
            {
              "full_name": "Caselle, M."
            },
            {
              "full_name": "Costagliola, G."
            },
            {
              "full_name": "Panero, M."
            },
            {
              "full_name": "Toniato, A."
            }
          ],
          "arxiv_eprint": "1610.09017"
        },
        "curated_relation": false
      },
      {
        "reference": {
          "misc": [
            "PoS Lattice"
          ],
          "label": "28",
          "title": {
            "title": "Computing the density of states with the global Hybrid Monte Carlo"
          },
          "authors": [
            {
              "full_name": "Pellegrini, R."
            },
            {
              "full_name": "Vadacchino, D."
            },
            {
              "full_name": "Lucini, B."
            },
            {
              "full_name": "Rago, A."
            }
          ],
          "publication_info": {
            "year": 2016
          }
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1495425"
        },
        "reference": {
          "label": "29",
          "title": {
            "title": "Overcoming strong metastabilities with the LLR method"
          },
          "authors": [
            {
              "full_name": "Lucini, B."
            },
            {
              "full_name": "Fall, W."
            },
            {
              "full_name": "Langfeld, K."
            }
          ],
          "arxiv_eprint": "1611.00019"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1505179"
        },
        "reference": {
          "label": "30",
          "authors": [
            {
              "full_name": "Ce, M."
            },
            {
              "full_name": "Giusti, L."
            },
            {
              "full_name": "Schaefer, S."
            }
          ],
          "arxiv_eprint": "1612.06424"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1485709"
        },
        "reference": {
          "label": "30",
          "authors": [
            {
              "full_name": "Ce, M."
            },
            {
              "full_name": "Giusti, L."
            },
            {
              "full_name": "Schaefer, S."
            }
          ],
          "arxiv_eprint": "1609.02419"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1415761"
        },
        "reference": {
          "label": "30",
          "authors": [
            {
              "full_name": "Ce, M."
            },
            {
              "full_name": "Giusti, L."
            },
            {
              "full_name": "Schaefer, S."
            }
          ],
          "publication_info": {
            "year": 2016,
            "artid": "094507",
            "journal_title": "Phys.Rev.D",
            "journal_volume": "93"
          }
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1495722"
        },
        "reference": {
          "label": "31",
          "authors": [
            {
              "full_name": "Bacchio, S."
            },
            {
              "full_name": "Alexandrou, C."
            },
            {
              "full_name": "Finkenrath, J."
            },
            {
              "full_name": "Frommer, A."
            },
            {
              "full_name": "Kahl, K."
            },
            {
              "full_name": "Rottmann, M."
            }
          ],
          "arxiv_eprint": "1611.01034"
        },
        "curated_relation": false
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1490670"
        },
        "reference": {
          "label": "31",
          "authors": [
            {
              "full_name": "Alexandrou, C."
            },
            {
              "full_name": "Bacchio, S."
            },
            {
              "full_name": "Finkenrath, J."
            },
            {
              "full_name": "Frommer, A."
            },
            {
              "full_name": "Kahl, K."
            },
            {
              "full_name": "Rottmann, M."
            }
          ],
          "publication_info": {
            "year": 2016,
            "artid": "114509",
            "journal_title": "Phys.Rev.D",
            "journal_volume": "94"
          }
        },
        "curated_relation": false
      },
      {
        "reference": {
          "misc": [
            "PoS Lattice"
          ],
          "label": "32",
          "title": {
            "title": "Domain Wall Fermion Simulations with the Exact One-Flavor Algorithm"
          },
          "authors": [
            {
              "full_name": "Murphy, D."
            }
          ],
          "publication_info": {
            "year": 2016
          }
        }
      }
    ],
    "arxiv_eprints": [
      {
        "value": "1702.00208",
        "categories": [
          "hep-lat",
          "physics.comp-ph"
        ]
      }
    ],
    "document_type": [
      "conference paper"
    ],
    "preprint_date": "2017-02-01",
    "control_number": 1511682,
    "legacy_version": "20190409165654.0",
    "deleted_records": [
      {
        "$ref": "https://inspirehep.net/api/literature/1589506"
      }
    ],
    "number_of_pages": 12,
    "inspire_categories": [
      {
        "term": "Lattice"
      },
      {
        "term": "Computing"
      }
    ],
    "legacy_creation_date": "2017-02-02"
  },
  "created": "2017-02-02T00:00:00+00:00"
}