{
  "uuid": "f244db0c-44c7-46fa-a9ef-f0188725d39f",
  "created": "2017-11-16T00:00:00+00:00",
  "metadata": {
    "citation_count": 4,
    "citation_count_without_self_citations": 3,
    "core": true,
    "titles": [
      {
        "title": "Accelerating HPC codes on Intel(R) Omni-Path Architecture networks: From particle physics to Machine Learning",
        "source": "arXiv"
      }
    ],
    "$schema": "https://inspirehep.net/schemas/records/hep.json",
    "authors": [
      {
        "uuid": "ea7cf3f3-a70f-44b2-a7e2-c67594446173",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/1015525"
        },
        "full_name": "Boyle, Peter",
        "affiliations": [
          {
            "value": "Edinburgh U.",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902787"
            }
          }
        ],
        "signature_block": "BYLp",
        "curated_relation": true
      },
      {
        "uuid": "4157c8ca-fb75-4d8c-9449-491aa92486c4",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/2442196"
        },
        "full_name": "Chuvelev, Michael",
        "affiliations": [
          {
            "value": "Intel, Santa Clara",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/906225"
            }
          }
        ],
        "signature_block": "CAVALAFm"
      },
      {
        "uuid": "a68515fd-a402-4c37-90da-9e072c48a63f",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/1045359"
        },
        "full_name": "Cossu, Guido",
        "affiliations": [
          {
            "value": "Edinburgh U.",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902787"
            }
          },
          {
            "value": "KEK, Tsukuba",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902916"
            }
          }
        ],
        "signature_block": "CASg"
      },
      {
        "uuid": "6285132d-dfff-4cd8-a20e-e47ae4af4cbd",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/1721047"
        },
        "full_name": "Kelly, Christopher",
        "affiliations": [
          {
            "value": "Columbia U.",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902749"
            }
          },
          {
            "value": "Brookhaven",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902689"
            }
          }
        ],
        "signature_block": "CALYc",
        "curated_relation": true
      },
      {
        "uuid": "501d1483-ff5d-4819-b1fa-827c2a3dafbc",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/1063281"
        },
        "full_name": "Lehner, Christoph",
        "affiliations": [
          {
            "value": "Brookhaven Natl. Lab.",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/1268258"
            }
          },
          {
            "value": "Brookhaven",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/902689"
            }
          }
        ],
        "signature_block": "LANARc",
        "curated_relation": true
      },
      {
        "uuid": "bc16572a-8058-487c-add0-7d37c831bc84",
        "record": {
          "$ref": "https://inspirehep.net/api/authors/2442199"
        },
        "full_name": "Meadows, Lawrence",
        "affiliations": [
          {
            "value": "Intel, Santa Clara",
            "record": {
              "$ref": "https://inspirehep.net/api/institutions/906225"
            }
          }
        ],
        "signature_block": "MADl"
      }
    ],
    "curated": true,
    "figures": [
      {
        "key": "1f0561b0b0eb0ff9cffad5107f1d2581",
        "url": "https://inspirehep.net/files/1f0561b0b0eb0ff9cffad5107f1d2581",
        "source": "arxiv",
        "caption": "Bandwidth delivered before and after optimisation steps were taken. Only the (blocking) MPI calls themselves were included in the timing. In the optimised code 35GB/s bidirectional bandwidth is delivered, and that is 70\\% of line-rate, and contrasts well against the 10\\% delivered by the original code.",
        "filename": "all.png"
      },
      {
        "key": "63d959a6c1d516522fc112f59d931002",
        "url": "https://inspirehep.net/files/63d959a6c1d516522fc112f59d931002",
        "source": "arxiv",
        "caption": "Total time for the communication vs. vector length before and after our optimisation. Latency dominates for small vectors, but a substantial gain is possible in the large packet/bandwidth limited end of the curve. Note the scale is logarithmic.",
        "filename": "time_comms.png"
      },
      {
        "key": "62b01120d92e330e4689391bffdbe704",
        "url": "https://inspirehep.net/files/62b01120d92e330e4689391bffdbe704",
        "source": "arxiv",
        "caption": "Total time for all computation, buffer copying and memory allocation vs. vector length before and after our optimisation. Threading overhead dominates the ``optimised'' code on small vector lengths with 64 active threads. Clearly, further optimisation is possible, making the thead count used vary with the vector length would yield a solution that is always optimal for all vector lengths. This study does not attempt to do this and focuses on the large vector performance since this is most relevant for large, complex and demanding neural network problems.",
        "filename": "time_compute.png"
      },
      {
        "key": "e0cc8b0d62cb489cc1803bba0a893514",
        "url": "https://inspirehep.net/files/e0cc8b0d62cb489cc1803bba0a893514",
        "source": "arxiv",
        "caption": "Percentage of time spent in communications calls vs. vector length before and after our optimisation. The computation becomes dominant at large vector lengths in the original code, but is sub-dominant in the optimised code \\emph{despite} the optimised code being over 10x faster. This is not unreasonable since the threading of the relevant loops should gain a factor of O(64) on many core processors.",
        "filename": "percent_comms.png"
      },
      {
        "key": "f2b8ce76cced0a631f8c5667215abb63",
        "url": "https://inspirehep.net/files/f2b8ce76cced0a631f8c5667215abb63",
        "source": "arxiv",
        "caption": "Wall clock time per reduction call vs. vector length before and after our optimisation. The large vector reduction performance is ten times better after our optimisations on large vector lengths. The gain includes both computation acceleration and communication acceleration.",
        "filename": "time_total.png"
      }
    ],
    "license": [
      {
        "url": "http://arxiv.org/licenses/nonexclusive-distrib/1.0/",
        "imposing": "arXiv"
      }
    ],
    "texkeys": [
      "Boyle:2017xcy"
    ],
    "citeable": true,
    "keywords": [
      {
        "value": "programming",
        "schema": "INSPIRE"
      },
      {
        "value": "performance",
        "schema": "INSPIRE"
      },
      {
        "value": "computer: network",
        "schema": "INSPIRE"
      },
      {
        "value": "data management",
        "schema": "INSPIRE"
      },
      {
        "value": "computer: communications",
        "schema": "INSPIRE"
      },
      {
        "value": "multiprocessor",
        "schema": "INSPIRE"
      }
    ],
    "abstracts": [
      {
        "value": "We discuss practical methods to ensure near wirespeed performance from clusters with either one or two Intel(R) Omni-Path host fabric interfaces (HFI) per node, and Intel(R) Xeon Phi(TM) 72xx (Knight's Landing) processors, and using the Linux operating system. The study evaluates the performance improvements achievable and the required programming approaches in two distinct example problems: firstly in Cartesian communicator halo exchange problems, appropriate for structured grid PDE solvers that arise in quantum chromodynamics simulations of particle physics, and secondly in gradient reduction appropriate to synchronous stochastic gradient descent for machine learning. As an example, we accelerate a published Baidu Research reduction code and obtain a factor of ten speedup over the original code using the techniques discussed in this paper. This displays how a factor of ten speedup in strongly scaled distributed machine learning could be achieved when synchronous stochastic gradient descent is massively parallelised with a fixed mini-batch size. We find a significant improvement in performance robustness when memory is obtained using carefully allocated 2MB \"huge\" virtual memory pages, implying that either non-standard allocation routines should be used for communication buffers. These can be accessed via a LD\\_PRELOAD override in the manner suggested by libhugetlbfs. We make use of a the Intel(R) MPI 2019 library \"Technology Preview\" and underlying software to enable thread concurrency throughout the communication software stake via multiple PSM2 endpoints per process and use of multiple independent MPI communicators. When using a single MPI process per node, we find that this greatly accelerates delivered bandwidth in many core Intel(R) Xeon Phi processors.",
        "source": "arXiv"
      }
    ],
    "references": [
      {
        "reference": {
          "dois": [
            "10.1109/MM.2016.58"
          ],
          "misc": [
            "Enabling Scalable High-Performance Systems with the Intel Omni-Path Architecture Mark Debbage, Ram Huggahalli, James Kunz, Tom Lovett, Todd Rimmer, Keith D. Underwood, Robert C. Zak Intel IEEE Micro ( Volume: 36, Issue: 4, July-Aug.)"
          ],
          "label": "1",
          "authors": [
            {
              "full_name": "Birrittella, Mark S."
            }
          ],
          "publication_info": {
            "year": 2016
          }
        }
      },
      {
        "reference": {
          "urls": [
            {
              "value": "http://www.hoti.org/hoti23/slides/rimmer.pdf"
            }
          ],
          "label": "2"
        }
      },
      {
        "reference": {
          "dois": [
            "10.1109/HOTCHIPS.2015.7477467"
          ],
          "misc": [
            "Knights landing (KNL): 2nd Generation Intel® Xeon Phi processor, Avinash Sodani, Intel Corp Hot Chips 27 Symposium (HCS)"
          ],
          "label": "3",
          "imprint": {
            "publisher": "IEEE"
          },
          "publication_info": {
            "year": 2015
          }
        }
      },
      {
        "reference": {
          "urls": [
            {
              "value": "http://www.oracle.com/technetwork/articles/servers-storage-dev/hugepages-2099009.html"
            }
          ],
          "label": "4"
        }
      },
      {
        "record": {
          "$ref": "https://inspirehep.net/api/literature/1409303"
        },
        "reference": {
          "misc": [
            "SISSA (-07-15) Conference: C15-07-14"
          ],
          "urls": [
            {
              "value": "https://pos.sissa.it/251/023/pdf"
            }
          ],
          "label": "5",
          "publication_info": {
            "year": 2016,
            "artid": "023",
            "page_start": "023",
            "journal_title": "PoS",
            "journal_volume": "LATTICE2015"
          }
        },
        "curated_relation": false
      },
      {
        "reference": {
          "urls": [
            {
              "value": "https://github.com/baidu-research/DeepBench"
            }
          ],
          "label": "6"
        }
      },
      {
        "reference": {
          "misc": [
            "research.baidu.com/bringing-hpc-techniques-deep-learning/"
          ],
          "urls": [
            {
              "value": "https://github.com/baidu-research/baidu-allreduce"
            }
          ],
          "label": "7"
        }
      },
      {
        "reference": {
          "misc": [
            "It is worth commenting that in the Grid code above, a STL compliant C++ allocator is used consistently in the code, using template typedefs, to replace the standard allocator. The ten most recently deallocated large vector allocations are cached, with a round robin victim and lazy release, so that repeated reallocation of same sized vectors is efficient. This avoids releasing memory to the operating system in tight inner loops. Memory allocation caching is a valid and useful HPC optimisation and can be made generally applicable. It is even relatively simple if consistently using STL vectors and template typedef’s"
          ],
          "label": "8"
        }
      }
    ],
    "public_notes": [
      {
        "value": "17 pages, 5 figures",
        "source": "arXiv"
      }
    ],
    "arxiv_eprints": [
      {
        "value": "1711.04883",
        "categories": [
          "cs.DC",
          "cs.AI",
          "hep-lat"
        ]
      }
    ],
    "document_type": [
      "article"
    ],
    "preprint_date": "2017-11-13",
    "control_number": 1636204,
    "legacy_version": "20180108221721.0",
    "number_of_pages": 16,
    "inspire_categories": [
      {
        "term": "Computing"
      },
      {
        "term": "Lattice"
      }
    ],
    "legacy_creation_date": "2017-11-16"
  },
  "revision_id": 111,
  "links": {
    "bibtex": "https://inspirehep.net/api/literature/1636204?format=bibtex",
    "latex-eu": "https://inspirehep.net/api/literature/1636204?format=latex-eu",
    "latex-us": "https://inspirehep.net/api/literature/1636204?format=latex-us",
    "json": "https://inspirehep.net/api/literature/1636204?format=json",
    "json-expanded": "https://inspirehep.net/api/literature/1636204?format=json-expanded",
    "cv": "https://inspirehep.net/api/literature/1636204?format=cv",
    "citations": "https://inspirehep.net/api/literature/?q=refersto%3Arecid%3A1636204"
  },
  "id": "1636204",
  "updated": "2025-08-04T17:25:05.961337+00:00"
}