# Hierarchical Computations on Manycore Architectures: The HiCMA Library

David Keyes, Applied Mathematics & Computational Science Director, Extreme Computing Research Center (ECRC) King Abdullah University of Science and Technology david.keyes@kaust.edu.sa → Abbreviated and updated version of the archived "SIAM Presents" plenary

https://www.pathlms.com/siam/courses/4150/sections/5818

#### from SIAM CSE, 28 Feb 2017:

"Algorithmic Adaptations to Extreme Scale"

# **Two decades of evolution in hardware** 1997 2017



# ASCI Red at Sandia 1.3 TF/s, 850 KW

#### Cavium ThunderX2 (ARM) ~1.1 TF/s, ~0.2 KW

3.5 orders of

magnitude

#### **Supercomputer in a node**

| System                    | Peak DP | Peak Power | Power<br>Efficiency |
|---------------------------|---------|------------|---------------------|
|                           | TFlop/s | KW         | GFlop/s/Watt        |
| ASCI Red                  | 1.3     | 850        | 0.0015              |
| ThunderX2<br>Cavium (ARM) | 1.1     | 0.20       | 5.5                 |

### Supercomputer in a node

| System                     | Peak DP<br>TFlop/s | Peak Power<br>KW | Power<br>Efficiency<br>GFlop/s/Watt |
|----------------------------|--------------------|------------------|-------------------------------------|
| ASCI Red                   | 1.3                | 850              | 0.0015                              |
| ThunderX2<br>Cavium (ARM)  | 1.1                | 0.20             | 5.5*                                |
| Knights Landing<br>Intel   | 3.5                | 0.26             | 14                                  |
| P100 Pascal<br>NVIDIA      | 5.3                | 0.30             | 18                                  |
| Taihu Light<br>CAS         | 125,000            | 15,000           | 8.3                                 |
| Exascale System<br>(~2021) | 1,000,000          | 20,000           | 50                                  |

\* 8 memory channels in Cavium ARM vs. 6 for Intel KNL

#### **Architectural trends**

- Clock rates cease to increase while arithmetic capability continues to increase through concurrency (flooding of cores)
- Memory storage capacity increases, but fails to keep up with arithmetic capability *per core*
- Transmission capability memory BW and network BW – increases, but fails to keep up with arithmetic capability *per core*

# Well established resource trade-offs

- Communication-avoiding algorithms
  - exploit extra memory to achieve theoretical lower bound on communication volume
- Synchronization-avoiding algorithms
  - perform extra flops between global reductions or exchanges to require fewer global operations
- High-order discretizations
  - perform more flops per degree of freedom
     (DOF) to store and manipulate fewer DOFs



# \$€£¥

of scientific software worldwide hangs in the balance, until our algorithmic infrastructure evolves to span the architecture-applications gap

## **Architectural background** www.exascale.org/iesp



ROADMAP 1.0



Richard Kenway

David Keyes

Bill Kramer

Jesus Labarta

Alain Lichnewsky

Thomas Lippert Bob Lucas

Barnev Maccabe

Satoshi Matsuoka

Paul Messina

Peter Michielse

Bernd Mohr

Jack Dongarra Pete Beckman Terry Moore Patrick Aerts Giovanni Aloisio Jean-Claude Andre David Barkai lean-Yves Berthou Taisuke Boku Bertrand Braunschweig Franck Cappello Barbara Chapman Xuebin Chi

Alok Choudhary Sudip Dosanjh Thom Dunning Sandro Fiore Al Geist Bill Gropp Robert Harrison Mark Hereld Michael Heroux Adolfy Hoisie Koh Hotta Yutaka Ishikawa Fred Johnson

SPONSORS



EPSRC

R



FUITSU



Wolfgang Nagel

Hiroshi Nakashima

Michael E. Papka

Dan Reed

Mitsuhisa Sato

Ed Seidel

John Shalf

David Skinner

Marc Snir

Thomas Sterling

Rick Stevens

Fred Streitz



🔧 東京大学



Shinji Sumimoto

William Tang

John Taylor

Rajeev Thakur

Anne Trefethen

Mateo Valero

Aad van der Steen

Jeffrev Vetter

Peg Williams

Robert Wisniewski

Kathy Yelick

CERF/O

The International Exascale Software Roadmap

J. Dongarra, P. Beckman, et al., International Journal of *High Performance Computer* Applications 25:3-60, 2011.

CRAY







 $\mathbf{T}$ 



# **Uptake from IESP meetings**

- While obtaining the next order of magnitude of performance, we need another order of performance efficiency
  - target: 50 Gigaflop/s/W, today typically ~ 5 Gigaflop/s/W
- Required reduction in power per flop and per byte may make computing and moving data less reliable
  - circuit elements will be smaller and subject to greater physical noise per signal, with less space redundancy and/or time redundancy for resilience in the hardware
- Power may be cycled off and on, or clocks slowed and speeded
  - may be scheduled, based on phases with different power requirements, or may be dynamic from thermal monitoring
- Performance rates less reliable

## Node-based "weak scaling" is routine; thread-based "strong scaling" is the game

- An exascale configuration: 1 million 1000-way 1GHz nodes
- Expanding the number of nodes (processor-memory units) beyond 10<sup>6</sup> would *not* be a serious threat to algorithms that lend themselves to well-amortized precise load balancing
  - provided that the nodes are performance reliable
- Real challenge is usefully expanding the number of cores sharing memory on a node to 10<sup>3</sup>
  - must be done while memory and memory bandwidth per node expand by (at best) ten-fold less (basically "strong" scaling)

# $\rightarrow$ Don't need to wait for full exascale systems to experiment in this regime...



The contest is being waged on individual sharedmemory nodes today

### The familiar



## The challenge



How are most scientific simulations implemented at the petascale today?

- Iterative methods based on data decomposition and message-passing
  - data structures are distributed
  - each individual processor works on a subdomain of the original
  - exchanges information with other processors that own data with which it interacts causally, to evolve in time or to establish equilibrium
  - computation and neighbor communication are both fully parallelized and their ratio remains constant in weak scaling
- The programming model is BSP/SPMD/CSP
  - Bulk Synchronous Programming
  - Single Program, Multiple Data
  - Communicating Sequential Processes

Three decades of stability in programming model

## Bulk Synchronous Parallelism



# Leslie Valiant, F.R.S., N.A.S. 2010 Turing Award Winner



Comm. of the ACM, 1990

#### **BSP** parallelism w/ domain decomposition



## **BSP** has an impressive legacy

By the Gordon Bell Prize, performance on *real applications* (e.g., mechanics, materials, petroleum reservoirs, etc.) has improved *more* than a million times in two decades. Simulation *cost per performance* has improved by nearly a million times.

| Gordon Bell<br>Prize: Peak<br>Performance<br><b>Year</b> | Gigaflop/s<br>delivered to<br>applications | Gordon Bell<br>Prize: Price<br>Performance | Cost per<br>delivered<br>Gigaflop/s |
|----------------------------------------------------------|--------------------------------------------|--------------------------------------------|-------------------------------------|
| 1988                                                     | 1                                          | 1989                                       | \$2,500,000                         |
| 1998                                                     | 1,020                                      | 1999                                       | \$6 <i>,</i> 900                    |
| 2008                                                     | 1,350,000                                  | 2009                                       | \$8                                 |

# **Riding exponentials**

- Proceeded steadily for decades from giga- (1988) to tera- (1998) to peta- (2008) with
  - same BSP programming model
  - same assumptions about who (hardware, systems software, applications software, etc.) is responsible for what (resilience, performance, processor mapping, etc.)
  - same classes of algorithms (cf. 25 yrs. of Gordon Bell Prizes)
- Scientific computing now at a crossroads with respect to extreme scale

# Main challenge going forward for BSP

- Almost all "good" algorithms in linear algebra, differential equations, integral equations, signal analysis, etc., like to globally synchronize – and frequently!
  - inner products, norms, pivots, fresh residuals are "addictive" idioms
  - tends to hurt efficiency beyond 100,000 processors
  - can be fragile for smaller concurrency, as well, due to algorithmic load imbalance, hardware performance variation, etc.
- Concurrency is heading into the billions of cores
  - already 10 million on the most powerful system today

# Energy-aware generation

# BSP generation

# Some algorithmic imperatives

- Reduce communication and synchrony
  - in frequency and/or span
  - see Grigori, Jolivet, Chow in this MS 7
- Reside "high" on the memory hierarchy
  - as close as possible to the processing elements
  - see Li, Schlatter, Dubey, Bader in this MS 7/16
- Increase SIMT/SIMD-style shared-memory concurrency
  - see many of the above

## Widely applicable strategies

- 1) Employ dynamic runtime systems based on directed acyclic task graphs (DAGs)
  - e.g., Charm++, Quark, StarPU, Legion, OmpSs, HPX, ADLB, Argo
- 2) Exploit data sparsity of hierarchically lowrank type
  - meet the "curse of dimensionality" with the "blessing of low rank"
- 3) Code to the architecture, but present an abstract API

# 1) Taskification based on DAGs

- Advantages
  - remove artifactual synchronizations in the form of subroutine boundaries
  - remove artifactual orderings in the form of prescheduled loops
  - expose more concurrency
- Disadvantages
  - pay overhead of managing task graph
  - potentially lose some memory locality

# 2) Hierarchically low-rank operators

#### • Advantages

- shrink memory footprints to live higher on the memory hierarchy
  - higher means quick access (↑ arithmetic intensity)
- reduce operation counts
- tune work to accuracy requirements
  - e.g., preconditioner versus solver
- Disadvantages
  - pay cost of compression
  - not all operators compress well

# 3) Code to the architecture

#### • Advantages

- tiling and recursive subdivision create large numbers of small problems suitable for batched operations on GPUs and MICs
  - reduce call overheads
  - polyalgorithmic approach based on block size
- non-temporal stores, coalesced memory accesses, double-buffering, etc. reduce sensitivity to memory
- Disadvantages
  - code is more complex
  - code is architecture-specific at the bottom

#### **Reducing over-ordering and synchronization through DAGs, ex.: generalized eigensolver**

#### $Ax = \lambda Bx$



# Loop nests and subroutine calls, with their over-orderings, can be replaced with DAGs

- Diagram shows a dataflow ordering of the steps of a 4×4 symmetric generalized eigensolver
- Nodes are tasks, colorcoded by type, and edges are data dependencies
- Time is vertically downward
- Wide is good; short is good



## Loops can be overlapped in time

Green, blue and magenta symbols represent tasks in separate loop bodies with dependences from an adaptive optics computation





# **DAG-based safe out-of-order execution**



17:13

18:16

19:9







c/o H. Ltaief (KAUST) & D. Gratadour (OdP)

**Reducing memory footprint and operation complexity with low rank** 

- When dense blocks arise in matrix operations, replace them with hierarchical representations
- Use high accuracy (high rank, but typically less than full) to build "exact" solvers
- Use low accuracy (low rank) to build preconditioners
- Block structure and rank provide useful tuning parameters for migration onto variety of hardware configurations

## **Key tool: hierarchical matrices**

- [Hackbusch, 1999] : off-diagonal blocks of typical differential and integral operators have low effective rank
- By exploiting low rank, k, memory requirements and operation counts approach optimal in matrix dimension n:
  - **polynomial in** k
  - lin-log in n
  - constants carry the day
- Such hierarchical representations navigate a compromise
  - fewer blocks of larger rank ("weak admissibility") or
  - more blocks of smaller rank ("strong admissibility")

## **Example: 1D Laplacian**



#### **Recursive construction of an** *H***-matrix**







c/o W. Boukaram & G. Turkiyyah (KAUST)

### "Standard (strong)" vs. "weak" admissibility



strong admissibility

weak admissibility

After Hackbusch, et al., 2003

## Some solvers that leverage data sparsity

Please notify if you have released one that is not here: gustavo.chavez

@kaust.edu.sa

| Package            | Technique                                    | Format          | Contact                                 |
|--------------------|----------------------------------------------|-----------------|-----------------------------------------|
| ACR                | Cyclic reduction                             | н               | G. Chávez, G. Turkiyyah<br>D. Keyes 118 |
| AHMED              | $\mathcal{H}^{-1}$ & $\mathcal{H}$ -LU       | н               | M. Bebendorf [120]                      |
| BLR PaStiX         | Supernodal                                   | BLR             | G. Pichon<br>M. Faverge 44              |
| CE                 | $\mathcal{H}^2$ -LU                          | $\mathcal{H}^2$ | D. Sushnikova<br>I. Oseledets [57]      |
| DMHIF              | Multifrontal                                 | ID              | Y. Li<br>L. Ying [43]                   |
| DMHM               | Newton-Schulz                                | H               | Y. Li<br>L. Ying [121]                  |
| ExaFMM             | Fast multipole                               | н               | H. Ibeid, R. Yokota<br>D. Keyes [122]   |
| H2Lib              | $\mathcal{H}^{-1}$ & $\mathcal{H}^2$ -LU     | $\mathcal{H}^2$ | S. Christophersen<br>S. Böerm[123]      |
| HLib               | $\mathcal{H}^{-1}$ & $\mathcal{H}$ -LU       | H               | L. Grasedyck<br>W. Hackbusch 124        |
| HLibPro            | $\mathcal{H}^{-1}$ & $\mathcal{H}\text{-LU}$ | н               | R. Kriemann<br>W. Hackbusch 54          |
| LoRaSp             | $\mathcal{H}^2$ -LU                          | $\mathcal{H}^2$ | H. Pouransari<br>E. Darve 56            |
| MF-HODLR           | Multifrontal                                 | HODLR           | A. Aminfar<br>E. Darve 30               |
| MUMPS-BLR          | Multifrontal                                 | BLR             | T. Mary<br>P. R. Amestoy [22]           |
| Structured CHOLMOD | Supernodal                                   | BLR             | J. Chadwick<br>D. Bindel 46             |
| STRUMPACK-Sparse   | Multifrontal                                 | HSS             | FH. Rouet, P. Ghysels<br>X.S. Li 36     |



c/o G. Chavez (KAUST)

## **"Hourglass" model for algorithms** (borrowed from internet protocols)





# Hierarchical Computations on Manycore Architectures: HiCMA\*



\* "Hikmah" is the Arabic word for wisdom

## QDWH\*-EVD/SVD

- DAG-based dataflow tile algorithms for (eigen- and) singular value decomposition
- ♦ Reduces synchrony
- Increases SIMT-style concurrency through recursion
- Employs Chameleon tile library and StarPU dynamic runtime system

\*QR-based Dynamically Weighted Halley iteration from Stable and Efficient Spectral Divide and Conquer Algorithms for the Symmetric Eigenvalue Decomposition and the SVD, Y. Nakatsukasa & N. Higham, SISC (2013)

Asynchronous Task-Based Polar Decomposition on Massively Parallel Systems, D. Sukkari, H. Ltaief, M. Faverge & D. Keyes, IEEE TPDS (2017)



• Obtain SVD from a polar decomposition:

polarsym eigen
$$A = U_p H$$
 $H = V \Sigma V^*$ 

$$\bullet A = U_p V \Sigma V^* = U \Sigma V^*$$

- QDWH iteration is a recursive divide-and-conquer method, backward stable
- Based on vendor-optimized kernels, i.e., Cholesky/QR factorizations and GEMM
- Complexity:

(10+2/3)  $n^3$  for well-conditioned system,  $43n^3$  for ill

## **QDWH-SVD**



c/o D. Sukkari & H. Ltaief (KAUST) Sukkari et al., Best papers, Europar'16 available: https://github.com/ecrc/qdwh.git

## **QDWH-SVD**



Is being integrated into Cray's LibSci w/A. Esposito (Cray) Extensions underway to Zolotarev's method w/Y. Nakatsukasa (Oxford)



Time (s)

c/o D. Sukkari & H. Ltaief (KAUST) Sukkari et al., Best papers, Europar'16 available: https://github.com/ecrc/qdwh.git

## QDWH-SVD, taskified



## **QDWH-SVD**, taskified on hybrid architecture



c/o D. Sukkari, H. Ltaief (KAUST) & M. Faverge (INRIA)

Time (s)

#### **QDWH-SVD**, taskified on various architectures



c/o D. Sukkari, H. Ltaief (KAUST) & M. Faverge (INRIA)

#### **Tile Low-rank Cholesky**

- A low-rank, but flat (not hierarchical) first step towards expanding capability for large dense symmetric problems, e.g., covariance matrices
- ♦ Reduces synchrony
- ♦ Increases SIMT-style concurrency
  - Employs OpenMP taskification pragmas and HLibPro on individual tiles

ExaGeoStat: A High Performance Unified Framework for Geostatistics on Manycore Systems S. Abdulah, H. Ltaief, Y. Sun, M. Genton & D. Keyes TDPS (2017, submitted)

#### Large dense symmetric systems arise as covariance matrices in spatial statistics

- Climate and weather applications have many measurements located regularly or irregularly in a region; prediction is needed at other locations
- Modeled as realization of Gaussian or Matérn spatial random field, with parameters to be fit
- Leads to evaluating the log-likelihood function involving a large dense (but data sparse) covariance

$$\ell(\boldsymbol{\theta}) = -\frac{1}{2} \mathbf{Z}^T \Sigma^{-1}(\boldsymbol{\theta}) \mathbf{Z} - \frac{1}{2} \log |\Sigma(\boldsymbol{\theta})|$$

#### Synthetic and practical examples

0.1 0 00 0.8 0 0.6 0 C 0 0.4 0.2 0.0 0 0.0 0.2 0.4 0.6 0.8 1.0

> 362 measured points and 38 target points irregularly distributed in unit square

Global temperature data on sphere

#### LAPACK DPOTRF

• Classical algorithm (1990s) involves BLAS L2 panel updates and BLAS L3 trailing matrix updates



#### **PLASMA/CHAMELEON DPOTRF**

• Tile algorithm (PLASMA, FLAME, 2010s) involves mostly BLAS L3 operations within tiles scheduled with a DAG



#### **Tile operations for TLR version of Cholesky**

DPOTRF: The kernel performs the Cholesky factorization of a diagonal (lower triangular) tile. It is similar to DPOTRF since the diagonal tiles are dense.

DTRSM: The operation applies an update to an off-diagonal low-rank tile of the input matrix, resulting from factorization of the diagonal tile above it and overrides it with the final elements of the output matrix: V<sub>(i,k)</sub> = V<sub>(i,k)</sub> × D<sup>-1</sup><sub>(k,k)</sub>. The operation is a triangular solve.

**DSYRK**: The kernel applies updates to a diagonal (lower triangular) tile of the input matrix, resulting from factorization of the low-rank tiles to the left of it:  $D_{(j,j)} = D_{(j,j)} - (U_{(j,k)} \times V_{(j,k)}^T) \times (U_{(j,k)} \times V_{(j,k)}^T)^T$ . The operation is a symmetric rank-k update.

DGEMM: The operation applies updates to an off-diagonal low-rank tile of the input matrix, resulting from factorization of the low-rank tiles to the left of it. The operation involves two QR factorizations, one reduced SVD (depending on the rank and/or the accuracy parameter) and two matrix-matrix multiplications.

#### **Data-sparse operations for Cholesky variants**



## Even "brute force" tilings pay off (block low-rank without hierarchy)





# Tile low-rank Cholesky, time per backsolve



On 2-socket 18-core Intel Haswell @ 2.3GHz

**OpenMP pragmas for taskification and accuracy of 10-9** 



c/o H. Ltaief & K. Akbudak (KAUST)

## **Distributed memory TLR Cholesky –** preliminary



on to notes of 2 source to core inter maswell (a

c/o H. Ltaief & K. Akbudak (KAUST)

#### **KBLAS**

- Subset of L2/L3 BLAS targeting GPU and Intel MIC
  - ♦ GEMV, SYMV, TRSM, TRMM
- Reduces communication and increases concurrency in these memory BW bound operations
- Batched BLAS for small sizes on GPUs
  - TRSM, TRMM, SYRK, POTRF, POTRS, POSV,
     TRTRI, LAUUM, POTRI, POTI
- ♦ Recursive formulation
- Employs vendor-optimized L3 BLAS underneath ACM TOMS (2016), CCPE (2016, 2017)



## Recursively defined KBLAS operations for symmetric systems

|                                   |           | 4                                                                       |          |                   |
|-----------------------------------|-----------|-------------------------------------------------------------------------|----------|-------------------|
|                                   | D. TDOM   | $\int A_1 X_1 = \alpha B_1$                                             | RecTRSM  | AI B1             |
| $TRSM : A X = \alpha B$           | RecTRSM:  | $\begin{cases} B_2 = \alpha \ B_2 - A_2 \ B_1 \end{cases}$              | GEMM     | A2 A3 B2          |
|                                   |           | $A_3 X_2 = B_2$                                                         | RecTRSM  |                   |
| $TRMM: B = \alpha \ A^T \ B$      | RecTRMM:  | $B_1 = \alpha A_1^T B_1$                                                | RecTRMM  | AI B1<br>A2 A3 B2 |
|                                   |           | $\begin{cases} B_1 = \alpha A_2^T B_2 + B_1 \end{cases}$                | GEMM     |                   |
|                                   |           | $B_2 = \alpha A_3^T B_2$                                                | RecTRMM  |                   |
| $SYRK: B = \alpha AA^T + \beta B$ |           | $B_1 = \alpha A_1 A_1^T + \beta B_1$                                    | RecSYRK  | A1 B1             |
|                                   | RecSYRK:  | $\begin{cases} B_2 = \alpha A_2 A_1^T + \beta B_2 \end{cases}$          | GEMM     | A2 B2 B3          |
|                                   |           | $B_3 = \alpha A_2 A_2^T + \beta B_3$                                    | RecSYRK  |                   |
| $POTRF: A = L L^T$                | RecPOTRF: | $(A_1 = L_1 L_1^T)$                                                     | RecPOTRF | N                 |
|                                   |           | $A_1 X = A_2$                                                           | RecTRSM  | AI                |
|                                   |           | $A_3 = -A_2 A_2^T + A_3$                                                | RecSYRK  | A2 A3             |
|                                   |           | $ \begin{cases} A_3 = -A_2 A_2^T + A_3 \\ A_3 = L_3 L_3^T \end{cases} $ | RecPOTRF |                   |
|                                   |           |                                                                         |          |                   |

c/o A. Charara & H. Ltaief (KAUST)



## **KBLAS DTRMM**



available: https://github.com/ecrc/kblas



## **KBLAS DTRSM**





# **KBLAS now in cuBLAS 8;** will be in cuBLAS 9

|                                | CUDA TOOLKIT DOCUMENTATION                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |                         |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|-------------------------|
| CUDA Toolkit v8.0              | C. Acknowledgements                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                          |                         |
| cuBLAS                         | NVIDIA would like to thank the following individuals and institutions for their contributions:                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |                         |
| ▷ 1. Introduction              |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              |                         |
| ≥ 2. Using the cuBLAS API      | <ul> <li>Portions of the SGEMM, DGEMM, CGEMM and ZGEMM library routines were written by Vasily Volkov of the University</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           | sity of                 |
| > 3. Using the CUBLASXT API    | California.                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  | 120                     |
| A. Using the cuBLAS Legacy API | <ul> <li>Portions of the SGEMM, DGEMM and ZGEMM library routines were written by Davide Barbieri of the University of<br/>Tor Vergata.</li> </ul>                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            | Rome                    |
| B. cuBLAS Fortran Bindings     |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                              | versity                 |
| C. Acknowledgements =          | <ul> <li>Portions of the DGEMM and SGEMM library routines optimized for Fermi architecture were developed by to of Tennessee. Subsequently, several other routines that are optimized for the Fermi architecture have be from these initial DGEMM and SGEMM implementations.</li> <li>The substantial optimizations of the STRSV, DTRSV, CTRSV and ZTRSV library routines were developed by of The Science and Technology Facilities Council (STRC). Subsequently, some optimizations of the STRSM CTRSM and ZTRSM have been derived from these TRSV implementations.</li> <li>Substantial optimizations of the SYMV and HEMV library routines were developed by Ahmad Abdelfattah, and Hatem Ltaief of King Abdull an University of Science and Technology (KAUST).</li> <li>Substantial optimizations of the TRMM and TRSM library routines were developed by Ali Charara, David Ke Hatem Ltaief of King Abdullah University of Science and Technology (KAUST).</li> </ul> | nan Hogg<br>M,<br>Keyes |



c/o A. Abdelfattah (ICL, KAUST'15), A. Charara & H. Ltaief (KAUST)



## **Extending KBLAS to batched execution**

- Batched BLAS workshop:
  - http://bit.ly/Batch-BLAS-2017
- Problem:
  - L2 BLAS individually of low arithmetic intensity
  - memory latency overheads
- Redesign the legacy BLAS API
  - launch thousands of small BLAS kernels simultaneously
  - increase device occupancy
  - remove API/kernel launch overheads
  - extend the recursive formulation
- Driven by scientific data-sparse applications
  - computational statistics and astronomy
  - Schur complement in sparse direct solvers and BDDC preconditioning

## **Batched operations**





c/o Jacob Kurzak (ICL, U Tennessee)

#### KBLAS Example: Batched POTRF • Nested recursion • Convert into batch of large GEMMs

.....

- Minimize data transfer
- Enhance data locality
- Increase arithmetic intensity

Recursive

**Batch POTRF** 

Recursive Batch TRSM

Recursive

Batch POTRF

h TRSM

Recursive Batch SYRK

c/o A. Charara & H. Ltaief (KAUST)



## **Batched KBLAS** performance comparisons





## **Batched KBLAS** performance comparisons





## Batched KBLAS performance



batch size.

c/o A. Charara & H. Ltaief (KAUST)

#### Conclusions

- Plenty of ideas exist to adapt or substitute for favorite solvers with methods that have:
  - reduced synchrony (in frequency and/or span)
  - higher residence on the memory hierarchy
  - greater SIMT/SIMD-style shared-memory concurrency
- Programming models and runtimes may have to be stretched to accommodate
- Everything should be on the table for trades, beyond disciplinary thresholds → "co-design"

# Hierarchical Computations on Manycore Architectures: HiCMA\*



\* appearing incrementally at https://github.com/ecrc

#### **Thanks to:**







# Thank you!



https://github.com/ecrc/

HHH

david.keyes@kaust.edu.sa