# Using Application-Specific Performance Models to Inform Dynamic Scheduling

Jeffrey S. Vetter
Jeremy Meredith
and many collaborators

Presented to
11th Scheduling for Large Scale Systems Workshop

18-21 May 2016, Vanderbilt University, Nashville





### **Executive Summary**

- Resource scheduling is a seminal problem for computing ... and it is becoming much more difficult
  - Scheduling has many potential solutions
    - Algorithmic, historical, application-specific, etc.
- Both architectures and applications are growing more complex
  - Trends dictate that this will get worse, not better
  - This complexity creates irregularity in computation, communication, and data movement
- We posit that application-specific performance models can inform scheduling decisions
  - The Aspen performance modeling language helps create these models
  - Two recent experiments
    - GPU offload
    - Distributed scientific workflows



### Scheduling is critical to CS ...



#### 11th Scheduling for Large Scale Systems Workshop

18-21 May 2016 Vanderbilt University, Nashville, TN (United States)




#### THE 11TH SCHEDULING FOR LARGE SCALE SYSTEMS WORKSHOP

The 11th Scheduling for Large Scale Systems Workshop will be held at Vanderbilt University in Nashville, Tennessee, May 18-May 20. This will be the eleventh edition of this workshop series after Aussois (2004), San Diego (2005), Aussois (2008), Knoxville (2009), Aussois (2010 and 2011), Pittsburgh (2012), Dagstuhl (2013), Lyon (2014) and Dagstuhl (2015).

As in the past, the workshop will be structured as a set of thematic half-day sessions, mainly focused on scheduling and algorithms for large-scale systems. In addition to the talks (about 20 minutes each), plenty of time will be left for informal discussion and exchanges.

The workshop is by invitation only and there will be no registration fee.

Fill out this poll to register for meals and the social event (which should take place in Mammoth Cave National Park):

https://framadate.org/Us5qiTviro47HrIW

#### ORGANIZING COMMITTEE

Guillaume Aupy, George Bosilca, Henri Casanova, Julien Langou, Padma Raghavan and Yves Robert



### Trends toward Exascale



### Exascale architecture targets circa 2009

2009 Exascale Challenges Workshop in San Diego

#### Attendees envisioned two possible architectural swim lanes:

1. Homogeneous many-core thin-node system
2. Heterogeneous (accelerator + CPU) fat-node system

| System attributes    | 2009     | "Pre-Exascale" |          | "Exascale"  |           |
|----------------------|----------|----------------|----------|-------------|-----------|
| System peak          | 2 PF     | 100-200 PF/s   |          | 1 Exaflop/s |           |
| Power                | 6 MW     | 15 MW          |          | 20 MW       |           |
| System memory        | 0.3 PB   | 5 PB           |          | 32–64 PB    |           |
| Storage              | 15 PB    | 150 PB         |          | 500 PB      |           |
| Node performance     | 125 GF   | 0.5 TF         | 7 TF     | 1 TF        | 10 TF     |
| Node memory BW       | 25 GB/s  | 0.1 TB/s       | 1 TB/s   | 0.4 TB/s    | 4 TB/s    |
| Node concurrency     | 12       | O(100)         | O(1,000) | O(1,000)    | O(10,000) |
| System size (nodes)  | 18,700   | 500,000        | 50,000   | 1,000,000   | 100,000   |
| Node interconnect BW | 1.5 GB/s | 150 GB/s       | 1 TB/s   | 250 GB/s    | 2 TB/s    |
| IO Bandwidth         | 0.2 TB/s | 10 TB/s        |          | 30-60 TB/s  |           |
| MTTI                 | day      | O(1 day)       |          | O(0.1 day)  |           |

### Contemporary ASCR Computing At a Glance

| System attributes     | NERSC<br>Now                               | OLCF<br>Now                             | ALCF<br>Now                | NERSC Upgrade                                                                     | OLCF Upgrade                                                    | ALCF Upgrades                                       | ALCF Upgrades                                                                       |
|-----------------------|--------------------------------------------|-----------------------------------------|----------------------------|-----------------------------------------------------------------------------------|-----------------------------------------------------------------|-----------------------------------------------------|-------------------------------------------------------------------------------------|
| Planned Installation  | Edison                                     | TITAN                                   | MIRA                       | Cori<br>2016                                                                      | Summit<br>2017-2018                                             | Theta<br>2016                                       | Aurora<br>2018-2019                                                                 |
| System peak (PF)      | 2.6                                        | 27                                      | 10                         | > 30                                                                              | 150                                                             | >8.5                                                | 180                                                                                 |
| Peak Power (MW)       | 2                                          | 9                                       | 4.8                        | < 3.7                                                                             | 10                                                              | 1.7                                                 | 13                                                                                  |
| Total system memory   | 357 TB                                     | 710TB                                   | 768TB                      | ~1 PB DDR4 + High<br>Bandwidth Memory<br>(HBM)+1.5PB<br>persistent memory         | > 1.74 PB DDR4 +<br>HBM + 2.8 PB<br>persistent memory           | >480 TB DDR4 +<br>High Bandwidth<br>Memory (HBM)    | > 7 PB High Bandwidth<br>On-Package Memory<br>Local Memory and<br>Persistent Memory |
| Node performance (TF) | 0.460                                      | 1.452                                   | 0.204                      | > 3                                                                               | > 40                                                            | > 3                                                 | > 17 times Mira                                                                     |
| Node processors       | Intel Ivy<br>Bridge                        | AMD<br>Opteron<br>Nvidia<br>Kepler      | 64-bit<br>PowerPC<br>A2    | Intel Knights Landing<br>many core CPUs<br>Intel Haswell CPU in<br>data partition | Multiple IBM<br>Power9 CPUs &<br>multiple Nvidia<br>Volta GPUs  | Intel Knights Landing<br>Xeon Phi many core<br>CPUs | Knights Hill Xeon Phi<br>many core CPUs                                             |
| System size (nodes)   | 5,600<br>nodes                             | 18,688<br>nodes                         | 49,152                     | 9,300 nodes<br>1,900 nodes in data<br>partition                                   | ~3,500 nodes                                                    | >2,500 nodes                                        | >50,000 nodes                                                                       |
| System Interconnect   | Aries                                      | Gemini                                  | 5D Torus                   | Aries                                                                             | Dual Rail<br>EDR-IB                                             | Aries                                               | 2 <sup>nd</sup> Generation Intel<br>Omni-Path Architecture                          |
| File System           | 7.6 PB<br>168 GB/s,<br>Lustre <sup>®</sup> | 32 PB<br>1 TB/s,<br>Lustre <sup>®</sup> | 26 PB<br>300 GB/s<br>GPFS™ | 28 PB<br>744 GB/s<br>Lustre <sup>®</sup>                                          | 120 PB<br>1 TB/s<br>GPFS™                                       | 10PB, 210 GB/s<br>Lustre initial                    | 150 PB<br>1 TB/s<br>Lustre <sup>®</sup>                                             |

Complexity ∝ T





### Complexity is the next major challenge!

- "Exciting" times in computer architecture
  - Heterogeneous cores
  - Multimode memory systems
  - Fused memory systems
  - I/O architectures
  - Error correction
  - Changing system balance
- Uncertainty, Ambiguity
  - How do we design future systems so that they are faster than current systems on mission applications?
    - Entirely possible that the new system will be slower than the old system!
  - How do we provide some level of performance portability for applications teams?
  - How do we understand reliability and performance problems?
- Managing complexity is our main challenge!



### Performance Prediction with Aspen



### Example Ad Hoc Model: LaTeX Equations

the communication and computation of two sheets. The expression we get for the runtime is

$$T = 2 \left[ t_c \frac{n}{p} n \log_2 n + (p-1)o + (p-2)g + \frac{n}{p} nG + L + \left( \frac{n}{p} - 1 \right) \max \left\{ (p-1)o + t_c \frac{n}{p} n \log_2 n, (p-1)g + \frac{n}{p} nG + L \right\} \right] + t_c \frac{n^2}{p^2} n \log_2 n$$
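To show how such an ad hoc expression gets used in practice, here is a minimal Python sketch that evaluates it term by term. The interpretation of the symbols is an assumption, since the slide shows only a fragment of the source paper: `n` is taken as the problem dimension, `p` as the process count, `t_c` as the cost per compute step, and `o`, `g`, `G`, `L` as LogGP-style overhead, gap, per-byte gap, and latency. The function name is hypothetical.

```python
from math import log2


def adhoc_runtime(n, p, t_c, o, g, G, L):
    """Evaluate the ad hoc runtime expression T from the slide.

    All parameter meanings are assumptions (see the lead-in text).
    """
    rows = n / p
    compute = t_c * rows * n * log2(n)        # t_c (n/p) n log2 n
    send_overhead = (p - 1) * o               # (p-1) o
    bandwidth = rows * n * G                  # (n/p) n G
    # Terms outside the max{...}: the first, non-overlapped step.
    first = compute + send_overhead + (p - 2) * g + bandwidth + L
    # Remaining (n/p - 1) steps: the slower of compute and communication.
    overlapped = (rows - 1) * max(send_overhead + compute,
                                  (p - 1) * g + bandwidth + L)
    # Trailing term: t_c (n^2/p^2) n log2 n.
    final = t_c * (n ** 2 / p ** 2) * n * log2(n)
    return 2 * (first + overlapped) + final
```

Even in this form the model is hard to audit or reuse, which is the motivation for a structured notation.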



### Example: Ad-Hoc Excel Files





### **Prediction Techniques Ranked**

|                              | Speed | Ease | Flexibility | Accuracy | Scalability |
|------------------------------|-------|------|-------------|----------|-------------|
| Ad-hoc Analytical Models     | 1     | 3    | 2           | 4        | 1           |
| Structured Analytical Models | 1     | 2    | 1           | 4        | 1           |
| Aspen                        | 1     | 1    | 1           | 4        | 1           |
| Simulation – Functional      | 3     | 2    | 2           | 3        | 3           |
| Simulation – Cycle Accurate  | 4     | 2    | 2           | 2        | 4           |
| Hardware Emulation (FPGA)    | 3     | 3    | 3           | 2        | 3           |
| Similar hardware measurement | 2     | 1    | 4           | 2        | 2           |
| Node Prototype               | 2     | 1    | 4           | 1        | 4           |
| Prototype at Scale           | 2     | 1    | 4           | 1        | 2           |
| Final System                 | -     | -    | -           | -        | -           |



### Aspen: Abstract Scalable Performance Engineering Notation

#### **Model Creation**

- Static analysis via compiler, tools
- Empirical, Historical
- Manual (for future applications)

#### **Representation in Aspen**

- Modular
- Sharable
- Composable
- Reflects program structure

#### **Model Uses**

- Interactive tools for graphs, queries
- Design space exploration
- Workload Generation
- Feedback to Runtime Systems

E.g., MD, UHPC CP 1, Lulesh, 3D FFT, CoMD, VPFFT, ...

#### Source code

```
void CalcMonotonicQGradientsForElems(Index_t p_nodelist[T_NUMELEM8],
                Real_t p_x[T_NUMNODE], Real_t p_y[T_NUMNODE], Real_t p_z[T_NUMNODE],
                Real_t p_xd[T_NUMNODE], Real_t p_yd[T_NUMNODE], Real_t p_zd[T_NUMNODE],
                Real_t p_volo[T_NUMELEM], Real_t p_vnew[T_NUMELEM],
                Real_t p_delx_zeta[T_NUMELEM], Real_t p_delv_zeta[T_NUMELEM],
                Real_t p_delx_xi[T_NUMELEM], Real_t p_delv_xi[T_NUMELEM],
                Real_t p_delx_eta[T_NUMELEM], Real_t p_delv_eta[T_NUMELEM])
{
   Index_t numElem = m_numElem;
   /* The present clause states that all arrays already reside on the accelerator. */
#pragma acc parallel loop independent present(p_vnew, p_nodelist, p_x, p_y, p_z, p_xd, \
        p_yd, p_zd, p_volo, p_delx_xi, p_delx_eta, p_delx_zeta, p_delv_xi, p_delv_eta, \
        p_delv_zeta)
   for (i = 0 ; i < numElem ; ++i ) {
      const Real_t ptiny = 1.e-36 ;
      Real_t ax, ay, az ;
      Real_t dxv, dyv, dzv ;
      /* Gather the eight node indices of element i. */
      const Index_t *elemToNode = &p_nodelist[8*i];
      Index_t n0 = elemToNode[0] ;
      Index_t n1 = elemToNode[1] ;
      Index_t n2 = elemToNode[2] ;
      Index_t n3 = elemToNode[3] ;
      Index_t n4 = elemToNode[4] ;
      Index_t n5 = elemToNode[5] ;
      Index_t n6 = elemToNode[6] ;
      Index_t n7 = elemToNode[7] ;
      Real_t x0 = p_x[n0] ;
      /* ... the excerpt ends here ... */
```

#### Aspen code

```
kernel CalcMonotonicQGradients {
  execute [numElems] {
    loads [8 * indexWordSize] from nodelist
    // Load and cache position and velocity
    loads/caching [8 * wordSize] from x
    loads/caching [8 * wordSize] from y
    loads/caching [8 * wordSize] from z
    loads/caching [8 * wordSize] from xvel
    loads/caching [8 * wordSize] from yvel
    loads/caching [8 * wordSize] from zvel
    loads [wordSize] from volo
    loads [wordSize] from vnew
    // dx, dy, etc.
    flops [90] as dp, simd
    flops [9 + 8 + 3 + 30 + 5] as dp, simd
    flops [9 + 8 + 3 + 30 + 5] as dp, simd
    stores [wordSize] to delx_xi
    // delxj and delvj
    flops [9 + 8 + 3 + 30 + 5] as dp, simd
    stores [wordSize] to delv_eta
  }
}
```



### Creating an Aspen Model



### Manual Example of LULESH

`aspen / models / lulesh / lulesh.aspen` (branch: master)

```
// lulesh.aspen
//
// An ASPEN application model for the LULESH 1.01 challenge problem. Based
// on the CUDA version of the source code found at:
// https://computation.llnl.gov/casc/ShockHydro/
//
param nTimeSteps = 1495

// Information about domain
param edgeElems = 45
param edgeNodes = edgeElems + 1

param numElems = edgeElems^3
param numNodes = edgeNodes^3

// Double precision
param wordSize = 8

// Element data
data mNodeList as Array(numElems, wordSize)
data mMatElemList as Array(numElems, wordSize)
data mNodeList as Array(8 * numElems, wordSize) // 8 nodes per element
data mlxim as Array(numElems, wordSize)
data mlxip as Array(numElems, wordSize)
data mletam as Array(numElems, wordSize)
data mletap as Array(numElems, wordSize)
data mzetam as Array(numElems, wordSize)
data mzetap as Array(numElems, wordSize)
data melemBC as Array(numElems, wordSize)
data mE as Array(numElems, wordSize)
data mP as Array(numElems, wordSize)
```

```
kernel CalcMonotonicQGradients {
    execute [numElems] {
        loads [8 * indexWordSize] from nodelist
        // Load and cache position and velocity.
        loads/caching [8 * wordSize] from x
        loads/caching [8 * wordSize] from y
        loads/caching [8 * wordSize] from z
        loads/caching [8 * wordSize] from xvel
        loads/caching [8 * wordSize] from yvel
        loads/caching [8 * wordSize] from zvel
        loads [wordSize] from volo
        loads [wordSize] from vnew
        // dx, dy, etc.
        flops [90] as dp, simd
        // delvk delxk
        flops [9 + 8 + 3 + 30 + 5] as dp, simd
        stores [wordSize] to delv_xeta
        // delxi delvi
        flops [9 + 8 + 3 + 30 + 5] as dp, simd
        stores [wordSize] to delx_xi
        // delxj and delvj
        flops [9 + 8 + 3 + 30 + 5] as dp, simd
        stores [wordSize] to delv_eta
    }
}
```
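As a rough illustration of what an Aspen kernel encodes, the resource counts of `CalcMonotonicQGradients` can be tallied by hand. The sketch below is plain Python, not Aspen tooling. `edgeElems` and `wordSize` come from the model header, while `index_word_size = 4` is an assumption because `indexWordSize` is not defined in the excerpt. It also ignores the `caching` trait and simply counts requested bytes.

```python
# Values from the lulesh.aspen header shown on the slide.
edge_elems = 45
num_elems = edge_elems ** 3
word_size = 8

# Assumption: indexWordSize is not defined in the excerpt; 4 bytes is a guess.
index_word_size = 4

# Per-element counts, read directly off the kernel body.
loads_per_elem = (8 * index_word_size      # nodelist
                  + 6 * (8 * word_size)    # x, y, z, xvel, yvel, zvel
                  + 2 * word_size)         # volo, vnew
stores_per_elem = 3 * word_size            # three single-word stores
flops_per_elem = 90 + 3 * (9 + 8 + 3 + 30 + 5)

# The execute [numElems] block repeats the body once per element.
total_flops = num_elems * flops_per_elem
total_bytes = num_elems * (loads_per_elem + stores_per_elem)
flops_per_byte = total_flops / total_bytes

print(total_flops, total_bytes, round(flops_per_byte, 2))
```

Because the counts are symbolic in the model, the same tally can be repeated for any `edgeElems` without rerunning the application.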



### **COMPASS System Overview**





### MM example generated from COMPASS

```
#include <stdlib.h>

int N = 1024;

void matmul(float *a, float *b, float *c) {
  int i, j, k;
  /* a is produced on the device (copyout); b and c are inputs (copyin). */
  #pragma acc kernels loop gang copyout(a[0:(N*N)]) \
      copyin(b[0:(N*N)],c[0:(N*N)])
  for (i = 0; i < N; i++) {
    #pragma acc loop worker
    for (j = 0; j < N; j++) {
      float sum = 0.0f;
      for (k = 0; k < N; k++) { sum += b[i*N+k] * c[k*N+j]; }
      a[i*N+j] = sum;
    }
  } //end of i loop
} //end of matmul()

int main() {
  int i;
  float *A = (float*) malloc(N*N*sizeof(float));
  float *B = (float*) malloc(N*N*sizeof(float));
  float *C = (float*) malloc(N*N*sizeof(float));
  for (i = 0; i < N*N; i++) { A[i] = 0.0F; B[i] = (float) i; C[i] = 1.0F; }
  /* Mark the region that COMPASS should turn into an Aspen model. */
  #pragma aspen modelregion label(MM)
  matmul(A,B,C);
  free(A); free(B); free(C); return 0;
} //end of main()
```

```
model MM {
  param floatS = 4; param N = 1024
  data A as Array((N*N), floatS)
  data B as Array((N*N), floatS)
  data C as Array((N*N), floatS)
  kernel matmul {
    execute matmul2_intracommIN
    { intracomm [floatS*(N*N)] to C as copyin
      intracomm [floatS*(N*N)] to B as copyin }
    map matmul2 [N] {
      map matmul3 [N] {
        iterate [N] {
          execute matmul5
          { loads [floatS] from B as stride(1)
            loads [floatS] from C; flops [2] as sp, simd }
        } //end of iterate
        execute matmul6 { stores [floatS] to A as stride(1) }
      } // end of map matmul3
    } //end of map matmul2
    execute matmul2_intracommOUT
    { intracomm [floatS*(N*N)] to A as copyout }
  } //end of kernel matmul
  kernel main { matmul() }
} //end of model MM
```
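The generated model is compact enough to expand by hand. The following Python sketch (a hypothetical helper, not part of COMPASS or Aspen) multiplies out the nested `map`/`iterate` counts of the `MM` model to get closed-form totals. It assumes each `map` and `iterate` simply repeats its body `N` times.

```python
def matmul_model_totals(N=1024, floatS=4):
    """Expand the MM model's nested map/iterate structure into totals."""
    # map matmul2 [N] x map matmul3 [N] x iterate [N]
    inner_iterations = N * N * N
    # The store in matmul6 sits outside the iterate, inside both maps.
    outer_iterations = N * N
    return {
        "flops": 2 * inner_iterations,                 # flops [2] per k step
        "load_bytes": 2 * floatS * inner_iterations,   # one load each from B and C
        "store_bytes": floatS * outer_iterations,      # one store to A per (i, j)
        "copyin_bytes": 2 * floatS * N * N,            # intracomm of B and C
        "copyout_bytes": floatS * N * N,               # intracomm of A
    }


totals = matmul_model_totals()
print(totals["flops"], totals["copyin_bytes"] + totals["copyout_bytes"])
```

The ratio of `flops` to transferred bytes grows linearly with `N`, which is the kind of fact a scheduler can exploit without running the kernel.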



### Example: LULESH (10% of 1 kernel)

- Input LULESH program: 3700 lines of C code
- Output Aspen model: 2300 lines of Aspen code

```
kernel IntegrateStressForElems
  execute [numElem_CalcVolumeForceForElems]
    loads [((1*aspen_param_int)*8)] from elemNodes as stride(1)
    loads [((1*aspen_param_double)*8)] from m_x
    loads [((1*aspen_param_double)*8)] from m_y
    loads [((1*aspen_param_double)*8)] from m_z
    loads [(1*aspen_param_double)] from determ as stride(1)
    flops [8] as dp, simd
    flops [3] as dp, simd
    flops [3] as dp, simd
    stores [(1*aspen_param_double)] as stride(0)
    flops [2] as dp, simd
    stores [(1*aspen_param_double)] as stride(0)
    flops [2] as dp, simd
    stores [(1*aspen_param_double)] as stride(0)
    flops [2] as dp, simd
    loads [(1*aspen_param_double)] as stride(0)
    stores [(1*aspen_param_double)] as stride(0)
    loads [(1*aspen_param_double)] as stride(0)
    stores [(1*aspen_param_double)] as stride(0)
    loads [(1*aspen_param_double)] as stride(0)
```



### **Model Validation**

|           | FLOPS | LOADS | STORES |
|-----------|-------|-------|--------|
| MATMUL    | 15%   | <1%   | 1%     |
| LAPLACE2D | 7%    | 0%    | <1%    |
| SRAD      | 17%   | 0%    | 0%     |
| JACOBI    | 6%    | <1%   | <1%    |
| KMEANS    | 0%    | 0%    | 8%     |
| LUD       | 5%    | 0%    | 2%     |
| BFS       | <1%   | 11%   | 0%     |
| HOTSPOT   | 0%    | 0%    | 0%     |
| LULESH    | 0%    | 0%    | 0%     |

0% means that prediction fell between measurements from optimized and unoptimized runs of the code.
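One way to read the table's error convention is sketched below in Python. The zero case follows the note above. How nonzero errors were normalized is not stated on the slide, so measuring the distance to the nearest of the two measurements, relative to that measurement, is an assumption, and the function name is hypothetical.

```python
def prediction_error(predicted, measured_optimized, measured_unoptimized):
    """Relative error of a predicted count against a measured range.

    Returns 0.0 when the prediction lies between the two measurements.
    Otherwise (assumption) returns the distance to the nearest measurement,
    relative to that measurement.
    """
    low, high = sorted((measured_optimized, measured_unoptimized))
    if low <= predicted <= high:
        return 0.0
    nearest = low if predicted < low else high
    return abs(predicted - nearest) / nearest


# A prediction inside the measured range counts as 0% error.
print(prediction_error(105, 100, 120))
```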



Figure 2: Predicted Resource Usage of LULESH versus Measured (with and without compiler optimization)

### Black Box Analytical Modeling

- In some cases, we do not have access to a white box Aspen performance model
- Using an input parameter vector and empirical results, we can develop an Aspen black box model
- User provides
  - measured runtimes with app/machine parameters
    - e.g. nAtoms, nCores
  - template Aspen model with
    - application parameters
    - unknowns to solve for
  - new machine models (if necessary)
- Modeling tool
  - Generates symbolic predictions
  - Combines with measurements to generate objective function
  - Solves for unknowns in template
  - Output: completed app model usable for predictive behavior





### Black Box Modeling Example

#### MD template model

#### CSV data file with parameters and runtimes

| nAtoms | nTimeSteps | nCores | machine | runtime |
|--------|------------|--------|---------|---------|
| 1e6    | 100        | 144    | exogeni | 384.2   |
| 1e6    | 100        | 144    | hopper  | 340.1   |
| 1e6    | 150        | 144    | hopper  | 482.9   |
#### Concrete NAMD model

```
model NAMD_Equilibrate {
    // NAMD input parameters
    param nAtoms = 1e6
    param nTimeSteps = 100

    // calculation-specific constants
    param c = 402.1
    param d = 10.95

    // NAMD application behavior
    kernel main
    {
        iterate [nTimeSteps] {
            execute {
                loads [c * nAtoms^2]
                flops [d * nAtoms]
            }
        }
    }
}
```

- nAtoms and nTimeSteps defined in template application model and CSV input data
- nCores defined in machine models and CSV input data
- solves for c and d, filling out a concrete application model for that problem
- new predictions can still vary nAtoms, nTimeSteps, and nCores
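The "solve for unknowns" step can be sketched as an ordinary least-squares fit. The Python below is a simplified stand-in for the modeling tool, not its actual implementation. It assumes the machine model reduces to two hypothetical constants, `sec_per_byte` and `sec_per_flop`, so that the template predicts `runtime = nTimeSteps * (c * nAtoms^2 * sec_per_byte + d * nAtoms * sec_per_flop)`.

```python
def fit_unknowns(rows, sec_per_byte, sec_per_flop):
    """Least-squares fit of the template unknowns c and d.

    rows: iterable of (nAtoms, nTimeSteps, measured_runtime_seconds).
    Solves the 2x2 normal equations directly (no external libraries).
    """
    s11 = s12 = s22 = b1 = b2 = 0.0
    for n_atoms, n_steps, runtime in rows:
        x1 = n_steps * n_atoms ** 2 * sec_per_byte   # multiplies c
        x2 = n_steps * n_atoms * sec_per_flop        # multiplies d
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        b1 += x1 * runtime
        b2 += x2 * runtime
    det = s11 * s22 - s12 * s12
    if det == 0:
        # Happens when every measurement uses the same nAtoms:
        # the two terms are then proportional and cannot be separated.
        raise ValueError("measurements must vary nAtoms to separate c and d")
    c = (b1 * s22 - b2 * s12) / det
    d = (s11 * b2 - s12 * b1) / det
    return c, d


# Synthetic measurements generated with c = 2 and d = 3 (not real data).
synthetic = [(1, 1, 5.0), (2, 1, 14.0), (3, 1, 27.0)]
print(fit_unknowns(synthetic, sec_per_byte=1.0, sec_per_flop=1.0))
```

Note that the three CSV rows shown on the slide all share `nAtoms = 1e6`, so under this simplified formulation they would not be enough on their own to determine both unknowns.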



### Using an Aspen Model






### View Aspen performance models as normal performance analysis output with Gprof







### Aspen Model User Queries

| Benchmark | Runtime Order           |
|-----------|-------------------------|
| BACKPROP  | `H*O + H*I`             |
| BFS       | `nodes + edges`         |
| CFD       | `nelr*ndim`             |
| CG        | `nrow + ncol`           |
| HOTSPOT   | `sim_time*rows*cols`    |
| JACOBI    | `m_size*m_size`         |
| KMEANS    | `nAttr*nClusters`       |
| LAPLACE2D | `n^2`                   |
| LUD       | `matrix_dim^3`          |
| MATMUL    | `N*M*P`                 |
| NW        | `max_cols^2`            |
| SPMUL     | `size + nonzero`        |
| SRAD      | `niter*rows*cols`       |

Table 2: Order analysis, showing Big O runtime for each benchmark in terms of its key parameters.

| Method Name                      | FLOPS/byte |
|----------------------------------|------------|
| InitStressTermsForElems          | 0.03       |
| CalcElemShapeFunctionDerivatives | 0.44       |
| SumElemFaceNormal                | 0.50       |
| CalcElemNodeNormals              | 0.15       |
| SumElemStressesToNodeForces      | 0.06       |
| IntegrateStressForElems          | 0.15       |
| CollectDomainNodesToElemNodes    | 0.00       |
| VoluDer                          | 1.50       |
| CalcElemVolumeDerivative         | 0.33       |
| CalcElemFBHourglassForce         | 0.15       |
| CalcFBHourglassForceForElems     | 0.17       |
| CalcHourglassControlForElems     | 0.19       |
| CalcVolumeForceForElems          | 0.18       |
| CalcForceForNodes                | 0.18       |
| CalcAccelerationForNodes         | 0.04       |
| ApplyAccelerationBoundaryCond    | 0.00       |
| CalcVelocityForNodes             | 0.13       |
| CalcPositionForNodes             | 0.13       |
| LagrangeNodal                    | 0.18       |
| AreaFace                         | 10.25      |
| CalcElemCharacteristicLength     | 0.44       |
| CalcElemVelocityGrandient        | 0.13       |
| CalcKinematicsForElems           | 0.24       |
| CalcLagrangeElements             | 0.24       |
| CalcMonotonicQGradientsForElems  | 0.46       |
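A FLOPS/byte query becomes actionable once it is compared with a machine's balance point. The Python sketch below is an illustrative assumption, not an Aspen feature. It uses the 2009 node figures from the exascale targets table (125 GF node performance, 25 GB/s node memory bandwidth) as defaults and labels a kernel by which resource would limit it under a simple roofline-style argument.

```python
def limiting_resource(flops_per_byte, peak_gflops=125.0, mem_bw_gbs=25.0):
    """Classify a kernel as compute- or memory-limited.

    The machine balance is peak_gflops / mem_bw_gbs (flops per byte).
    Kernels below that intensity cannot keep the floating-point units busy.
    """
    balance = peak_gflops / mem_bw_gbs
    return "compute" if flops_per_byte >= balance else "memory"


# Intensities taken from the table above.
for name, intensity in [("AreaFace", 10.25),
                        ("CalcMonotonicQGradientsForElems", 0.46),
                        ("InitStressTermsForElems", 0.03)]:
    print(name, limiting_resource(intensity))
```

Under these assumed node figures, only `AreaFace` exceeds the balance point of 5 flops/byte.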







Fig. 8: GPU Memory Usage of each Function in LULESH, where the memory usage of a function is inclusive; value for a parent function includes data accessed by its child functions in the call graph.



Fig. 7: Measured and predicted runtime of the entire LULESH program on CPU and GPU, including measured runtimes using the automatically predicted optimal target device at each size.



## Scheduling GPU Offloads with Aspen Performance Models



### Should the application offload kernel to GPU or not?

- Simply offloading all computation is not smart
- The right choice depends on the problem
  - When the computation is 'small', run it on the host CPU
  - Otherwise, send it to the GPU
  - What is the expense of moving data over PCIe (twice) and launching GPU kernels?
- Portability?
  - Need to account for performance, working set size, data transfer costs, ...
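The offload decision in the bullets above can be sketched as a comparison of two predicted times. This Python example is a minimal illustration, not the scheduler used in the experiments. The function names are hypothetical, and every machine rate in `machine` is an invented placeholder rather than a measurement.

```python
def choose_device(flops, bytes_in, bytes_out,
                  cpu_flops_per_s, gpu_flops_per_s,
                  pcie_bytes_per_s, launch_overhead_s):
    """Return ("cpu" or "gpu", predicted_seconds) for one kernel invocation."""
    cpu_time = flops / cpu_flops_per_s
    # Data crosses PCIe twice: inputs to the device, results back to the host.
    transfer_time = (bytes_in + bytes_out) / pcie_bytes_per_s
    gpu_time = launch_overhead_s + transfer_time + flops / gpu_flops_per_s
    return ("gpu", gpu_time) if gpu_time < cpu_time else ("cpu", cpu_time)


def matmul_costs(n, elem_bytes=4):
    """Resource counts for an n x n single-precision matrix multiply."""
    return {"flops": 2 * n ** 3,
            "bytes_in": 2 * elem_bytes * n * n,   # two input matrices
            "bytes_out": elem_bytes * n * n}      # one result matrix


# Hypothetical machine rates, for illustration only.
machine = {"cpu_flops_per_s": 5e10, "gpu_flops_per_s": 1e12,
           "pcie_bytes_per_s": 6e9, "launch_overhead_s": 1e-5}

for n in (16, 1024):
    print(n, choose_device(**matmul_costs(n), **machine)[0])
```

With these placeholder rates the small problem stays on the CPU because launch overhead dominates, while the large one moves to the GPU.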

Listing 1: Input OpenACC Matrix Multiplication Code

```
#include <stdlib.h>

int N = 1024;

void matmul(float *a, float *b, float *c) {
  int i, j, k;
  /* a is produced on the device (copyout); b and c are inputs (copyin). */
  #pragma acc kernels loop gang copyout(a[0:(N*N)]) \
      copyin(b[0:(N*N)],c[0:(N*N)])
  for (i = 0; i < N; i++) {
    #pragma acc loop worker
    for (j = 0; j < N; j++) {
      float sum = 0.0f;
      for (k = 0; k < N; k++) { sum += b[i*N+k] * c[k*N+j]; }
      a[i*N+j] = sum;
    }
  } //end of i loop
} //end of matmul()

int main() {
  int i;
  float *A = (float*) malloc(N*N*sizeof(float));
  float *B = (float*) malloc(N*N*sizeof(float));
  float *C = (float*) malloc(N*N*sizeof(float));
  for (i = 0; i < N*N; i++) { A[i] = 0.0F; B[i] = (float) i; C[i] = 1.0F; }
  /* Mark the region that COMPASS should turn into an Aspen model. */
  #pragma aspen modelregion label(MM)
  matmul(A,B,C);
  free(A); free(B); free(C); return 0;
} //end of main()
```





### **Process**





### **Matrix Multiply**





### LULESH: Runtime and Working Set Size Predictions



Fig. 7: Measured and predicted runtime of the entire LULESH program on CPU and GPU, including measured runtimes using the automatically predicted optimal target device at each size.



Fig. 8: GPU Memory Usage of each Function in LULESH, where the memory usage of a function is inclusive; value for a parent function includes data accessed by its child functions in the call graph.



### Using Aspen for Distributed Workflows



### Aspen allows Multiresolution Modeling



#### Scope

Wide-Area Networking, Files, Many HPC systems, and Archives

Computation, Memory, Communication, IO

Computation, Memory, Threads



### Simulation: ACME Workflows

- Accelerated Climate Modeling for Energy (ACME)
- Coupled climate models with ocean, land, atmosphere and ice
- Climatologies and diagnostics give summaries of data
- Each stage of the workflow runs the ACME model for a few timesteps, which helps keep simulations within batch queue limits
- Running on Hopper @ NERSC and Titan @ OLCF





### **Experimental Data: SNS Workflows**

- Spallation Neutron Source
- Parameter sweep of molecular dynamics and neutron scattering
- Used to identify parameters that fit experimental data from SNS
- Currently being used for real science problems
- Large runs use 20 parameter values and require ~400,000 CPU hours
- Running on Hopper @ NERSC and coming to Titan @ ORNL





### **PANORAMA** Overview





### Automatically Generate Aspen from Pegasus DAX; Use Aspen Predictions to Inform/Monitor Decisions

```
kernel main
   par
      call namd_eq_200()
      call namd_prod_200()
   seq
      call namd_eq_290()
      call namd_prod_290()
   call unpack_database()
   call ptraj_200()
   call ptraj_290()
   call sassena_incoh_200()
   call sassena_coh_200()
   call sassena_incoh_290()
   call sassena_coh_290()
```

Listing 1: Automatically generated Aspen model for sample SNS workflow.
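A workflow model built from `par` and `seq` blocks can be evaluated recursively once each task has a predicted runtime. The Python sketch below is an illustration of that idea, not the Aspen evaluator. The nesting of the example `workflow` is an assumption about how the tasks relate, the per-task runtimes are invented, and `par` is treated as having enough resources to run all branches at once.

```python
def predict_runtime(node, task_seconds):
    """Predict the runtime of a workflow tree.

    node is ("call", task_name), ("seq", [children]) or ("par", [children]).
    """
    kind, payload = node
    if kind == "call":
        return task_seconds[payload]
    child_times = [predict_runtime(child, task_seconds) for child in payload]
    if kind == "par":
        return max(child_times)   # branches overlap; the slowest one dominates
    if kind == "seq":
        return sum(child_times)   # stages run back to back
    raise ValueError(f"unknown node kind: {kind}")


# Hypothetical structure and runtimes, using task names from the listing.
workflow = ("seq", [
    ("par", [
        ("seq", [("call", "namd_eq_200"), ("call", "namd_prod_200")]),
        ("seq", [("call", "namd_eq_290"), ("call", "namd_prod_290")]),
    ]),
    ("call", "unpack_database"),
    ("par", [("call", "ptraj_200"), ("call", "ptraj_290")]),
])
task_seconds = {"namd_eq_200": 300, "namd_prod_200": 600,
                "namd_eq_290": 320, "namd_prod_290": 650,
                "unpack_database": 30, "ptraj_200": 60, "ptraj_290": 70}

print(predict_runtime(workflow, task_seconds))
```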





### Black Box Modeling of NAMD for SNS Workflow

#### MD template model



#### CSV data file with parameters and runtimes

| nAtoms | nTimeSteps | nCores | machine | runtime |
|--------|------------|--------|---------|---------|
| 1e6    | 100        | 144    | exogeni | 384.2   |
| 1e6    | 100        | 144    | hopper  | 340.1   |
| 1e6    | 150        | 144    | hopper  | 482.9   |

#### Concrete NAMD model

```
model NAMD_Equilibrate {
  // NAMD input parameters
  param nAtoms = 1e6
  param nTimeSteps = 100
  // calculation-specific constants
  param c = 402.1
  param d = 10.95
  // NAMD application behavior
  kernel main {
    iterate [nTimeSteps] {
      execute {
        loads [c * nAtoms^2]
        flops [d * nAtoms]
      }
    }
  }
}
```

- nAtoms and nTimeSteps defined in template application model and CSV input data
- nCores defined in machine models and CSV input data
- solves for c and d, filling out a concrete application model for that problem
- new predictions can still vary nAtoms, nTimeSteps, and nCores



### Aspen Model Use Case: Anomaly Detection

- Aspen models embedded into workflow Pegasus DAX model
- Evaluated in real time to compare with Pegasus monitoring data
- Highlight differences on Pegasus Dashboard

- Status:
  - Works well for individual components
  - Understanding global predictions
    - Several components not modeled: queue times, I/O subsystem
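The comparison of predictions against monitoring data could look roughly like the following. It is a minimal sketch under stated assumptions: the 25% tolerance, the task names, and all runtimes are hypothetical, and `find_anomalies` is an illustrative helper, not part of the Pegasus or Aspen APIs.

```python
# Sketch only: flags tasks whose observed runtime strays from a prediction.
# The tolerance and every number below are hypothetical.

def find_anomalies(predicted_s, observed_s, tolerance=0.25):
    """Return {task: relative deviation} for tasks outside the tolerance."""
    anomalies = {}
    for task, observed in observed_s.items():
        predicted = predicted_s.get(task)
        if predicted is None or predicted <= 0:
            continue  # no usable model for this component
        deviation = (observed - predicted) / predicted
        if abs(deviation) > tolerance:
            anomalies[task] = deviation
    return anomalies

PREDICTED_S = {"namd_eq_200": 340.0, "ptraj_200": 20.0}
OBSERVED_S = {"namd_eq_200": 352.0, "ptraj_200": 41.0, "queue_wait": 600.0}

ANOMALIES = find_anomalies(PREDICTED_S, OBSERVED_S)
```

Components without a model, such as `queue_wait` here, are skipped instead of flagged, which mirrors the status note that queue times and the I/O subsystem are not yet modeled.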





### Aspen Model Use Case: Resource Planning

- Augment Pegasus resource planner to include Aspen performance predictions
  - Extend Pegasus DAX with Aspen hooks
  - Catalog of relevant application models
  - Catalog of relevant resources
- During the planning and mapping phases, optimize the schedule by selecting the 'best' global resources for time to solution (TTS), cost, power, etc.
- Status
  - Aspen predictors completed
  - Currently augmenting Pegasus planner/mapper with callouts to Aspen predictors
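Choosing a 'best' resource from per-resource predictions can be sketched as a minimization over one objective. This is an illustration only, not the Pegasus planner: the `Candidate` class, the `choose` function, the pricing, and the power figures are hypothetical, and the two runtimes simply reuse the exogeni and hopper rows of the CSV table as stand-ins for model predictions.

```python
# Sketch only: picks the candidate resource that minimizes one objective.
# Pricing and power figures are hypothetical placeholders.
from dataclasses import dataclass

@dataclass(frozen=True)
class Candidate:
    resource: str
    predicted_runtime_s: float  # would come from a performance model
    cost_per_hour: float        # hypothetical pricing
    power_kw: float             # hypothetical average power draw

    @property
    def cost(self):
        return self.predicted_runtime_s / 3600.0 * self.cost_per_hour

    @property
    def energy_kwh(self):
        return self.predicted_runtime_s / 3600.0 * self.power_kw

OBJECTIVES = {
    "tts": lambda c: c.predicted_runtime_s,
    "cost": lambda c: c.cost,
    "energy": lambda c: c.energy_kwh,
}

def choose(candidates, objective="tts"):
    """Return the candidate with the smallest value of the chosen objective."""
    if objective not in OBJECTIVES:
        raise ValueError(f"unknown objective: {objective}")
    return min(candidates, key=OBJECTIVES[objective])

CANDIDATES = [
    Candidate("exogeni", 384.2, cost_per_hour=1.0, power_kw=30.0),
    Candidate("hopper", 340.1, cost_per_hour=2.0, power_kw=40.0),
]

FASTEST = choose(CANDIDATES, "tts")
CHEAPEST = choose(CANDIDATES, "cost")
```

With these placeholder figures the fastest and the cheapest resource differ, which is the kind of trade-off a planner would need an explicit objective to resolve.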





### Summary

- Resource scheduling is a seminal problem for computing ... and it is becoming much more difficult
  - Scheduling has many potential solutions
    - Algorithmic, historical, application-specific, etc.
- Both architectures and applications are growing more complex
  - Trends dictate that this will get worse, not better
  - This complexity creates irregularity in computation, communication, and data movement
- We posit that we can use application-specific performance models to inform scheduling decisions
  - Aspen performance modeling language helps create models
  - Two recent experiments
    - GPU offload
    - Distributed scientific workflows



### Acknowledgements



#### Contributors and Sponsors

- Future Technologies Group: http://ft.ornl.gov
- US Department of Energy Office of Science
  - DOE Vancouver Project: https://ft.ornl.gov/trac/vancouver
  - DOE Blackcomb Project: https://ft.ornl.gov/trac/blackcomb
  - DOE ExMatEx Codesign Center: http://codesign.lanl.gov
  - DOE Cesar Codesign Center: http://cesar.mcs.anl.gov/
  - DOE Exascale Efforts: http://science.energy.gov/ascr/research/computer-science/
- Scalable Heterogeneous Computing Benchmark team: http://bit.ly/shocmarx
- US National Science Foundation Keeneland Project: http://keeneland.gatech.edu
- US DARPA
- NVIDIA CUDA Center of Excellence





### PMES Workshop @ SC16

- https://j.mp/pmes2016
- @SC16
- Position papers due June 17




#### 2016 Post-Moore's Era Supercomputing (PMES) Workshop Home

Co-located with SC16 in Salt Lake City, Monday, 14 November 2016

Workshop URL: http://j.mp/pmes2016

CFP URL: http://j.mp/pmes2016cfp

Submission URL (EasyChair): http://j.mp/pmes2016submissions

Submission questions: pmes16@easychair.org

This interdisciplinary workshop is organized to explore the scientific issues, challenges, and opportunities for supercomputing beyond the scaling limits of Moore's Law, with the ultimate goal of keeping supercomputing at the forefront of computing technologies beyond the physical and conceptual limits of current systems. Continuing progress of supercomputing beyond the scaling limits of Moore's Law is likely to require a comprehensive re-thinking of technologies, ranging from innovative materials and devices, circuits, system architectures, programming systems, system software, and applications.

The workshop is designed to foster interdisciplinary dialog across the necessary spectrum of stakeholders: applications, algorithms, software, and hardware. Motivating workshop questions will include the following. "What technologies might prevail in the Post Moore's

#### News

PMES Workshop Confirmed for SC16! Submissions open for PMES Position Papers on April 17

#### Important Dates

- Submission Site Opens: 17 April 2016
- Submission Deadline: 17 June 2016
- Notification Deadline: 17 August 2016
- Workshop: 14 November 2016

In cooperation with IEEE Computer Society



