Tuesday, July 24, 2012

Ten Thousand Cores


Here’s my take on the How Would You Program 10,000 Cores question. As is my usual practice, I look at the issue from an architectural perspective.

Guiding Principles

In order of importance
  • Timely results of sufficient accuracy.
  • Heterogeneous environment support.
  • All running hardware should be doing useful non-redundant work; otherwise hardware should be idled in “power save” mode.
  • Logging should be accurate, fine-grained, and time-alignable to support diagnostics and manual or automated tuning. Tuning includes both the hardware and software environment so that their interactions can be optimized.

Up Front Decisions

Ordering decisions by their temporal stickiness (how hard a decision is to unwind once implemented), one gets the following: [1]
  • Data center topology. There are three primary options that support different computational problems:
    • Spatially distributed processing: e.g., integrating phone accelerometer data for traffic estimation. The sensor data rolls up into regional traffic data, various summary statistics roll up nationally/worldwide.
    • 3D torus: primarily used for scientific computations requiring a great deal of interprocessor communication.
    • Hierarchical tree: the topology found in most data centers. However, it’s still useful to consider specialized subsets tuned to support spatially distributed and scientific problems that exceed the capabilities of a few cores.
    • Other topology considerations: in a standard configuration you can get about 2K cores per rack, and five racks do not constitute a real estate burden per se. However, the desire for redundancy, aka “the power is out in Virginia and Amazon’s servers are down,” may drive the solution in a different direction. Ideally these considerations are addressed at the platform level and minimally impact the programming model.
  • Processor mix: general-purpose, low-power, and GPU processors can potentially be intermingled on the same “board”.
    • Tools like LLVM [2] allow programs to be optimized at load time rather than precompiled and wired for a specific configuration. Optimal support of this approach requires the ability to map observed program behavior to the optimal board family, but with LLVM (or equivalent) these mappings become heuristic optimizations rather than compile-time decisions (a small sketch follows this list).
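
For instance, the mapping from observed behavior to board family could be as simple as a few thresholds over profiled counters. A minimal sketch in Python; the metric names and thresholds here are my own assumptions, not anything LLVM provides:

    def choose_board_family(profile):
        """Pick a board family from observed program behavior.

        profile: hypothetical counters from earlier runs, e.g.
        {"parallel_fraction": 0.9, "flops_per_byte": 12.0, "avg_utilization": 0.2}
        """
        if profile["parallel_fraction"] > 0.8 and profile["flops_per_byte"] > 8.0:
            return "gpu"            # wide, regular, compute-heavy work maps well to GPU boards
        if profile["avg_utilization"] < 0.3:
            return "low_power"      # mostly-idle work belongs on low-power cores
        return "general_purpose"

    print(choose_board_family({"parallel_fraction": 0.9, "flops_per_byte": 12.0, "avg_utilization": 0.7}))

The point is not the particular thresholds but that the decision is made from observed behavior at load time and can be revised, rather than being baked in at compile time.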

The following three areas speak more to the expected answer and so I will deal with them in greater detail in the following sections.
  • Platform capabilities: tooling to support program and data migration, failure recovery, priority.
  • Memory hierarchy: all stages from on-chip to replicated persistent storage.
  • Program structure: constructs and optimizations to support maximum performance with minimum redundant effort.

Platform Capabilities


In practical terms, one of the most important capabilities provided by a platform is a logging facility. Logs should be easy to coalesce and time-align so that behavior can be understood and tuning can be performed at multiple scales. A good logging facility should cover not only core events but also communication events, storage events, and so on. Logging also provides a substrate for adaptive tuning techniques.
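
As a concrete illustration, here is a minimal sketch in Python of the kind of log record that makes coalescing and time alignment easy; the field names are my own, not any particular platform’s schema:

    import json
    import time
    import uuid

    def make_event(kind, source, detail):
        """Build one log record; kind is "compute", "comm", or "storage"; source identifies the emitting core or process."""
        return {
            "event_id": str(uuid.uuid4()),   # unique id so duplicates can be dropped when logs are coalesced
            "wall_time": time.time(),        # for cross-machine alignment (assumes reasonably synced clocks)
            "mono_time": time.monotonic(),   # for precise ordering within one machine
            "kind": kind,
            "source": source,                # e.g. "rack3/board7/core12"
            "detail": detail,
        }

    # One JSON line per event: logs from many machines can be concatenated,
    # sorted by wall_time, and refined with mono_time where finer ordering matters.
    print(json.dumps(make_event("comm", "rack3/board7/core12",
                                {"peer": "rack1/board2/core4", "bytes": 4096})))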

Heuristic parameterization/modularization facility:

Tools must allow for the parametric specification of how loads and specialized computational situations are handled, including:
  • Instance management: spin up a new instance if queue size increases beyond a limit (while keeping the overall computation under a specified resource limit); spin down an instance if queue size stays below the limit for a significant period of time (see the sketch after this list).
  • Create parallel queues, including the ability to split/coalesce them, with bidirectional hooks to the heuristic parameterization facility.
  • Constraint monitoring and enforcement
    • Timing
    • Resources (min/max)
    • Priority (time and space?)
  • Memory access patterns
    • Keep ready, if possible (cache even if last access time limit exceeded)
    • Use once
  • Checkpointing/failover
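
A minimal sketch of the queue-driven spin-up/spin-down rule, in Python. The thresholds and the platform object with its spin_up/spin_down calls are hypothetical placeholders for whatever the real platform exposes:

    import time

    HIGH_WATER = 1000     # queue depth that triggers a new instance
    LOW_WATER = 100       # queue depth below which an instance may be retired
    MAX_INSTANCES = 64    # resource limit on the overall size of the computation
    QUIET_PERIOD = 300.0  # seconds the queue must stay below LOW_WATER before spinning down

    def scaling_step(platform, queue, instances, low_since):
        """One pass of the heuristic; meant to be run periodically. Returns the updated quiet-period start time."""
        now = time.time()
        depth = queue.size()
        if depth > HIGH_WATER and len(instances) < MAX_INSTANCES:
            instances.append(platform.spin_up())     # queue is backing up: add capacity
            return None                              # reset the quiet-period timer
        if depth < LOW_WATER:
            if low_since is None:
                return now                           # start timing the quiet period
            if now - low_since > QUIET_PERIOD and len(instances) > 1:
                platform.spin_down(instances.pop())  # sustained low load: shed capacity
                return now                           # restart the timer so spin-downs are gradual
            return low_since
        return None                                  # queue is in the normal band

Run periodically or on queue-depth events, this keeps capacity tracking load while the quiet period damps oscillation.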


Annotations conceptually overlap with the parameterization/modularization facility but are geared more toward “within process” management than toward management of and across processes. They provide a way to give “out of band” information to the platform to allow for better operations management. These annotations are not contracts with the platform, but hints.

There is a tension here between automated and manual support. Typically, new problems, new algorithms, and new architectures push you in new directions and demand tools that provide greater manual control, since early on there is little understanding of the capabilities needed for adequate tuning, let alone of how to achieve them algorithmically.

Over time, the automated tooling improves. The interactions between problem and architecture become understood and the greater analytic capability provided by improved hardware allows tuning decisions to be migrated from manual methods to the operational “autonomic” functions of the system itself.

The very old-school example of this is that, to my knowledge, no one uses the register keyword for variables in C anymore.

Specific useful annotations (a sketch follows the list):

  • Data flow specifications
    • Sources
    • Sinks
  • Persistence (process available for network wake up)
  • Parent-child grouping management (as developed for X10). Note: I consider X10’s win over Fortress another example of the advantage provided by manual annotation in a cutting-edge computational environment.
  • Process affinity (hierarchical).
  • A priori spatial distribution of consumer processes.
  • Memory access patterns [3]
    • Keep ready
    • Once only
    • Normal
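
One way such annotations might be surfaced to programmers is as decorators that attach hints for the platform to read. A minimal Python sketch; the hint names and the idea of stashing them on the function object are assumptions, not any particular platform’s API:

    def hint(**hints):
        """Attach out-of-band hints to a function; the platform may read them, but they are hints, not contracts."""
        def wrap(fn):
            fn.platform_hints = {**getattr(fn, "platform_hints", {}), **hints}
            return fn
        return wrap

    @hint(sources=["sensor_feed"], sinks=["regional_rollup"])  # data flow specification
    @hint(affinity="rack3/board7")                             # hierarchical process affinity
    @hint(memory_access="keep_ready")                          # keep_ready | once_only | normal
    def summarize(readings):
        return sum(readings) / len(readings)

    # A scheduler can inspect summarize.platform_hints when deciding where to place
    # the work and how to manage its data; the program runs the same either way.
    print(summarize.platform_hints)
    print(summarize([1.0, 2.0, 3.0]))

Because the hints live out of band, the platform is free to ignore, reinterpret, or override them as the automated tooling improves.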

Maintenance Considerations

Whether it’s rolling out a critical library patch, fixing a misbehaving component, or performing a simple upgrade, many of the processes required of the infrastructure are the same.

  • Identify the items requiring upgrade (IRUs).
  • Identify the items depending upon the IRUs.
  • Shield the dependent items from the upgrade. Depending upon the nature of the upgrade, this might include steps that degrade system performance, e.g., security concerns might require deep packet filtering.
  • Perform the upgrade.
  • Test the upgraded items.
  • Deploy and monitor the upgraded items.

Maintenance is therefore critically dependent upon logging.

  • When a process is initiated, all of its components should be noted, by designation and SHA-1 if possible (see the sketch below).
  • While a process is running, all interactions should be noted at a level sufficient to at least identify that “A talks to B,” even if we don’t know the content of the communication. [4]
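
A minimal sketch of the first point in Python: fingerprint every loaded component at process start. What counts as a “component” will of course depend on the platform:

    import hashlib
    import sys

    def fingerprint_components():
        """Return {module name: SHA-1 of its file} for every loaded module that has a file on disk."""
        result = {}
        for name, module in list(sys.modules.items()):
            path = getattr(module, "__file__", None)
            if not path:
                continue  # built-in or frozen module: nothing on disk to hash
            try:
                with open(path, "rb") as f:
                    result[name] = hashlib.sha1(f.read()).hexdigest()
            except OSError:
                pass      # unreadable or vanished file; skip rather than fail startup
        return result

    # Logged once at process start, this makes "exactly which components does A run?"
    # answerable when a misbehaving library or a pending patch has to be tracked down.
    print(fingerprint_components())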

Memory hierarchy

The first few levels of the memory hierarchy are on-chip, and I don’t expect them to be available for tuning.

That said, all off-chip storage components are available for tuning. The two most fruitful directions are reducing rotating storage to a minimum (becoming standard practice) and replicating data sets and computational results to preposition them for subsequent processing. My current thinking is that the easiest way to support this is to task certain cores with the management of storage that is “close to” a particular set of consumers. This storage manager would be able to preposition data at any accessible level of the core’s storage hierarchy (perhaps by accessing services within the target core).
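
A minimal sketch of that idea in Python; a core dedicated to acting as a storage manager prepositions data close to its consumers. The fetch callable, the local_store, and the scheduler’s expected-work feed are hypothetical stand-ins for the real storage services:

    class StorageManager:
        """Runs on a core dedicated to managing storage that is "close to" a set of consumers."""

        def __init__(self, fetch, local_store):
            self.fetch = fetch              # callable: dataset id -> bytes, from wherever the data currently lives
            self.local_store = local_store  # dict-like storage at some accessible level of the consumers' hierarchy

        def preposition(self, expected_work):
            """expected_work: iterable of (consumer id, dataset id) pairs the scheduler expects to run soon."""
            for consumer, dataset in expected_work:
                if dataset not in self.local_store:
                    self.local_store[dataset] = self.fetch(dataset)  # copy the data before the consumer asks

    # The scheduler tells the manager what is coming; consumers then read from
    # local_store instead of reaching back to remote or rotating storage.
    manager = StorageManager(fetch=lambda d: b"data for " + d.encode(), local_store={})
    manager.preposition([("core12", "traffic/region-7"), ("core13", "traffic/region-8")])
    print(sorted(manager.local_store))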

Program structure

Along with the platform and annotation support, there is a role here for what I would term multicore lifting. This is analogous to the lifting performed by compiler optimizers when they move repetitive operations outside of a loop. In the multicore case, the lifting initiates a computation on another core as soon as the information necessary to perform that operation is available and the necessity of performing the operation is reasonably assured.

Consider the simplified program graph below.

[Figure: 10KC optimization lift]

If, when the program hits the green node, the computation performed by the blue node is both expected to be necessary and completely specified, the blue node’s computation can begin.

In the diagram, the “lift height” represents the time expended in evaluating the gray nodes. This height specifies the maximum distance to search for potential nodes to perform the blue task. [5] This distance (time) includes all latencies, data transfer times, and any delays induced on the main core by performing the transfer and receiving the results. If the two cores are different, the speed gains or delays must be factored in appropriately.

Note: identifying “blue nodes” is the type of activity facilitated by languages such as Haskell and by techniques such as dataflow annotations; obviously the results of the “blue nodes” can be distributed throughout the system as necessary.
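
A minimal sketch of the lift itself in Python, using a process pool to stand in for “another core”; blue_node and gray_nodes are placeholders for the nodes in the diagram:

    from concurrent.futures import ProcessPoolExecutor

    def blue_node(x):
        """Expensive computation whose inputs are fully specified once the green node has run."""
        return sum(i * i for i in range(x))

    def gray_nodes(x):
        """Work the main core does between the green node and the point where blue's result is needed."""
        return x + 1

    def run(x):
        with ProcessPoolExecutor() as pool:
            # Green node: blue's inputs are known and blue is expected to be needed,
            # so lift it onto another core now rather than waiting.
            blue_future = pool.submit(blue_node, x)
            y = gray_nodes(x)                # main core keeps evaluating the gray nodes
            return y, blue_future.result()   # blue's result is (partly or fully) ready when we need it

    if __name__ == "__main__":
        print(run(100_000))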


The simplest way to characterize my answer to the “how would you program 10,000 cores” problem is: “I’d do everything I can to assure that the cores are doing something useful; otherwise, the goal is to minimize their power consumption.”

A great deal is set in place before programming starts: the nature of the cores being used, the wiring of the data center, and the number of data centers are already fixed at that point and are difficult to change quickly.

Platform capabilities provide the software environment in which the program operates and represent a second “external” determinant of performance. Therefore it is necessary to provide facilities that can not only evolve as usage patterns and technology change but also capture the information needed to inform those changes.

The programs themselves are fundamentally constrained by these decisions. Languages that simplify the identification of identical constructs and non-mutated values, and that minimize the need for explicit synchronization, form the next layer of environmental choices that affect performance but are not strictly a part of the program itself.

I have not concentrated on specific solutions to specific problems, since the domain under consideration wasn’t specified, and in my experience empirical results from simulated or actual program runs are necessary for optimization. That optimization is highly dependent upon the nature of the problem, the problem data set, and the processing environment.

[1] All decisions are predicated upon the type of problems being solved. I try to cover most “reasonable” options but concentrate on those that seem most likely.
[2] http://llvm.org/
[3] Similar in spirit to the capabilities provided by vmtouch.
[4] I expect that there’s a whole literature on VM maintenance. However, since it’s not my area of expertise, I don’t think it’s appropriate to address it here.
[5] Assuming the vertical axis is appropriately scaled time.
