Tuesday, July 24, 2012
Here’s my take on the “How Would You Program 10,000 Cores” question. As is my usual practice, I look at the issue from an architectural perspective.
The goals, in order of importance:
- Timely results of sufficient accuracy.
- Heterogeneous environment support.
- All running hardware should be doing useful non-redundant work; otherwise hardware should be idled in “power save” mode.
- Logging and diagnostics should be accurate, fine-grained, and time alignable to support diagnostics and manual or automated tuning. Tuning includes both the hardware and software environment so that the interactions can be optimal.
Up Front Decisions

Ordering decisions by their temporal stickiness (how hard a decision is to unwind after you implement it), one gets the following:1
- Data center topology. There are three primary options that support different computational problems:
- Spatially distributed processing: e.g., integrating phone accelerometer data for traffic estimation. The sensor data rolls up into regional traffic data, various summary statistics roll up nationally/worldwide.
- 3D torus: primarily used for scientific computations requiring a great deal of interprocessor communication.
- Hierarchical tree: the topology found in most data centers. However, it’s still useful to consider specialized subsets tuned to support spatially distributed and scientific problems that exceed the capabilities of a few cores.
- Other topology considerations: in a standard configuration you can get about 2K cores per rack, and 5 racks does not constitute a real estate burden per se. However, the desire for redundancy, a.k.a. “the power is out in Virginia and Amazon’s servers are down,” may drive the solution in a different direction. Ideally these considerations are addressed at the platform level and minimally impact the programming model.
- Processor mix: general-purpose, low-power, and GPU processors can potentially be intermingled on the same “board”.
- Tools like LLVM2 allow programs to be optimized at load time rather than precompiled and wired for a specific configuration. Optimal support of this approach requires the ability to map observed program behavior to the optimal board family, but with LLVM (or equivalent) these mappings become heuristic optimizations rather than compile-time decisions.
The following three areas speak more to the expected answer and so I will deal with them in greater detail in the following sections.
- Platform capabilities: tooling to support program and data migration, failure recovery, and priority.
- Memory hierarchy: all stages from on-chip to replicated persistent storage.
- Program structure: constructs and optimizations to support maximum performance with minimum redundant effort.
Logging: In practical terms, one of the most important capabilities provided by a platform is a logging facility. Logs should be able to be easily coalesced and time aligned so that behavior can be understood, and tuning can be performed at multiple scales. A good logging facility should include not only core events but also communication events, storage events etc. Logging also provides a substrate for adaptive tuning techniques.
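As a minimal plain-Ruby sketch of what “coalesced and time aligned” could look like (the event format, sources, and names here are illustrative, not a real platform API):

```ruby
# Hypothetical sketch: coalesce per-source event logs into one
# time-aligned stream. Each event is [timestamp, source, message];
# a real facility would cover core, communication, and storage events.
def coalesce(*logs)
  logs.flatten(1).sort_by { |timestamp, _source, _message| timestamp }
end

core0 = [[3, "core0", "task start"], [9, "core0", "task done"]]
disk  = [[5, "disk",  "read 4KB"]]
net   = [[4, "net",   "send to core1"]]

coalesce(core0, disk, net).each do |t, src, msg|
  puts format("%4d %-6s %s", t, src, msg)
end
```

Once events from every source share one timeline, behavior at multiple scales (a single core, a rack, a data center) can be examined with the same tooling.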
Heuristic parameterization/modularization facility: tools must allow for the parametric specification of the process for handling loads and specialized computational situations, including:
- Spin up a new instance to keep the size of a computation under a specified resource limit, spin up a new instance if queue size grows beyond a limit, and spin down an instance if queue size stays below the limit for a significant period of time.
- Create parallel queues: including the ability to split/coalesce with bidirectional hooks to the heuristic parameterization facility.
- Constraint monitoring and enforcement
- Resources (min/max)
- Priority (time and space?)
- Memory access patterns
- Keep ready, if possible (cache even if last access time limit exceeded)
- Use once
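The queue-driven spin-up/spin-down rule above can be sketched as a per-tick decision function. This is a plain-Ruby illustration; the thresholds, the grace period, and the return shape are all made-up assumptions, not a real platform interface:

```ruby
# Hypothetical autoscaling heuristic: one decision per monitoring tick.
SPIN_UP_AT   = 100  # queue length that triggers a new instance
SPIN_DOWN_AT = 10   # queue length below which an instance is idle
GRACE_TICKS  = 5    # "significant period of time" before spinning down

# Returns [decision, new_quiet_ticks]; quiet_ticks carries the
# below-threshold duration between calls.
def scale_decision(queue_len, instances, quiet_ticks)
  if queue_len > SPIN_UP_AT
    [:spin_up, 0]
  elsif queue_len < SPIN_DOWN_AT && instances > 1
    quiet_ticks + 1 >= GRACE_TICKS ? [:spin_down, 0] : [:hold, quiet_ticks + 1]
  else
    [:hold, 0]
  end
end
```

Each monitoring tick, the platform would feed in the current queue length and instance count, act on the returned decision, and hand the returned `quiet_ticks` back on the next call.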
These annotations conceptually overlap with the parameterization/modularization facility but are geared more toward “within process” management rather than management of and across processes. They provide the capability to specify “out of band” information to the platform, allowing for better operations management. These annotations are not contracts with the platform, but hints.
There is a tension here between Automated vs. Manual support. Typically new problems, new algorithms, and new architectures push you in new directions and demand tools that provide greater manual control, since early on there is little understanding of the capabilities needed for adequate tuning, let alone how to achieve them algorithmically.
Over time, the automated tooling improves. The interactions between problem and architecture become understood and the greater analytic capability provided by improved hardware allows tuning decisions to be migrated from manual methods to the operational “autonomic” functions of the system itself.
The very old-school example of this is that, to my knowledge, no one uses the register designation for variables in C anymore.
Specific Useful annotations:
- Data flow specifications
- Persistence (process available for network wake up)
- Parent-child grouping management (as developed for X10). Note: I consider X10’s win over Fortress another example of the advantage provided by manual annotation in a cutting-edge computational environment.
- Process affinity (hierarchical).
- A priori spatial distribution of consumer processes.
- Memory access patterns3
- Keep ready
- Once only
Whether it’s rolling out a critical library patch, fixing a misbehaving component, or performing a simple upgrade, many of the processes required from the infrastructure are the same.
- Identify the items requiring upgrade (IRUs).
- Identify the items depending upon the IRUs.
- Shield the dependent items from the upgrade. Depending upon the nature of the upgrade, this might include steps that degrade system performance, e.g., security concerns might require deep packet filtering.
- Perform the upgrade.
- Test the upgraded items.
- Deploy and monitor the upgraded items.
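The second step, identifying the items depending upon the IRUs, is a transitive closure over the dependency graph. A plain-Ruby sketch, assuming a simple item-to-dependencies map (the graph representation is illustrative):

```ruby
# Hypothetical sketch: given "depends_on" edges (item => the things it
# depends on), compute every item transitively affected by the IRUs.
require "set"

def dependents_of(irus, depends_on)
  affected = Set.new(irus)
  loop do
    newly = depends_on.select do |item, deps|
      !affected.include?(item) && deps.any? { |d| affected.include?(d) }
    end.keys
    break if newly.empty?
    affected.merge(newly)
  end
  affected - Set.new(irus)  # dependent items, excluding the IRUs themselves
end

deps = { "app" => ["libssl"], "tool" => ["app"], "other" => ["libz"] }
dependents_of(["libssl"], deps)  # the set {"app", "tool"}
```

Everything in the returned set is a candidate for shielding, testing, and monitoring in the later steps.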
Maintenance is therefore critically dependent upon logging.
- When a process is initiated, all components should be noted, by designation and SHA-1 if possible.
- While a process is running, all interactions should be noted at a level sufficient to at least identify “A talks to B,” even if we don’t know the content of the communication4
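The first bullet can be sketched with nothing but the Ruby standard library (the manifest shape and log format are illustrative assumptions):

```ruby
# Hypothetical sketch: at process start, note each component by name and
# SHA-1 so later diagnostics can tell exactly what was running.
require "digest"

def component_manifest(paths)
  paths.map do |path|
    sha1 = File.file?(path) ? Digest::SHA1.file(path).hexdigest : "missing"
    { name: File.basename(path), sha1: sha1 }
  end
end

# Log the manifest once at startup:
component_manifest([__FILE__]).each do |c|
  puts "LOADED #{c[:name]} sha1=#{c[:sha1]}"
end
```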
The first few levels of memory hierarchy are on-chip and I don’t expect them to be available for tuning.
That said, all off-chip storage components are available for tuning. The two most fruitful optimizations are reducing rotating storage to a minimum (becoming standard practice) and replicating data sets and computational results to preposition them for subsequent processing. My current thinking is that the easiest way to support this is to task certain cores with the management of storage that is “close to” a particular set of consumers. This storage manager would be able to preposition data at any accessible level of the core’s storage hierarchy (perhaps by accessing services within the target core).
Along with the platform and annotation support, there is a role here for what I would term multicore lifting. This is analogous to the lifting performed by compiler optimizers when they move repetitive operations outside of a loop. In the multicore case, the lifting initiates a computation on another core as soon as the information necessary to perform that operation is available and the necessity of performing the operation is reasonably assured.
Consider the simplified program graph below
If, when the program hits the green node, the computation performed by the blue node is both expected to be necessary and completely specified, the blue node’s computation can begin.
In the diagram, the “lift height” represents time expended in evaluating the gray nodes. This height specifies the maximum distance to search for potential nodes to perform the blue task.5 This distance (time) includes all latencies, data transfer times, and any delays induced upon the main core by performing the transfer and receiving the results. If the two cores are different, the speed gains or delays must be factored in appropriately.
Note: Identifying “blue nodes” is the type of activity facilitated by such languages as Haskell and techniques such as dataflow annotations -- obviously the results of the “blue nodes” can be distributed throughout the system as necessary.
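A toy illustration of the lift, with Ruby threads standing in for cores (the names, work, and timings are made up; a real scheduler would also weigh the transfer costs described above):

```ruby
# Hypothetical sketch of multicore lifting: the blue node's work starts
# as soon as its inputs are fixed, overlaps the gray nodes, and is
# joined only when the result is actually needed.
def expensive(x)   # stands in for the blue node's computation
  sleep 0.05
  x * x
end

def lift(input)    # "green node" reached: inputs completely specified
  Thread.new { expensive(input) }
end

blue = lift(7)                 # blue node starts on another core/thread
gray = (1..100).reduce(:+)     # gray nodes proceed in the meantime
result = blue.value + gray     # block only when the blue result is due
```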
The simplest way to characterize my answer to the “how would you program 10,000 cores” problem is “I’d do everything I can to assure that they are doing something useful; otherwise, the goal is to minimize their power consumption.”
A great deal is set in place before programming starts: the nature of the cores being used, the wiring of the data center, and the number of data centers are already fixed at that point and difficult to change quickly.
Platform capabilities provide the software environment in which the program operates and represent a second “external” determinant of performance. Therefore it is necessary to provide facilities that not only can evolve as usage patterns and technology change, but also can capture the information necessary to inform those changes.
The programs themselves are fundamentally constrained by these decisions. Languages that simplify the identification of identical constructs and non-mutated values, and that minimize the requirements for explicit synchronization, form the next layer of environmental choices that impact performance but are not strictly a part of the program itself.
I have not concentrated on specific solutions to specific problems, since the domain under consideration wasn’t specified, and in my experience empirical results from simulated or actual program runs are necessary for optimization. This is therefore highly dependent upon the nature of the problem, the problem data set, and the processing environment.
1. All decisions are predicated upon the type of problems being solved. I try to cover most “reasonable” options but concentrate on those that would seem likely.↩
3. Similar in spirit to those capabilities provided by vmtouch .↩
4. I expect that there’s a whole literature on VM maintenance. However, since it’s not my area of expertise, I don’t think it’s appropriate to address it here.↩
5. Assuming the vertical axis is appropriately scaled time↩
Friday, July 20, 2012
I work with multiple projects in the same workspace. Each project uses the same library code, adding a few tweaks plus resources (large PDF files) and metadata.
To create a new project, my practice has been to duplicate an existing project and then pull it into the workspace. This process isn't completely trivial, so I thought I'd post it.
Here's the recipe:
- Copy the project folder in the Finder and rename it
- Add to workspace using these directions (right click, add file, pick .xcodeproj file)
- Rename the project as shown below
- If your appDelegate has the project name in it (per convention), it is necessary not only to use the refactoring tool to rename it, but also to search for the appDelegate name and rename it in all files that reference it via a string (main.m is the usage in my project):
int main(int argc, char *argv[]) {
    NSAutoreleasePool *pool = [NSAutoreleasePool new];
    int retVal = UIApplicationMain(argc, argv, nil, @"detroit_gary_and_some_chicagoAppDelegate");
    [pool release];
    return retVal;
}
- Change the prefix header
- Finally, to change the name of the scheme (which is how it appears in the Xcode dropdown) you have to edit the scheme directly by doing the following (from http://stackoverflow.com/questions/5346767/is-there-a-way-to-rename-an-xcode-4-scheme):
Not the easiest process, but at least it lends itself to a straightforward recipe.
It follows that best practice is to minimize the number of "project specific" names you use in the project, minimizing the number of changes. This recipe only covers the case in which you change the project name, project folder, and appDelegate.
Sunday, July 1, 2012
Upthere.com has posted an interesting problem:
When mentioning this to a (non-computational) friend, the question came up “what's a core?” -- certainly a legitimate question.
I'm posting my answer because it is short, and apparently comprehensible.
Programming a core is sort of like programming a ten-year-old personal computer (from before they went multi-core). To a first approximation, it is one processor doing one thing at a time.
The next level of complexity (still single-core) adds multiple different types of computational units to the "core". A good example is a floating-point unit. That means if A and B are integers, and C and D are floating point, you can do "A+B" in the integer unit at the same time you are doing "C+D" in the floating-point unit.
The biggest level of complexity has to do with the (multi) thousandfold speed difference between the "core" and external memory (RAM + disk). In the worst case this would have the core doing nothing 99.9% of the time.
This would be bad, so there are a lot of tricks used to keep the core busy while it is waiting for other stuff to happen. Think of this as waiting for a particular result from a database to fill in a form -- if you're clever (and lucky), you could build the whole rest of the form while you're waiting.
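The form-filling trick can be sketched in Ruby, with a thread playing the part of the slow database/memory fetch (all names and timings are illustrative):

```ruby
# Hypothetical sketch: start the slow fetch early, build the rest of the
# form while waiting, and block only when the slow field is actually due.
def slow_fetch(key)
  Thread.new do
    sleep 0.05                  # stands in for memory/disk/database latency
    "value-for-#{key}"
  end
end

pending = slow_fetch(:billing_address)            # kick off the wait early

form = { name: "Ada", email: "ada@example.com" }  # useful work meanwhile

form[:billing_address] = pending.value            # join when the field is due
```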
Tuesday, June 19, 2012
As the owner of a new Retina MacBook Pro, I've been happily surprised by its speed: the combination of Adobe Photoshop CS 6 and the new MacBook is over 100 times faster than CS 5 on a 2008-era Mac Pro (tower)! CS 6 was a major speed improvement, as is the new MacBook Pro. The combination is breathtaking.
It made me think about other massive speed bumps I've seen in my career:
- A Pixar Image Computer attached to a Symbolics Lisp machine (~1988): near-immediate warping of a 1K by 1K image -- this is back when doing a histogram of a 256x256 image took a noticeable amount of time.
- A Mac IIci running MCL (Macintosh Common Lisp): 10x faster (in floating point!) than a Symbolics 3670, at a 20th of the price (~1992)!
- A Sun Enterprise 4500 with 12 x 400 MHz SPARC processors (~1998), able to run 10 Tomcat instances and serve hundreds of users without slowing down, when our last server had trouble with one (admittedly massive) Perl/CGI script.
- The last Mac G5 tower (2005?), which felt almost as fast as the above Sun 4500 (albeit with one user, plus database, plus IDE).
- And the one that prompted this post: Photoshop on the new Retina MacBook Pro (2012): lens blur of a 10 megapixel image going from over a minute to immediate!
All of these involved software changes as well, and admittedly they are task-level benchmarks rather than processor or software benchmarks -- but tasks are what matter in the end.
Another point of note: both the first and last items on the list result from taking advantage of specialized graphics processors. That was a core change in the move from Photoshop CS 5 to Photoshop CS 6, and of course the Pixar itself was a graphics processor.
Tuesday, March 20, 2012
I came across a career categorization on an ACM site that I found particularly resonant. It breaks "software development"/IT/"computation" down into four areas:
- Career Path 1: Designing and implementing software.
- Career Path 2: Devising new ways to use computers.
- Career Path 3: Developing effective ways to solve computing problems.
- Career Path 4: Planning and managing organizational technology infrastructure.
- Technology infrastructure: DBAs, network/storage system development, architecture (ACM career path 4)
- Software engineering: How to make software that's modular, easy to ship and robust (ACM career path 1)
- Algorithm development: new techniques for doing computation (ACM career path 3)
- User experience/productivity enhancement: helping people do their jobs better/more effectively (ACM career path 2)
Tuesday, January 24, 2012
I just finished reading Eric Ries' The Lean Startup, and am happy to say that it was much better than I expected.
When I heard about the book I thought that it only applied to cloud-based web startups: the kind of shops that could readily perform A/B testing over a weekend, an approach pioneered by Google early on.
My reaction to this was predictable: that's fine when you're making simple changes to web pages, but what if you're doing some heavy technical lifting, e.g., new technology (Watson), hard technology with deep foundational requirements (Dropbox), etc.
Ries describes something more subtle. He develops a general methodology that allows quantitative evaluation of assumptions, as quickly as possible, with the least possible amount of effort.
Dropbox is one of his most striking examples:
The challenge was that it was impossible to demonstrate the working software in prototype form. The product required that they overcome significant technical hurdles; it also had an online service component that required high reliability and availability. To avoid the risk of waking up after years of development with a product nobody wanted, Drew did something unexpectedly easy: he made a video.
The goal of the video was to validate the assumption that people would actually be interested in such a product. Of course in the Dropbox case the interest was there, but the book is filled with numerous examples in which it wasn't.
Ries' core approach consists of a large scale feedback loop, shown below (taken from here):
My two core takeaways are
- MVP -- the minimum viable product, and
- Measure -- the importance of making clear, tangible predictions ahead of time.
Both ideas are important in any product that has a customer focus, whether the customers are online consumers or members of an internal department.
They are reminiscent of the techniques my group used years ago to measure uptake of product features by internal departments: if we deployed a page with a particular group in mind, we could see if they continued to use it a few weeks out. If not, we'd visit them and ask what the problem was. There wasn't any need to wait for feature requests (or to hear in a budget meeting that you weren't providing any value). We sought them out and fixed the problem, if at all possible. Monitoring at page level allowed us to make these judgments for small product increments.
MVP is simply getting something in front of people as quickly as possible -- it doesn't need to be working, doesn't need to scale, doesn't need to be fully polished, but it does need to give a feel as to why someone would want to incorporate the product into their life.
Measure is more generally important. Ries is perfectly willing to start with measurable goals that seem almost trivial. The idea being that in short order these goals can eliminate easy workarounds, myths of low hanging fruit, etc.
In one of his examples, he aimed for a revenue of a few hundred dollars a month to start. A few hundred a month sounds trivial, and certainly isn't sustainable, but after a few months he couldn't even do that.
The point is that if his goal had been the few million dollars a year he'd need for true sustainability, it would have taken him years to realize he was on the wrong track. Realistically, even in the best case, the million-dollar goal would require a few years to achieve, pushing feedback out to years rather than months.
The Lean Startup doesn't just focus on startups; it includes a number of examples from groups within large non-software companies (Procter & Gamble), established software companies (Intuit), etc., showing that a startup is more an attitude than a corporate structure.
I highly recommend this book. I don't consider it so much a Lean Startup book as an optimized-effort book: how to decide whether your efforts are wasting your time or actually taking you in the direction you want to go.
To be clear: I don't think this means that data measured at a fine-grained level is the only way of gathering feedback; e.g., it's hard to incrementally identify the best design for a large website (c.f. Goodbye, Google). However, even in these cases, an MVP is important. It is vital to ground your ideas with your target audience, sooner rather than later (even if it's with a short video of a simulation of your idealized goal).
Tuesday, January 17, 2012
I know I'm late to this, but I'll share anyway:
If you're actively developing or supporting anything with a web front end, take a look at watir for automated smoke and regression tests. The site has a "watir in 5 minutes" page, but it probably really takes more like 3, it's that well designed. Watir stands for Web Application Testing in Ruby. It is:
- Open source
- Simple to install: just a ruby library (gem)
- All of ruby is available when writing test scripts
- Well regarded
- Widely used. Here's a partial list:
Limitations:
- “W3C” web only
- No recording (recording tools are available, but not part of the core effort)
- No plug-ins (ActiveX, Flash)
Why use it:
- Widely used
- Very easy to use
- Programmer friendly
- Ruby is sane
- Terse without being obtuse
- You can debug scripts interactively using irb
For example, this screenshot shows a section of a browser window and an irb window about to execute a method to click on the browser's edit button:
Page elements are referenced by
- Type and id/displayed-text/link-destination, etc.
- or Index
- @browser.select_list(:id, 'StatusSelect')
- Index is obviously not the preferred approach, but it is occasionally necessary, e.g., for pages with four identical "submit" buttons.
On a maintainability note: I recently had to change a script from Firefox (which I'm increasingly becoming disenchanted with) to Chrome. The change only required altering two lines of code:
# @browser = Watir::Browser.new
@browser = Watir::Browser.new(:chrome)
However, I did find the behavior a bit different between Chrome and Firefox: if a button wasn't visible on the page, Chrome wouldn't click it. I never had an issue with this in Firefox.
Update: upon further investigation, it appears that this may be the behavior of newer versions of watir, rather than a Chrome compatibility issue.