Reviewing Guidelines for Program Committee Members

I prepared these guidelines for the PLDI Program Committee when I was program chair in 2016. I am posting this lightly-edited version in the hopes that it will be useful to other program chairs. Feel free to adapt some or all of it for your own use; if you do so, I just ask that you cite this page and point me to your page in the hopes of providing an easy go-to for future program chairs and PC members. Thanks!

Reviewing Guidelines

All committee members are expected to:

  • Personally read and write reviews for all of their assigned papers. The reviews should be entirely of the reviewer’s own devising. If you want to invite someone for a supplementary review, let me know and I will handle it.
  • Write positive, detailed, and constructive reviews that will help the authors revise their papers and make them better.
  • Not seek to break double-blind reviewing by Googling (or Binging, for those of you who do that) or other means.
  • Turn your reviews in ON TIME.
  • Actively participate in all on-line discussions of the papers, and (for EPC members), participate in a teleconference to discuss PC papers prior to the physical PC meeting.

Reviews will take the following questions into account:

  • Is the paper well-motivated? What problem does it address, and is it an important problem?
  • Does the paper significantly advance the state of the art or break new ground?
  • What are the paper’s key insights?
  • What are the paper’s key scientific and technical contributions?
  • Does the paper credibly support its claimed contributions?
  • What did you learn from the paper?
  • Is the paper sufficiently clear that most PLDI attendees will be able to read and understand it?
  • Does the paper clearly establish its context with respect to prior work? Does it discuss prior work accurately and completely? Are comparisons with previous work clear and explicit?
  • Does the paper describe something that has actually been implemented? If so, has it been evaluated properly? Is it publicly available so that these results can be verified?
  • What impact is this paper likely to have (on theory & practice)?
  • Is the work of broad appeal and interest to the PLDI community?

A key part of ensuring quality reviews is making sure that papers are reviewed by experts. All reviewers will indicate specifically what the nature of their expertise is with respect to each paper, e.g., “I have written papers (X, Y, & Z) on this topic.”

Guardians: One reviewer will be appointed as a “guardian” to lead all discussions and ensure that author responses are read and addressed. The guardian will also ensure that final reviews include a summary of the online and/or PC discussion to explain decisions for acceptance/rejection.


This section is adapted from the bidding instructions from POPL 2015.

Bidding will be carried out in HotCRP. Each PC, EPC, and ERC member should enter a bid for every paper. A bid (called a review preference in HotCRP) is a combination of two things:

  • an integer between 3 and -3, inclusive, or -100 for a conflict, which indicates how much you would like to review the paper.
    • 3: I would really like to review this paper!
    • 2: I would like to review this paper a lot, but it isn’t one of my absolute top favorites.
    • 1: I would like to review this paper more than average
    • 0: I don’t care one way or the other
    • -1: I would like to review this paper less than average
    • -2: I do not want to review this paper very much at all, but doing so won’t kill me.
    • -3: I really don’t want to review this paper!
  • a letter (X, Y, Z) that indicates how much expertise you expect to have concerning a paper
    • X: I expect to be an expert reviewer for this paper. Experts should be able to understand the technical content of the paper (unless the paper is particularly poorly written) and are acutely aware of related research in the area (i.e., you have written a paper on the topic).
    • Y: I expect to be a more knowledgeable reviewer than most for this paper, because I generally follow the literature in this area.
    • Z: I am an outsider. I do not expect to have any special knowledge of the topics discussed in this paper.

Positive numbers in your review preference mean you have greater than average desire to review the paper. Negative numbers mean you have a less than average desire to review the paper. A score of −100 means you think you have a conflict. Examples:

  • A preference of 3X means you really want to review the paper a lot and expect to be an expert.
  • A preference of 2Z indicates you want to review this paper somewhat less than the one you scored a 3, but you still want to review it a lot and you expect to be an outsider.
  • A score of -3X means you really do not want to review this paper at all, but expect to be an expert on the topic.
  • A score of 0Z means you do not care very much one way or the other whether you are assigned this paper or not, and expect to be an outsider.
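The bid format described above is simple enough to check mechanically. As a minimal sketch (this is not HotCRP's own parser; `Bid`, `parse_bid`, and `CONFLICT` are hypothetical names introduced for illustration, and treating the expertise letter as optional is an assumption), a preference such as 3X or -100 could be validated like this:

```python
import re
from typing import NamedTuple, Optional

CONFLICT = -100  # score reserved for "I think I have a conflict"


class Bid(NamedTuple):
    score: int                # -3..3, or CONFLICT
    expertise: Optional[str]  # "X", "Y", "Z", or None if no letter was given


# An optionally signed integer followed by an optional expertise letter.
_BID_RE = re.compile(r"^\s*([+-]?\d+)\s*([XYZ]?)\s*$", re.IGNORECASE)


def parse_bid(text: str) -> Bid:
    """Parse a preference such as '3X', '-2y', or '-100' into a Bid."""
    # Accept the typographic minus sign (U+2212) as well as the ASCII hyphen.
    match = _BID_RE.match(text.replace("\u2212", "-"))
    if match is None:
        raise ValueError(f"not a bid: {text!r}")
    score = int(match.group(1))
    if score != CONFLICT and not -3 <= score <= 3:
        raise ValueError(f"score out of range: {score}")
    letter = match.group(2).upper() or None
    return Bid(score, letter)
```

For example, `parse_bid("2Z")` returns `Bid(score=2, expertise='Z')`, while `parse_bid("7Y")` raises `ValueError` because 7 is outside the allowed range.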

There are probably a number of ways to game this preference system. Please don’t try. For example, if you assign 20 papers a 3 and every other paper a -3, you won’t get those 20 and I won’t know which ones you really want or don’t want. (We will automatically check the distribution of scores, and if you do this, you will make the program chair unhappy.) I’d like everyone to be excited about the stack of papers they receive to review but naturally, I will have to balance that against the need to ensure papers have proper expertise assigned to them.

When bidding, you won’t have to read the entire paper (though, of course, you are free to look at any paper in depth when bidding), so you are only estimating your expertise. If, when you review a paper, you find you have made a mistake in your estimate, that is just fine, and is bound to happen from time to time. If you find yourself downgrading your expertise from an X, we might find a paper suddenly lacking expert reviewers. In such a case, feel free to alert the PC chair. I’ll see what I can do.

Entering bids in HotCRP: There is more than one way to enter bids into HotCRP. One way to begin is to go to the reviewer preferences page. There, you will see a list of all submitted papers. You may enter your preferences in the text boxes there. Alternatively, you may flip through the paper pages (use keys k and j to flip forwards and backwards through the paper pages efficiently). If you go through the papers in numeric order, flip a coin first to decide whether you will go through them back to front or front to back. You may also upload preferences from a text file; see the “Download” and “Upload” links below the paper list on your review preferences page.


PLDI 2016 will be shepherding all accepted papers. Please write your reviews taking this into account. We are doing this for a variety of reasons, including improving paper quality and letting us accept papers with flaws that can easily be fixed. Shepherding will enforce that all *minor* changes requested by reviewers are incorporated in the final paper. Making this apply to all papers means that there will be no stigma attached to having a paper shepherded.

This approach will let us accept papers, for example, that do not cite certain related papers that reviewers feel should be discussed. We can also require that authors address minor stylistic issues to enhance readability, so those kinds of things should not be deal-breakers for acceptance.

However, there is a limit to how much we can expect of the shepherding process. For example, if a paper’s evaluation is unacceptable, that is probably not something that can be salvaged during shepherding. The same is true for cases when the technical core of the paper is impenetrable.

In your reviews (either in the comments to authors, the PC, or both), please feel free to point out things that will need to be addressed during shepherding.


  1. Who is reviewing PC member submissions?
    • The new External Program Committee will be responsible for reviewing all PC submissions.
  2. What’s the role of the External Program Committee?
    • The External Program Committee is an innovation that was recently approved by the PLDI Steering Committee. Its aim is to guarantee an extremely high quality reviewing process by forming a committee composed of the leaders of our field. This approach is inspired by the standard practice in the systems community, where senior members of the community are invited to serve on a “light” Program Committee that reviews fewer papers (e.g., 10) than the “heavy” Program Committee and does not attend or participate in the physical meeting. In addition to providing a group of distinguished experts who can be counted upon to provide expert reviews across the areas of PLDI, the External Program Committee will review and make the decisions for all Program Committee submissions, making its job incredibly important.
  3. What is the role of the ERC?
    • In a departure from recent PLDIs, this year the ERC will serve primarily as a stable for obtaining expert reviews as needed, and not as a load-shedding mechanism or as a means of handling PC submissions (which now are handled by the EPC). The ERC is also going to be wider than usual, including experienced senior graduate students.
  4. How long should my reviews be?
    • You should aim for your reviews to be approximately 500 words long. HotCRP has a feature that enables searching for reviews by the number of words: you can see all of your reviews with fewer than 500 words by entering this in the search bar: “re:me:words<500”.
  5. What does “expertise” mean for the purposes of reviewing?
    • You should enter a sentence or two explaining what your expertise level is for each paper you review. The working definition of “expert” is that you have written one or more papers on the topic – you should indicate the titles and dates. Knowledgeable means that you follow the literature on this topic but may have missed recent developments.
  6. Are we doing two rounds of reviewing?
    • Yes. There will be two rounds, immediately followed by an author response period.
  7. When will authors be unblinded?
    • Only accepted papers will be unblinded to preserve the integrity of double-blind reviewing for future submissions. The author responses will also be anonymous.
  8. Can I enlist a student (or trusted colleague) to work with me on reviewing a paper?
    • Briefly, no. In more detail: every committee member is expected to write their own, independent reviews for every paper. If you believe that having a student do an expert review of a paper would be helpful, please let me know and if it is appropriate (e.g., there are no conflicts), they can always be invited as an expert reviewer. Also, please do not distribute your assignments to anyone without vetting by the chair; because of double-blind reviewing, there may be authorship conflicts you are not aware of. Just send the chair a note. Because having students write reviews is an important part of the training process, this year we are specifically opening up the ERC to senior graduate students. If you do not want a student to actually submit a review but just want them to write a “test” review (one that will not actually be sent to the authors), that can also be arranged, but again, please vet this with the chair.
  9. How should I avoid breaking double-blind reviewing when searching for related work? For instance, reading the cited previous paper that sounds like it is most technically similar, and finding a figure that’s identical to the paper I’m reviewing is a pretty strong indication about authorship.
    • Inadvertent discovery of authorship is sometimes unavoidable. Here are some ways of reducing the risk of stumbling across something and thus breaking double-blind.
      • Initially read the paper off-line and write a preliminary review assuming that the authors have done their homework properly (in terms of scholarship, citing previous work, etc.).
      • Look for related work as a matter of due diligence after writing that review, and revise if needed.
      • To avoid accidentally unblinding the authors, don’t type the title into Google.
  10. How long will the author response be? Will there be a hard limit on the number of words? (I have lots of questions I’d like the authors to answer.)
    • The author response will only have a soft limit, but reviewers are required to only read the first 600 words (roughly one page of text). Here’s the message that will be sent out to the authors.

      The authors’ response should address reviewer concerns and correct misunderstandings. In particular, respond to explicit questions by reviewers. Make it short and to the point; the conference deadline has passed. Try to stay within 600 words. You may write more words but reviewers are only required to read the first 600 words, so address your key points first.


A Guide for Session Chairs

I just sent this message as a guide to the program committee members who will be chairing sessions for PLDI 2016 (I figure it’s the first time for some of them). A few people suggested I post it, so here it is (lightly edited). Additions or other suggestions welcome.

  • Find your speakers before the session begins. You will have to talk to them about some stuff – see below.
  • Find out how to pronounce their names properly.
  • Find out if they are on the market next year – sometimes people like the advertisement that they will be graduating soon.
  • Have them check their equipment (particularly if they are using Linux…). To be on the safe side, carry a spare Mac VGA dongle – speakers forget this shockingly often. You should consider writing your name on it in Sharpie (or do what one of my students does – cover it in bright pink fingernail polish). This greatly increases the odds you will get your dongle back after the session.
  • Before each session, introduce the entire session (as in, “I am So-and-So, from Wherever University; welcome to the session on drone-based programming languages.”).
  • Before each talk, introduce each speaker. I personally recommend not reading their title, since lots of speakers are on autopilot and will just repeat everything you said. You can instead say something like “This is Foo Bar, who will be talking about verifying clown car drivers.” In fact, come to think of it, you could just say that for every talk.
  • Keep track of time. For PLDI this year, speakers get 25 minutes, and then there are 5 minutes for questions. If you have an iPad, there’s an app I have used to display time to speakers (big giant numbers, you can set it to change colors when you hit 5 min or 1 min till the end). You can of course always go old school and hold up a sheet of paper indicating when time is drawing near. I recommend doing this when there are 5 minutes left and 1 minute left. Let the speakers know you will be doing this.
  • When the speaker is done, if it hasn’t happened already, make sure everyone applauds by saying “Let’s thank our speaker” and start applauding. Then open the floor to questions.
  • COME PREPARED WITH A QUESTION. The worst thing ever is when the talk does not go well and no one has any questions for the speaker, and then: <crickets>. Read over each paper so you have at least a couple of questions planned for this eventuality. Hopefully it won’t come to this and someone will ask something, but it happens sometimes, and it’s great if you can save the day. It’s still a good idea to ask a question or two in case there are very few questions from the audience.
  • Make sure people who ask questions use the mic and state their name and affiliation.
  • You may also have to clarify the question for the speaker, repeat the question, etc. Understanding questioners can occasionally be a challenge for non-native English speakers: it’s a stressful time, and the questioners may have unfamiliar accents, etc. Be prepared to give the speaker a helping hand.
  • Be prepared to cut off a questioner. YOU ARE IN CHARGE OF THE SESSION. If a questioner won’t give up the mic and keeps asking questions and is burning time, rambling, etc., you are empowered to move on to the next questioner (e.g., by suggesting “how about we take this off-line”).
  • Hopefully this won’t be an issue you will have to deal with, but questioners who are belligerent or insulting must not be tolerated. Cut them off and report them to the program chair (me) or the general chair. I sincerely hope and expect that this will not happen, but I want you to realize you are empowered to take action immediately. You can read over SIGPLAN’s non-harassment policy, which is based on ACM’s.
  • To make sure things run smoothly, have the next speaker on deck with their laptop a minute or so before question times end. Ideally, they will be setting up while the current speaker is wrapping up questions.
  • Finally, when questions are over, say “Let’s thank our speaker again” and applaud.
  • At the end of the session, tell everyone what’s next (e.g., “next is lunch, and talks will resume at 1:30pm”).

And thanks again to all the session chairs for volunteering!


Coz: Finding code that counts with causal profiling

Nice summary of Coz.

the morning paper

Coz: Finding code that counts with causal profiling – Curtsinger & Berger 2015

update: fixed typo in paper title

Sticking to the theme of ‘understanding what our systems are doing,’ but focusing on a single process, Coz is a causal profiler. In essence, it makes the output of a profiler much more useful to you by showing you where optimisations would genuinely have a beneficial effect (which doesn’t always equate with the places programs spend the most time). Interestingly, it can also show you places where locally optimising performance will actually slow down the overall system. That might sound counter-intuitive: the Universal Scalability Law gives us some clues as to why this might be. The understanding gained from finding such locations is also very useful in optimising the application overall.

Conventional profilers rank code by its contribution to total execution time. Prominent examples include oprofile, perf, and gprof. Unfortunately, even…


Doppio Selected as SIGPLAN Research Highlight

Doppio, our work on making it possible to run general-purpose applications inside the browser, recently won two awards. At PLDI, it received the Distinguished Artifact Award. SIGPLAN, the Special Interest Group of ACM that focuses on Programming Languages, just selected Doppio as a Research Highlight. These papers are chosen by a board from across the PL community; SIGPLAN highlights are also recommended for consideration for the CACM Research Highlights section.

Below is the citation. IMHO John did an extraordinary job on the paper and the system, and I am glad to see that the community agrees!

Title: Doppio: Breaking the Browser Language Barrier
Authors: John Vilk, Emery Berger, University of Massachusetts
Venue: PLDI 2014

The authors build a JavaScript-based framework, Doppio, in which unmodified programs can be executed within a web browser. They do this by creating a runtime environment in JavaScript that supports basic services such as sockets, threading, and a filesystem that are not otherwise supported within the browser. The authors demonstrate the framework by implementing an in-browser JVM and an in-browser runtime for C++. The paper is an engineering tour de force. The paper should appeal to a wide audience because of the ubiquity of the browser (and thus the utility of their systems), and because it is broad in scope.

Washington Post, Take Down This Article!


The Washington Post just published an article from a kid claiming he graduated at the top of his class at Penn State in Computer Science but couldn’t find a job. But his description of Computer Science classes is completely disconnected from reality. Turns out, he graduated with a degree in Management Information Systems (a business degree) and not from the Penn State any reasonable person would assume, but rather a satellite campus. All this info is right on the dude’s own LinkedIn page and a previous version of the article from Sept. 2013. Washington Post, Take Down This Article!

[This was initially posted publicly on Facebook.]

Update – I wrote a Letter to the Editor of the Washington Post. They did not choose to print it, though they did partially correct the article.


Dear Editor:

A recent op-ed article by Casey Ark (“I studied computer science, not English. I still can’t find a job.”, August 31) is deceptive and misleading. Ark says he graduated at the top of his class at Penn State in Computer Science but found himself unable to find a job. All of these claims are false. An accurate headline would read “I studied business, not English. I had job opportunities, but I turned them down.”

Ark’s descriptions of his class experiences (non-rigorous, memorization-based, and non-technical) sound nothing like a Computer Science degree, and here’s why. A visit to his LinkedIn page shows that he graduated with a degree in Management Information Systems, a non-technical business degree that has little to do with Computer Science and is decidedly not a STEM (Science, Technology, Engineering, and Math) field.

Ark also fails to mention that he attended a satellite campus rather than the more prestigious flagship University Park campus of Penn State, a fact included in an earlier version of this article that appeared on PennLive in September 2013. Regardless of the satellite campus’s quality, leaving out the location leads readers to believe he graduated from the main campus.

In this earlier article, Ark describes having chosen to not take two entry-level job options, but instead deciding to become an entrepreneur.

I am surprised and chagrined that this op-ed made it through whatever fact-checking mechanisms exist at Washington Post, when a few moments with Google sufficed to discredit the central claims of the article.

Professor Emery Berger
School of Computer Science
University of Massachusetts Amherst


Professor Stephen A. Edwards
Department of Computer Science
Columbia University in the City of New York

Asst. Professor Brandon Lucia
Department of Electrical and Computer Engineering
Carnegie Mellon University

Associate Professor Daniel A. Jiménez
Department of Computer Science & Engineering
Texas A&M University

Assistant Professor David Van Horn
Department of Computer Science
University of Maryland, College Park

Assistant Professor Santosh Nagarakatte
Department of Computer Science
Rutgers, The State University of New Jersey, New Brunswick

Assistant Professor Swarat Chaudhuri
Department of Computer Science
Rice University

Associate Professor Dan Grossman
Department of Computer Science & Engineering
University of Washington

Professor Michael Hicks (B.S. Computer Science, Penn State ‘93)
Department of Computer Science
University of Maryland

Associate Professor Matthew Hertz
Department of Computer Science
Canisius College

Associate Professor Landon Cox
Department of Computer Science
Duke University

Associate Professor Benjamin Liblit (B.S. Computer Science, Penn State ‘93)
Department of Computer Sciences
University of Wisconsin–Madison

Associate Professor John Regehr
School of Computing
University of Utah

Professor Jeff Foster
Department of Computer Science
University of Maryland, College Park

Kaushik Veeraraghavan

Some comments from the Facebook thread posted by my fellow Computer Science colleagues:

Daniel Ángel Jiménez This kind of garbage causes lots of confusion. At my last job, almost all of the complaints from local industry about our CS graduates turned out to actually be about morons from the business school.

Shriram Krishnamurthi “Correction: An earlier version of this story’s headline misidentified what the author studied. It has been corrected.” They changed “engineering” to “computer science”. Thanks, WaPo!

Rob Ennals It seems that whenever I read a media article about something I actually know about, there is something fundamentally wrong with their understanding of the situation. This makes me worry about the accuracy of the information I’m getting about things I’m not knowledgeable about.

Emery Berger He laments “they’re looking for employees who can actually do things – like build iPhone apps…. I wish I’d been taught how to do those things in school, but my college had something different in mind.”

PSU offers CMPSC 475, WHICH TEACHES iOS PROGRAMMING.…/courses/C/CMPSC/475/201314SP

Tao Xie Another very important piece of information (from the original earlier post:…/heres_why_why_more_and_more…), “When I graduated from PSU’s Harrisburg campus in May, ….” This kid graduated from PSU Harrisburg Campus, **NOT** the State College campus!! There are 24 campuses of PSU. Note that the Washington Post article (carefully?) “rephrased” the above quoted sentence to be “When I graduated from Penn State a year ago, …” smh..

Stephen A. Edwards Breathtaking naivete on display in this column. I have no idea what he was studying: any CS graduate shouldn’t have any idea about the difference between advertising and marketing. His lament about all the programming languages and tools I learned were years out of date is also laughable. Of course they’re out of date: everything in CS is more-or-less instantly. The thing is to make sure you understand the basic concepts so you can learn the new stuff faster. But I really got a chuckle about his suggestion that we be more lax about academic standards and hire better businesspeople. Absolutely that will improve the quality of your education, no question.

New Scientist coverage of our AutoMan project

The New Scientist has just published an article covering our AutoMan project, which makes it possible to program with people. Full article below. Reasonably accurate, though it’s my team, not Dan’s :). Also on the project are my student Charlie Curtsinger, and my UMass colleague Andrew McGregor.


Most Influential Paper of OOPSLA 2002: “Reconsidering Custom Memory Allocation”

Our paper, Reconsidering Custom Memory Allocation, was just granted the Most Influential OOPSLA Paper award (given ten years after the paper appeared). Here’s the citation for the award.

Custom memory management is often used in systems software for the purpose of decreasing the cost of allocation and tightly controlling memory footprint of software. Until 2002, it was taken for granted that application-specific memory allocators were superior to general purpose libraries. Berger, Zorn and McKinley’s paper demonstrated through a rigorous empirical study that this assumption is not well-founded, and gave insights into the reasons why general purpose allocators can outperform handcrafted ones. The paper also stands out for the quality of its empirical methodology.

I am grateful to OOPSLA not only for the check for $333.33 :), but also for giving me the chance to publicly stand up and thank my wonderful co-authors: my excellent colleague Ben Zorn and my awesome advisor, Kathryn McKinley (both now at Microsoft Research). The original paper actually did a bit more than the citation – here’s the abstract from the original paper.

Programmers hoping to achieve performance improvements often use custom memory allocators. This in-depth study examines eight applications that use custom allocators. Surprisingly, for six of these applications, a state-of-the-art general-purpose allocator (the Lea allocator) performs as well as or better than the custom allocators. The two exceptions use regions, which deliver higher performance (improvements of up to 44%). Regions also reduce programmer burden and eliminate a source of memory leaks. However, we show that the inability of programmers to free individual objects within regions can lead to a substantial increase in memory consumption. Worse, this limitation precludes the use of regions for common programming idioms, reducing their usefulness.

We present a generalization of general-purpose and region-based allocators that we call reaps. Reaps are a combination of regions and heaps, providing a full range of region semantics with the addition of individual object deletion. We show that our implementation of reaps provides high performance, outperforming other allocators with region-like semantics. We then use a case study to demonstrate the space advantages and software engineering benefits of reaps in practice. Our results indicate that programmers needing fast regions should use reaps, and that most programmers considering custom allocators should instead use the Lea allocator.

Slight correction: they should instead use Hoard :).

Software Needs Seatbelts and Airbags

(This post is a draft version of an article slated to appear in ACM Queue.)

Finding and fixing bugs in deployed software is difficult and time-consuming: here are some alternatives.

Death, Taxes, and Bugs.

Like death and taxes, buggy code is an unfortunate fact of life. Nearly every program ships with known bugs, and probably all of them end up with bugs that are only discovered post-deployment. There are many reasons for this sad state of affairs.

Unsafe Languages.

One problem is that many applications are written in memory-unsafe languages. Variants of C, including C++ and Objective-C, are especially vulnerable to memory errors like buffer overflows and dangling pointers (use-after-free bugs). These bugs, which can lead to crashes, erroneous execution, and security vulnerabilities, are notoriously challenging to repair.

Safe Languages: No Panacea.

Writing new applications in memory-safe languages like Java instead of C/C++ would go some way towards mitigating these problems. For example, because Java uses garbage collection, Java programs are not susceptible to use-after-free bugs; similarly, because Java always performs bounds-checking, Java applications cannot suffer memory corruption due to buffer overflows.

That said, safe languages are no cure-all. Java programs can still attempt out-of-bounds accesses and null pointer dereferences; unlike their C-based counterparts, they throw an exception as soon as the error occurs instead of silently corrupting memory. The common response to these exceptions is to abort execution and print a stack trace (even to a web page!). Java is also just as vulnerable as any other language to concurrency errors like race conditions and deadlocks.
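To make the distinction concrete, here is a small sketch in Python, which stands in for Java here because both languages check every array access at run time. The names `write_slot`, `buffer`, and `neighbor` are hypothetical and exist only for illustration. The out-of-bounds write is rejected with an exception at the point of the error, and nothing else is modified; in C, the same write could silently clobber adjacent memory.

```python
def write_slot(buffer: list, index: int, value: int) -> None:
    """Store value at buffer[index]; the runtime bounds-checks the access."""
    buffer[index] = value


buffer = [0] * 4    # four valid slots: indices 0 through 3
neighbor = [7] * 4  # unrelated data that a C-style overflow might overwrite

try:
    write_slot(buffer, 8, 99)  # index 8 is past the end of the buffer
    outcome = "wrote past the end"
except IndexError as error:
    # The error surfaces immediately, at the faulty access itself.
    outcome = f"rejected: {error}"

print(outcome)
print(buffer, neighbor)  # both lists are unchanged
```

The program is still buggy, of course: the exception only changes how the bug manifests. Unless something handles it sensibly, the usual result is the aborted execution and stack trace described above.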

There are both practical and technical reasons not to use safe languages. First, it is generally not feasible to rewrite existing code because of the cost and time involved, not to mention the risk of introducing new bugs. Second, languages that rely on garbage collection are not a good fit for programs that need high performance or which make extensive use of available physical memory, since garbage collection always requires some extra memory [5]. These include OS-level services, database managers, search engines, and physics-based games.

Are Tools the Answer?

While tools can help, they too cannot catch all bugs. Static analyzers have made enormous strides in recent years, but many bugs remain out of reach. Rather than swamp developers with false positive reports, most modern static analyzers report far fewer bugs than they could. In other words, they trade false negatives (failing to report real bugs) for lower false positive rates. That makes these tools more usable, but also means that they will fail to report real bugs. Dawson Engler and his colleagues made exactly this choice for Coverity’s “unsound” static analyzer (see the Communications of the ACM article on their experiences, A Few Billion Lines of Code Later: Using Static Analysis to Find Bugs in the Real World [4]).

Testing is Good but Not Enough.

The state of the art in testing tools has also advanced dramatically in the last decade. Randomized fuzz testing can be combined with static analysis to drive tests to explore paths that lead to failure. These tools are now in the mainstream: for example, Microsoft’s Driver Verifier can test device driver code for a wide variety of problems, and now includes randomized concurrency stress testing.

But as Dijkstra famously remarked, “Program testing can be used to show the presence of bugs, but never to show their absence!” At some point, testing will fail to turn up new bugs, which will unfortunately be discovered only once the software has shipped.

Fixing Bugs: Risky (and Slow) Business.

Finding the bugs is only the first step. Once a bug is found, whether by inspection, testing, or analysis, fixing it remains a challenge. Any bug fix must be undertaken with extreme care, since any new code runs the risk of introducing yet more bugs. Developers must construct and carefully test a patch to ensure that it fixes the bug without introducing any new ones. This process can be costly and time-consuming. For example, according to Symantec, the average time between the discovery of a remotely exploitable memory error and the release of a patch for enterprise applications is 28 days [11].

Cut Bait and Ship.

At some point, it simply stops making economic sense to fix certain bugs. Tracking their source is often difficult and time-consuming, even when the full memory state and all inputs to the program are available. Obviously, show-stopper bugs must be fixed. For other bugs, the benefits of fixing them may be outweighed by the risks of creating new bugs and the costs in programmer time and in delayed deployment.

Debugging at a Distance.

Once that faulty software has been deployed, the problem of chasing down and repairing bugs becomes exponentially more difficult. Users rarely provide detailed bug reports that allow developers to reproduce the problem.

For deployed software on desktops or mobile devices, getting enough information to find a bug can be difficult. Sending an entire core file is generally impractical, especially over a mobile connection. Typically, the best one can hope for is some logging messages and a minidump consisting of a stack trace and information about thread contexts.

Even this limited information can provide valuable clues. If a particular function appears on many stack traces observed during crashes, then that function is a likely culprit. Microsoft Windows includes an application debugger (formerly “Watson”, now “Windows Error Reporting”) that is used to perform this sort of triage not only for Microsoft but also for third-party applications via Microsoft’s Winqual program. Google also has made available a cross-platform tool called Breakpad that can be used to provide similar services for any application.
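The triage idea can be sketched in a few lines. This is only an illustration of counting functions across crash reports; it is not how Windows Error Reporting or Breakpad work internally, and the rank_suspects helper and the function names in the sample traces are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def rank_suspects(stack_traces: Iterable[List[str]]) -> List[Tuple[str, int]]:
    """Count how many crash traces each function appears in, most frequent first."""
    counts: Counter = Counter()
    for trace in stack_traces:
        # Use a set so a recursive function is counted once per crash.
        counts.update(set(trace))
    # Sort by descending count, then by name so ties are ordered predictably.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical crash reports, innermost frame first.
CRASHES = [
    ["copy_pixels", "decode_image", "render_page"],
    ["copy_pixels", "decode_image", "load_plugin"],
    ["layout_text", "render_page"],
    ["copy_pixels", "save_file"],
]

for function, crashes in rank_suspects(CRASHES):
    print(f"{function}: seen in {crashes} crash(es)")
```

Here copy_pixels shows up in three of the four crashes, so it would be the first place to look.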

However, for many bugs, the kind of information that these tools provide is of limited value. For example, memory corruption errors often trigger failures millions of instructions past the point of the actual error, making stack traces useless. The same is generally true for null dereference exceptions, where the error often happens long after the null pointer was stored.

Captain’s Log: Not Enough Information.

On servers, the situation is somewhat improved. Server applications typically generate log messages, which may contain clues to why a program failed. Unfortunately, log files can be unmanageably large. Poring over logs and trying to correlate them to the source code can be extremely time-consuming. Even worse, that work may yield no useful results because the logs are incomplete; that is, there may be too few log messages, or not the right kind, to narrow down the source of a particular error. Recent work at Illinois and UC San Diego may lead to tools that address some of these problems: SherLog [12] automates the process of tracing back bugs from log messages to buggy source code paths, and LogEnhancer [13] automatically extends log messages with information to help post-crash debugging. More information on logging appears in a recent Queue article, Advances and Challenges in Log Analysis [10].

God Doesn’t Play Dice, but Your Computer Does.

Despite these advances, finding bugs has actually become harder than ever. Back when many programs were sequential it was already challenging to find bugs, but now the situation has become far worse. Multithreaded programs, asynchrony, and multiple cores are now a fact of life. Every execution of these non-deterministic programs is completely different from the last because of different timing of events and thread interleavings. This situation makes reproducing bugs impossible even with a complete log of all input events—something that would be too expensive to record in practice anyway.
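To see why thread interleavings matter, here is a small sketch that enumerates every schedule of two "threads" that each increment a shared counter in two steps. It is a model written for this post, not real threaded code; the STEPS list and the run and valid helpers are invented for the illustration.

```python
from itertools import permutations
from typing import Dict, Sequence, Tuple

Step = Tuple[str, str]  # (thread name, operation)

# Each "thread" increments a shared counter in two steps: read, then write.
STEPS = [("A", "read"), ("A", "write"), ("B", "read"), ("B", "write")]


def run(schedule: Sequence[Step]) -> int:
    """Execute one interleaving and return the final counter value."""
    counter = 0
    local: Dict[str, int] = {}  # per-thread copy of the value it read
    for thread, op in schedule:
        if op == "read":
            local[thread] = counter
        else:
            counter = local[thread] + 1
    return counter


def valid(schedule: Sequence[Step]) -> bool:
    """A schedule is valid if each thread reads before it writes."""
    return all(
        schedule.index((t, "read")) < schedule.index((t, "write"))
        for t in ("A", "B")
    )


schedules = [list(p) for p in permutations(STEPS) if valid(p)]
outcomes = {run(s) for s in schedules}
print(len(schedules), "interleavings, final values:", sorted(outcomes))
```

Only the schedules in which one thread finishes before the other starts produce the intended result; the rest silently lose an update. Which schedule you get on a real machine depends on timing.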

Bumpers, Seatbelts and Airbags.

Let’s shift gears for a moment to talk about cars (we’ll get back to talking about software in a minute). As an analogy for the situation we find ourselves in, consider when cars first came onto the scene. For years, safety was an afterthought at best. When designing new cars, the primary considerations were aesthetics and high performance (think tailfins and V-8 engines).

Eventually, traffic fatalities led legislators and car manufacturers to take safety into account. Seatbelts became required standard equipment in US cars in the late 1960s, bumpers in the 1970s, and airbags in the 1980s. Modern cars incorporate a wide range of safety features, including laminated windshields, crumple zones, and anti-lock braking systems. It is now practically unthinkable that anyone would ship a car without these essential safety features.

However, we routinely ship software with no safety measures of any kind. We are in a position similar to that of the automobile industry of the 1950s, delivering software with lots of horsepower and tailfins, complete with decorative spikes on the steering column to make sure that the user will suffer if their application crashes.

Drunk-Driving Through a Minefield.

The potent cocktail of manual memory management mixed with unchecked memory accesses makes C and C++ applications susceptible to a wide range of memory errors. These errors can cause programs to crash or produce incorrect results. Attackers are also frequently able to exploit these memory errors to gain unauthorized access to systems. Since the vast majority of objects accessed by applications are on the heap, heap-related errors are especially serious.

Numerous memory errors happen when programs incorrectly free objects. Dangling pointers arise when a heap object is freed while it is still live, leading to use-after-free bugs. Invalid frees happen when a program deallocates an object that was never returned by the allocator, for example by inadvertently freeing a stack object or an address in the middle of a heap object. Double frees occur when a heap object is deallocated multiple times without an intervening allocation. This error may at first glance seem innocuous but, in many cases, leads to heap corruption or program termination.

Other memory errors have to do with the use of allocated objects. When an object is allocated with a size that is not large enough, an out-of-bounds error can occur when the memory address to be read or written lies outside the object. Out-of-bounds writes are also known as buffer overflows. Uninitialized reads happen when a program reads memory that has never been initialized; in many cases, uninitialized memory contains data from previously-allocated objects.

Airbags for Your Applications.

Given that we know we will be shipping software with bugs and that the terrain is dangerous, it might make sense to equip it with seatbelts and airbags. What we’d like is to have both resilience and prompt corrective action for any problem that surfaces in our deployed applications.

Let’s focus on C/C++/Objective-C applications—the lion’s share of applications running on servers, desktops, and mobile platforms—and memory errors, the number one headache for applications written in these languages. Safety-equipped memory allocators can play a crucial role in helping to protect your software against crashes.

The Garbage Collection Safety Net.

The first class of errors, those that happen because of the misuse of free or delete, can be remedied directly by using garbage collection. A garbage collector reclaims only objects that it allocated, eliminating invalid frees. It reclaims objects only once there is no longer any way to reach them by traversing pointers from the “roots”: the globals and the stack. That eliminates dangling pointer errors, since by definition there can’t be any pointers around to reclaimed objects. Since it naturally reclaims each object only once, a garbage collector also eliminates double frees.
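The reachability rule can be sketched as a simple graph traversal. This is a minimal model of the idea rather than any production collector: the heap is a dictionary from object names to the names they point to, and the reachable and collect helpers are invented for the example.

```python
from typing import Dict, Iterable, List, Set


def reachable(heap: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Return every object reachable from the roots by following pointers."""
    marked: Set[str] = set()
    worklist = list(roots)
    while worklist:
        obj = worklist.pop()
        if obj in marked or obj not in heap:
            continue  # already visited, or not an object this heap allocated
        marked.add(obj)
        worklist.extend(heap[obj])
    return marked


def collect(heap: Dict[str, List[str]], roots: Iterable[str]) -> Set[str]:
    """Reclaim unreachable objects in place; return the set that was freed."""
    garbage = set(heap) - reachable(heap, roots)
    for obj in garbage:
        del heap[obj]
    return garbage


# Toy heap: a -> b is live; c and d point at each other but nothing reaches them.
HEAP = {"a": ["b"], "b": [], "c": ["d"], "d": ["c"], "e": []}
print(sorted(collect(HEAP, roots=["a"])))  # objects reclaimed
print(sorted(HEAP))                        # objects still live
```

Running collect a second time finds nothing left to reclaim, which is the sense in which a collector cannot free an object twice.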

While C and C++ were not designed with garbage collection in mind, it is possible to plug in a “conservative” garbage collector and entirely prevent free-related errors. The word “conservative” here means that because the garbage collector doesn’t necessarily know what values are pointers (since we are in C-land), it conservatively assumes that if a value looks like a pointer (it is in the right range and properly aligned), and it acts like a pointer (it only points to valid objects), then it may be a pointer.
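The "looks like a pointer, acts like a pointer" test can be written down directly. The sketch below is a simplification I made up for this post, not the Boehm-Demers-Weiser implementation: it treats a value as a pointer only if it equals the start address of an allocated object, and the addresses and helper names are hypothetical.

```python
from typing import Iterable, List, Set


def looks_like_pointer(value: int, heap_start: int, heap_end: int,
                       object_starts: Set[int], alignment: int = 8) -> bool:
    """Conservatively decide whether an arbitrary word might be a pointer."""
    in_range = heap_start <= value < heap_end    # looks like a pointer...
    aligned = value % alignment == 0             # ...and is properly aligned
    points_to_object = value in object_starts    # acts like a pointer
    return in_range and aligned and points_to_object


def conservative_roots(words: Iterable[int], heap_start: int, heap_end: int,
                       object_starts: Set[int]) -> List[int]:
    """Scan raw words (say, from the stack) and keep the possible pointers."""
    return [w for w in words
            if looks_like_pointer(w, heap_start, heap_end, object_starts)]


# Hypothetical layout: a 4 KB heap with three allocated objects.
HEAP_START, HEAP_END = 0x1000, 0x2000
OBJECT_STARTS = {0x1000, 0x1040, 0x1100}
STACK_WORDS = [42, 0x1040, 0x1044, 0x3000, 0x1080, 0x1100]

found = conservative_roots(STACK_WORDS, HEAP_START, HEAP_END, OBJECT_STARTS)
print([hex(w) for w in found])
```

Note that an ordinary integer that happens to hold the value 0x1040 would also pass the test and keep that object alive. That is the price of being conservative: the collector may retain some garbage, but it never frees something that is still in use.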

The Boehm-Demers-Weiser conservative garbage collector is an excellent choice for this purpose: it is reasonably fast and space-efficient, and can be used to directly replace memory allocators by configuring it to treat calls to free as NOPs.

Slipping Through the Net.

While garbage collectors eliminate free-related errors, they cannot help prevent the second class of memory errors: those that have to do with the misuse of allocated objects such as buffer overflows.

Runtime systems that can find buffer overflows often impose staggeringly high overheads, making them not particularly suitable for deployed code. Tools like Valgrind’s MemCheck are incredibly comprehensive and useful, but are heavyweight by design and slow execution by orders of magnitude [7].

Compiler-based approaches can reduce overhead substantially by avoiding unnecessary checks, though they entail recompiling all of an application’s code, including libraries. Google has recently made available AddressSanitizer, a combination of compiler and runtime technology that can find a number of bugs, including overflows and use-after-free bugs. While it is much faster than Valgrind, its overhead remains relatively high (around 75%), making it primarily useful for testing.

Your Program Has Encountered An Error. Goodbye, Cruel World.

All of these approaches are based on the idea that the best thing to do upon encountering an error is to abort immediately. This fail-stop behavior is certainly desirable in testing. However, it is not usually what your users want. Most application programs are not safety-critical systems, and aborting them in midstream can be an unpleasant experience for users.

Suppose you have been working on a Microsoft Word document for hours (and for some mysterious reason, auto-save has not been turned on). If Microsoft Word suddenly discovers that some error has occurred, what should it do? It could just pop up the window indicating that something terrible has happened and would you like it to send a note home to Redmond. That might be the best thing to do from a debugging perspective, but most people would prefer that Word do its damnedest to save the current document rather than fall on its sword if it discovers a possible error. In short, users generally would prefer that their applications be fault tolerant whenever possible.

Bohr versus Heisenberg.

In fact, the exact behavior users do not want is for an error to happen consistently and repeatably. In his classic 1985 article “Why do computers stop and what can be done about it”, Jim Gray drew a distinction between two kinds of bugs. The first kind are bugs that behave predictably and repeatably; that is, ones that occur every time that the program encounters the same inputs and goes through the same sequence of steps. These are Bohr bugs, by analogy with the classical atomic model where electrons circle around the nucleus in planetary-like orbits. Bohr bugs are great when debugging a program, since their repeatability makes it easier to reproduce the bug and find its root cause.

The second kind of bugs are Heisenbugs, meant to connote the inherent uncertainty in quantum mechanics, which are unpredictable and cannot be reliably reproduced. The most common Heisenbugs these days are concurrency errors, a.k.a. race conditions, which depend on the order and timing of scheduling events to appear. Heisenbugs are also often sensitive to the observer effect; attempts to find the bug by inserting debugging code or running in a debugger often disrupt the sequence of events that led to the bug, making it go away.

Jim Gray makes the point that while Bohr bugs are great for debugging, what users want are Heisenbugs. Why? Because a Bohr bug is a showstopper for the user: every time the user does the same thing, they will encounter the same bug. But with Heisenbugs, the bugs often go away when you run the program again. If a program crashes, and the problem is a Heisenbug, then running the program again is likely to work. This is a perfect match for the way users already behave on the Web. If they go to a web page and it fails to respond, they just click “refresh” and that usually solves the problem.

So one way we can make life better for users is to convert Bohr bugs into Heisenbugs, if we can figure out how to do that.

Defensive Driving with DieHard.

My graduate students at the University of Massachusetts Amherst and I, in collaboration with my colleague Ben Zorn at Microsoft Research, have been working for the past few years on ways to protect programs from bugs. The first fruit of that research is a system called DieHard that makes memory errors less likely to impact users. DieHard eliminates some errors entirely and converts the others into (rare) Heisenbugs.

To explain how DieHard works, let’s go back to the car analogy. One way to make it less likely for cars to crash into each other is for them to be spaced further apart, providing adequate braking distance in case something goes wrong. DieHard provides this “defensive driving” by taking over all memory management operations and allocating objects in a space larger than required.

This de facto padding increases the odds that a small overflow will end up in unallocated space where it can do no harm. However, DieHard doesn’t just add a fixed amount of padding between objects. That would provide great protection against overflows that are small enough, and zero protection against the others. In other words, those overflows would still be Bohr bugs.

Instead, DieHard provides probabilistic memory safety by randomly allocating objects on the heap. DieHard adaptively sizes its heap to be a bit larger than the maximum needed by the application; the default is 1/3 larger [2, 3]. DieHard allocates memory from increasingly large chunks that we call miniheaps.

By randomly allocating objects across all the miniheaps (see this diagram for a detailed view), DieHard makes many memory overflows benign, with a probability that naturally declines as the overflow increases in size and as the heap becomes full. The effect is that, in most cases when running with DieHard, a small overflow is likely to have no effect.
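A small simulation shows the difference between fixed padding and random placement. This is a toy model under stated assumptions, not DieHard's actual layout: the heap is a flat array of equal-sized slots with no size classes or miniheaps, it wraps around at the end to keep the code short, and both function names are made up.

```python
import random


def fixed_padding_is_benign(overflow_slots: int, pad_slots: int) -> bool:
    """With fixed padding the outcome is the same on every run (a Bohr bug)."""
    return overflow_slots <= pad_slots


def random_placement_benign_rate(slots: int, live: int, overflow_slots: int,
                                 trials: int = 5000, seed: int = 0) -> float:
    """Estimate how often an overflow lands only on free slots.

    Simplified model: `live` objects occupy distinct, uniformly random slots,
    and one of them overflows into the next `overflow_slots` slots.
    """
    rng = random.Random(seed)
    benign = 0
    for _ in range(trials):
        placement = rng.sample(range(slots), live)
        occupied = set(placement)
        culprit = placement[0]  # the object that overflows
        touched = ((culprit + i) % slots for i in range(1, overflow_slots + 1))
        if not any(slot in occupied for slot in touched):
            benign += 1
    return benign / trials


# A half-full heap of 256 slots; overflow sizes are measured in slots.
for k in (1, 2, 4):
    rate = random_placement_benign_rate(slots=256, live=128, overflow_slots=k)
    print(f"overflow of {k} slot(s): benign in about {rate:.0%} of runs")
```

In this model the chance of a harmless overflow is roughly (1 - fullness) raised to the overflow size, so it shrinks both as the overflow grows and as the heap fills up. With fixed padding, by contrast, the same overflow either always hurts or never does.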

DieHard’s random allocation approach also reduces the likelihood of the free-related errors that garbage collection addresses. DieHard uses bitmaps, stored outside the heap, to track allocated memory. A bit set to ’1’ indicates that a given block is in use, and ’0’ that it is available.

This use of bitmaps to manage memory eliminates the risk of double frees, since resetting a bit to zero twice is the same as resetting it once. Keeping the heap metadata separate from the data in the heap makes it impossible to inadvertently corrupt the heap itself.
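Here is a minimal sketch of a bitmap-managed heap with random slot selection. It is a toy written for this post and not DieHard's code: the BitmapHeap class, its single fixed-size region, and its choice to ignore out-of-range addresses are all assumptions made to keep the example short.

```python
import random
from typing import Optional


class BitmapHeap:
    """Toy fixed-size heap whose allocation state lives in a separate bitmap."""

    def __init__(self, slots: int, seed: Optional[int] = None) -> None:
        self.in_use = [False] * slots  # the bitmap: True means allocated
        self.rng = random.Random(seed)

    def malloc(self) -> Optional[int]:
        """Probe random slots until a free one is found; return its index."""
        if all(self.in_use):
            return None  # heap is full
        while True:
            slot = self.rng.randrange(len(self.in_use))
            if not self.in_use[slot]:
                self.in_use[slot] = True
                return slot

    def free(self, slot: int) -> None:
        """Clear the bit. Clearing an already-clear bit changes nothing."""
        if 0 <= slot < len(self.in_use):  # sketch choice: ignore bad addresses
            self.in_use[slot] = False


heap = BitmapHeap(slots=16, seed=1)
p = heap.malloc()
heap.free(p)
heap.free(p)  # double free: the bit is already 0, so the heap is unchanged
print("slots in use after double free:", sum(heap.in_use))
```

Because the bitmap is a separate structure, nothing an application writes into a slot can damage the allocator's bookkeeping.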

Most importantly, DieHard drastically reduces the risk of dangling pointer errors, which effectively go away. If the heap has one million freed objects, the chance that you will immediately reuse one that was just freed is literally one in a million. Contrast this with most allocators, which immediately reclaim freed objects. With DieHard, even after 10,000 reallocations, there is still a 99% chance that the dangled object will not be reused.
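The 99% figure follows from simple arithmetic. The sketch below assumes that each allocation picks uniformly among the free slots and that the number of free slots stays at about one million throughout; the survival_probability helper is just a name I chose for the calculation.

```python
def survival_probability(free_slots: int, allocations: int) -> float:
    """Chance that one particular freed slot is NOT reused by any of the
    next `allocations` allocations, each picking a free slot uniformly."""
    return (1.0 - 1.0 / free_slots) ** allocations


p_one = survival_probability(1_000_000, 1)
p_many = survival_probability(1_000_000, 10_000)
print(f"after 1 allocation:       {p_one:.6f}")
print(f"after 10,000 allocations: {p_many:.4f}")
```

Each allocation misses the freed slot with probability 1 - 1/1,000,000, and 10,000 such misses in a row still happen about 99% of the time.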

Because it performs its allocation in (amortized) constant time, DieHard can provide added safety with very little additional cost in performance. For example, using it in a browser results in no perceivable performance impact.

Tolerating Faults FTW with FTH.

At Microsoft Research, tests with a variant of DieHard resolved about 30% of all bugs in the Microsoft Office database, while having no perceivable impact on performance. Beginning with Windows 7, Microsoft Windows now ships with a Fault-Tolerant Heap (FTH) that was directly inspired by DieHard. Normally, applications use the default heap, but after a program crashes more than a certain number of times, the Fault-Tolerant Heap takes over. Like DieHard, the Fault-Tolerant Heap manages heap metadata separately from the heap. It also adds padding and delays allocations, though it does not provide DieHard’s probabilistic fault tolerance because it does not randomize allocations or deallocations. The Fault-Tolerant Heap approach is especially attractive because it acts like an airbag: effectively invisible and cost-free when everything is fine, but providing protection when it is needed.

Exterminating the Bugs.

Tolerating bugs is one way to improve the effective quality of deployed software. It would be even better if somehow the software could not only tolerate faults but also correct them. A follow-on to DieHard, called Exterminator, does exactly that [8, 9]. Exterminator uses a version of DieHard extended to detect errors, and uses statistical inference to compute what kind of error happened and where the error occurred. Exterminator not only can send this information back to programmers for them to repair the software, but it also automatically corrects the errors via runtime patches. For example, if it detects that a certain object was responsible for a buffer overflow of 8 bytes, it will always allocate such objects (distinguished by their call site and size) with an 8-byte pad. Exterminator can learn from the results of multiple runs or multiple users, so it could be used to proactively push out patches to prevent other users from experiencing errors it has already detected elsewhere.
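The runtime-patch idea can be sketched as a table keyed by call site and size. This is only an illustration of the padding rule described above, not Exterminator's actual patch format or its statistical inference; the PatchTable class, its method names, and the policy of keeping the largest observed overflow are assumptions.

```python
from typing import Dict, Tuple

AllocationKey = Tuple[str, int]  # (call site, requested size in bytes)


class PatchTable:
    """Maps allocation sites to the extra padding they should receive."""

    def __init__(self) -> None:
        self.pads: Dict[AllocationKey, int] = {}

    def record_overflow(self, call_site: str, size: int, overflow_bytes: int) -> None:
        """Remember the largest overflow seen for this (call site, size)."""
        key = (call_site, size)
        self.pads[key] = max(self.pads.get(key, 0), overflow_bytes)

    def padded_size(self, call_site: str, size: int) -> int:
        """Size to actually allocate: the request plus any learned pad."""
        return size + self.pads.get((call_site, size), 0)

    def merge(self, other: "PatchTable") -> None:
        """Combine patches learned from another run or another user."""
        for key, pad in other.pads.items():
            self.pads[key] = max(self.pads.get(key, 0), pad)


# Hypothetical call sites; one run observed an 8-byte overflow.
patches = PatchTable()
patches.record_overflow("parse_header", 64, 8)
print(patches.padded_size("parse_header", 64))  # padded request
print(patches.padded_size("read_body", 64))     # unaffected site
```

A table like this is small enough to ship to other users, which is what makes pushing out patches proactively plausible.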

The Future: Safer, Self-Repairing Software.

My group and others (notably Martin Rinard at MIT, Vikram Adve at Illinois, Yuanyuan Zhou at UC-San Diego, Shan Lu at Wisconsin, and Luis Ceze and Dan Grossman at Washington) have made great strides in building safety systems for other classes of errors. We have recently published work on systems that prevent concurrency errors, some of which we can eliminate automatically. Grace is a runtime system that eliminates concurrency errors for concurrent programs that use “fork-join” parallelism. It hijacks the threads library, converting threads to processes “under the hood”, and uses virtual memory mapping and protection to enforce behavior that gives the illusion of a sequential execution, even on a multicore processor [1]. Dthreads (“Deterministic Threads”) is a full replacement for the POSIX threads library that enforces deterministic execution for multithreaded code [6]. In other words, a multithreaded program running with Dthreads never has races; every execution with the same inputs generates the same outputs.

We look forward to a day in the not too distant future when such safer runtime systems are the norm. Just as we can now barely imagine cars without their myriad of safety features, we are finally adopting a similar philosophy for software. Buggy software is inevitable, and when possible, we should deploy safety systems that reduce its impact on users.


  1. E. D. Berger, T. Yang, T. Liu, and G. Novark. Grace: safe multithreaded programming for C/C++. In S. Arora and G. T. Leavens, editors, OOPSLA, pages 81–96. ACM, 2009.
  2. E. D. Berger and B. G. Zorn. DieHard: Probabilistic memory safety for unsafe languages. In Proceedings of the 2006 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 158–168, New York, NY, USA, 2006. ACM Press.
  3. E. D. Berger and B. G. Zorn. Efficient probabilistic memory safety. Technical Report UMCS TR-2007-17, Department of Computer Science, University of Massachusetts Amherst, Mar. 2007.
  4. A. Bessey, K. Block, B. Chelf, A. Chou, B. Fulton, S. Hallem, C. Henri-Gros, A. Kamsky, S. McPeak, and D. Engler. A few billion lines of code later: using static analysis to find bugs in the real world. Commun. ACM, 53(2):66–75, Feb. 2010.
  5. M. Hertz and E. D. Berger. Quantifying the performance of garbage collection vs. explicit memory management. In OOPSLA ’05: Proceedings of the 20th annual ACM SIGPLAN conference on Object oriented programming, systems, languages, and applications, pages 313–326, New York, NY, USA, 2005. ACM Press.
  6. T. Liu, C. Curtsinger, and E. D. Berger. Dthreads: Efficient Deterministic Multithreading. In Proceedings of the Twenty-Third ACM Symposium on Operating Systems Principles, SOSP ’11, pages 327–336, New York, NY, USA, 2011. ACM.
  7. N. Nethercote and J. Seward. Valgrind: A framework for heavyweight dynamic binary instrumentation. In Proceedings of 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 89–100. ACM Press, June 2007.
  8. G. Novark, E. D. Berger, and B. G. Zorn. Exterminator: automatically correcting memory errors with high probability. In Proceedings of the 2007 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 1–11, New York, NY, USA, 2007. ACM Press.
  9. G. Novark, E. D. Berger, and B. G. Zorn. Exterminator: Automatically correcting memory errors with high probability. Communications of the ACM, 51(12):87–95, 2008.
  10. A. Oliner, A. Ganapathi, and W. Xu. Advances and challenges in log analysis. Commun. ACM, 55(2):55–61, Feb. 2012.
  11. Symantec. Internet security threat report, Sept. 2006.
  12. D. Yuan, H. Mai, W. Xiong, L. Tan, Y. Zhou, and S. Pasupathy. Sherlog: error diagnosis by connecting clues from run-time logs. In Proceedings of the fifteenth edition of ASPLOS on Architectural support for programming languages and operating systems, ASPLOS ’10, pages 143–154, New York, NY, USA, 2010. ACM.
  13. D. Yuan, J. Zheng, S. Park, Y. Zhou, and S. Savage. Improving software diagnosability via log enhancement. ACM Trans. Comput. Syst., 30(1):4:1–4:28, Feb. 2012.

Emery Berger's Blog
