CHESS is an automated concurrency testing tool developed my Microsoft for finding and reproducing Heisenbugs in the concurrent programs. “Heisenbugs” are bugs resulted from unexpected interference among threads, which are extremely difficult to reproduce and eliminate. This type of bugs may show up in systems that may otherwise have been running stable for months and their cause may be as simple as adding some extra debug statements.
When Chess is attached to a concurrent program, it takes complete control over the scheduling of threads and asynchronous events and captures all the interleaving non determinism in the program. This provides a way to find all the execution paths results in an error and can reproduce. The challenge here is not only to integrate CHESS with complex APIs but also to make sure that it does not introduce extra behavior that is not related to the concurrent program.
The most important component of CHESS, the scheduler and the primary goal of the CHESS scheduler is to capture all the nondeterminism during a program execution. This allows the scheduler to later reproduce a chosen concurrent execution by replaying these nondeterministic choices. It is implemented by redirecting calls to concurrency primitives to alternate implementations from a wrapper library. Including an implementation of these primitives as part of the program is a design choice that may simplify the schedule which now needs to understand only simpler system primitives that the lock implementation uses. However the tradeoff with this approach is that it does not allow the scheduler to expose all the nondeterminism in the lock implementation. Another design decision meant to simplify the implementation of the scheduler is to abstract the execution of the concurrent program using Lamport’s happens-before graph, which is a graph that captures the relative execution order of threads in a concurrent program. This approach offers a common framework for reasoning about all synchronization primitives used in the program which abstracts away the timing of instructions in the execution.
One more important piece of chess is dealing with data races uses single threaded execution. This may slow down the execution a bit but it can be compensated by running multiple instances in parallel, each exploring a different part of the interleaving state space.
I think this is very good tool to capture bugs that are uncaught by stress testing and it is also tracking where the bugs are accruing in the execution and this make the tool very valuable. It will be nice to try out the tool in future.
Subscribe to:
Post Comments (Atom)
No comments:
Post a Comment