What: We are developing a tool to track variables and their values in C programs as they change at runtime. Because of the low-level, unrestricted memory access the language allows, the challenge lies in collecting accurate information without disrupting the program’s execution. Our approach combines techniques from profiling with memory tracking to analyze dynamic allocation throughout the lifetime of a program.

Why: Dynamic program analysis provides useful tools for documenting, debugging, and improving the quality of programs. An example is the Daikon tool, which detects likely invariants (logical statements) about variables in programs. Daikon uses separate front-ends to observe the execution of Java, Perl, and C programs, but the current C front-end [2] has limited functionality. The freedom that C gives programmers to control the contents of memory presents a challenge to such automated tools when determining which variables are valid and what they refer to. A robust C front-end for Daikon is important so that it can be used with the abundance of software written in that language. The source-rewriting approach employed by the current C front-end works well only for small, self-contained, and well-behaved programs because it changes the layout of a program’s data structures. We are implementing a new approach that is applicable to a larger class of programs, including those that require access to external libraries. This enhanced support will allow us to compare Daikon-based techniques with ones developed by other researchers using standard benchmarks. It will also support investigations into how dynamic invariant detection can be most useful in practice.

How: Our basic approach is to instrument the program at the binary level while it is running. We run the program under a supervision framework, such as Valgrind [3] or DynamoRIO [1], so that we can add instructions to basic blocks before they run for the first time. Specifically, we add tracing instructions at each procedure entrance and exit point. Compared to a source-rewriting strategy, this technique reduces the need for changes to the program’s build process. However, static information such as function declarations or compiler-generated debugging symbols can still be used to determine the names and types of variables to be traced.
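As a rough illustration of what entry and exit tracing records, the following C sketch places the hooks by hand. In the approach described above, a supervision framework would insert the equivalent instructions into the running binary instead. All names (`trace_record`, `trace_log`, `square`, and so on) are hypothetical and are not part of the Valgrind or DynamoRIO APIs.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical trace log; nothing here is Valgrind or DynamoRIO API. */
enum { TRACE_MAX = 64 };

typedef enum { TRACE_ENTER, TRACE_EXIT } trace_kind;

typedef struct {
    trace_kind kind;
    const char *func;  /* procedure name, e.g. taken from debugging symbols */
    const char *var;   /* variable name ("return" for the result) */
    long value;        /* value observed at this program point */
} trace_event;

trace_event trace_log[TRACE_MAX];
size_t trace_len = 0;

/* Append one observation; events are dropped once the log is full. */
void trace_record(trace_kind kind, const char *func, const char *var, long value) {
    assert(func != NULL && var != NULL);
    if (trace_len < TRACE_MAX) {
        trace_event ev = { kind, func, var, value };
        trace_log[trace_len++] = ev;
    }
}

/* Count the logged events that belong to one procedure. */
size_t trace_count(const char *func) {
    size_t n = 0;
    for (size_t i = 0; i < trace_len; i++)
        if (strcmp(trace_log[i].func, func) == 0)
            n++;
    return n;
}

/* An ordinary procedure, with hooks written by hand at the two points
   where a supervision framework would add them at run time. */
long square(long x) {
    trace_record(TRACE_ENTER, "square", "x", x);
    long result = x * x;
    trace_record(TRACE_EXIT, "square", "return", result);
    return result;
}
```

Each call to `square` appends one entry event and one exit event to `trace_log`. A sequence of such events is the kind of input an invariant detector could consume; the log format here is an assumption, not Daikon's actual trace format.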

The main challenge in developing a C language front-end is that C is a low-level language that lets the programmer manipulate memory contents in a fairly unrestricted manner. For instance, a local variable declared as an ‘int *’ might or might not contain useful information: it could be uninitialized or deallocated, and even if valid it might point either to a single integer or to an array of any size, which itself might be incompletely initialized. We need a tool that gives a view of a program based on the language’s abstract semantics, but that provides a reasonable result even at times when it would be unsafe for the program itself to use a variable. One approach is to replace pointers and arrays in the source code with “smart pointer” wrapper objects that record usage of dynamically-allocated data. Because this technique changes the layout of memory and requires recording every use of a pointer, it fails when the instrumented code must interact with uninstrumented object code in libraries. (Libraries could be accommodated with hand-written wrapper and summary functions, but this would be prohibitively cumbersome.) Since all substantial programs call external libraries, the lack of library support made a previous tool based on this technique impractical for use with realistic-sized programs.
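The layout problem with wrapper objects can be seen in a small C sketch. The `fat_int_ptr` structure below is a hypothetical stand-in for a “smart pointer”; the wrapper actually used by the earlier tool is not described here, so the fields are assumptions chosen only to show the effect.

```c
#include <assert.h>
#include <stddef.h>

/* Layout that an uninstrumented, precompiled library expects. */
struct record_plain {
    int *data;
    int count;
};

/* Hypothetical "smart pointer": the raw pointer plus bookkeeping fields. */
struct fat_int_ptr {
    int *ptr;    /* current target */
    int *base;   /* start of the allocation */
    size_t len;  /* number of elements allocated */
};

/* The same record after source rewriting replaces the pointer. */
struct record_rewritten {
    struct fat_int_ptr data;
    int count;
};

/* How many bytes the 'count' field moved as a result of the rewrite.
   A nonzero result means code compiled against record_plain would
   read the wrong bytes when handed a record_rewritten. */
size_t count_offset_shift(void) {
    size_t before = offsetof(struct record_plain, count);
    size_t after = offsetof(struct record_rewritten, count);
    assert(after >= before);  /* the wrapper only adds fields */
    return after - before;
}
```

Because `count` no longer sits where a library compiled against `record_plain` expects it, passing the rewritten record to such a library would be unsafe without a conversion step at every call.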

Although we can still use a static (source-based) technique to find variables to examine, we take a new approach for tracking the dynamically-allocated targets of pointers. Again using a dynamic supervision framework, we record which bytes of memory the program reads from and writes to. (Valgrind’s most popular tool, Memcheck, uses similar bookkeeping to help locate C memory usage bugs.) Since this approach directly examines the executable program and not the source code, it can track memory usage within precompiled libraries. The record of what memory has been used is stored separately, so it does not disrupt the default layout of data as dictated by the original program (see Figure 1). Potentially unsafe operations such as pointer arithmetic and casting pointers to incompatible types are also permitted under this approach. These operations behave as they would in the original uninstrumented program, since the tracing is concerned only with their ultimate effects on data in memory.
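A minimal sketch of such a separately stored record, assuming one tracked region and one shadow byte per program byte, might look like the following. The names (`shadow_map`, `shadow_note_write`, `shadow_written_run`) are hypothetical, and the write hook is called by hand here, whereas a supervision framework would invoke it for every store the program executes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical shadow record for one region of program memory. */
typedef struct {
    uintptr_t base;          /* address of the first tracked byte */
    size_t size;             /* number of tracked bytes */
    unsigned char *written;  /* separate allocation: 1 = byte was written */
} shadow_map;

/* Start tracking [base, base + size); returns 0 on success, -1 on failure. */
int shadow_init(shadow_map *m, const void *base, size_t size) {
    assert(m != NULL);
    m->base = (uintptr_t)base;
    m->size = size;
    m->written = calloc(size ? size : 1, 1);
    return m->written ? 0 : -1;
}

void shadow_free(shadow_map *m) {
    free(m->written);
    m->written = NULL;
    m->size = 0;
}

/* Hook for a store of len bytes at addr. Stores that do not fall
   entirely inside the tracked region are ignored in this sketch. */
void shadow_note_write(shadow_map *m, const void *addr, size_t len) {
    uintptr_t a = (uintptr_t)addr;
    if (a < m->base || len > m->size || a - m->base > m->size - len)
        return;
    memset(m->written + (a - m->base), 1, len);
}

/* Number of consecutive written bytes starting at addr
   (0 if addr is outside the tracked region). */
size_t shadow_written_run(const shadow_map *m, const void *addr) {
    uintptr_t a = (uintptr_t)addr;
    if (a < m->base || a - m->base >= m->size)
        return 0;
    size_t off = (size_t)(a - m->base);
    size_t n = 0;
    while (off + n < m->size && m->written[off + n])
        n++;
    return n;
}
```

For an `int *p` pointing into the tracked region, `shadow_written_run(&m, p) / sizeof(int)` estimates how many leading elements have been initialized and can be reported safely. The program's own data is never moved or resized, so pointer arithmetic and casts in the traced program behave exactly as before.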
