How is Provenance Processed?
By documenting the executable provenance of each module and the workflow itself as workflow provenance, any workflow application can become a mechanism for capturing processing provenance. We have used a combination of an executable provenance XSD with LONI Pipeline workflows to capture processing provenance and description. One of the real strengths of this system is the capacity to easily recreate the processing applied to a file by viewing its provenance file, extracting the workflow, and rerunning it in the LONI Pipeline.
What is Workflow Provenance?
Binary provenance describes the creation of a binary, whereas workflow provenance describes the actual invocation of that binary, either on its own or in the context of a series of steps (a workflow). In its simplest form, the workflow provenance XSD captures the names of the data files used as input and output, the options used to invoke the single binary, and the environment in which the binary was run. Arguments to the binary are captured by recording the command line that was used to invoke it. The processing environment is described similarly to the environment for compilation, but also includes environment variables that may modify the behavior of the binary. For example, the FSL tools (Smith et al., 2004) use an environment variable called “FSLOUTPUTTYPE” to define the file format of the output image.
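The record described above can be sketched in a few lines of Python. This is only an illustration: the element names (invocation, commandLine, input, output, environment, variable), the file names, and the example command line are assumptions made for this sketch and are not taken from the actual provenance XSD.

```python
import shlex
import xml.etree.ElementTree as ET


def invocation_record(argv, inputs, outputs, env, env_keys):
    """Build an XML element describing one invocation of a binary.

    Element and attribute names are hypothetical, not the real XSD.
    """
    root = ET.Element("invocation")
    # Record the command line as a single shell-quoted string.
    ET.SubElement(root, "commandLine").text = shlex.join(argv)
    for path in inputs:
        ET.SubElement(root, "input").text = path
    for path in outputs:
        ET.SubElement(root, "output").text = path
    # Keep only the environment variables known to change the binary's behavior.
    env_el = ET.SubElement(root, "environment")
    for key in env_keys:
        if key in env:
            ET.SubElement(env_el, "variable", name=key).text = env[key]
    return root


# Illustrative values only; the file names are placeholders.
record = invocation_record(
    argv=["bet", "subject01.nii.gz", "subject01_brain.nii.gz"],
    inputs=["subject01.nii.gz"],
    outputs=["subject01_brain.nii.gz"],
    env={"FSLOUTPUTTYPE": "NIFTI_GZ", "HOME": "/home/user"},
    env_keys=["FSLOUTPUTTYPE"],
)
print(ET.tostring(record, encoding="unicode"))
```

Passing an explicit list of variable names, rather than dumping the whole environment, keeps the record limited to the settings that can actually alter the output.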
Often image processing is complex and non-linear and cannot be represented in a simple script or directed acyclic graph. Data may converge along several lines of processing only to diverge again after a common step. These complex workflows are difficult to document, either for publication or later re-use. Capturing the provenance for these workflows is equally complex, requiring not only the execution order of the individual steps, but also how those steps are related to one another, especially when multiple lines of data are being processed simultaneously. To address this issue we have used the LONI Pipeline Processing Environment (LONI Pipeline; http://pipeline.loni.usc.edu) (Rex et al., 2003) to capture workflow provenance and description.
The LONI Pipeline is a simple, efficient, and distributed computing solution to these problems, enabling software inclusion from different laboratories in different environments. It provides a visual programming interface for the design, execution, and dissemination of neuroimaging analyses. Individual executables are represented as “modules” that can be added, deleted, and substituted for other modules within a graphical user interface. Connections between the modules that establish an analysis methodology are represented by a “workflow”. The environment handles bookkeeping, controls the details of the computation, and manages information transfer between modules and within the workflow. It allows files, intermediate results, and other information to be accurately passed between individually connected modules. Modules and workflows can be saved to disk at any stage of development and recalled at a later time for modification, use, or distribution.
Module Descriptions
Individual modules are representations of programs that are present on a LONI Pipeline server (which can also be the client). The modules are XML descriptions of the executables and how they should be invoked from a command-line interface. Each module contains text descriptions of all the arguments for the program and a description of the program itself (see figure 2 on the left).
Using the LONI Pipeline as an example of workflow software, we have designed the provenance framework to take advantage of context information that can only be kept while using workflow software. Specifically, conditionals between executables and loops can be represented in a higher-level workflow language and associated with a series of executable events in the provenance. More generally, we want to be able to track how data is derived with sufficient precision that one can create or recreate it from this knowledge. Workflow provenance can then be added to the provenance XML file by copying the entire workflow into the file (http://www.loni.usc.edu/Software/NI_Provenance).
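The idea of copying a whole workflow into a provenance file, and later pulling it back out to rerun it, can be sketched with Python's standard XML library. The two XML fragments below are hypothetical stand-ins: the tag and attribute names (provenance, executable, workflow, module, connection, source, sink) are assumptions for illustration and do not reflect the real LONI Pipeline or provenance XSD formats.

```python
import xml.etree.ElementTree as ET

# Hypothetical stand-ins for a provenance file and a saved workflow.
provenance = ET.fromstring("<provenance><executable name='example'/></provenance>")
workflow = ET.fromstring(
    "<workflow><module id='m1'/><module id='m2'/>"
    "<connection source='m1' sink='m2'/></workflow>"
)

# Embed: copy the entire workflow into the provenance document.
provenance.append(workflow)
serialized = ET.tostring(provenance, encoding="unicode")

# Extract: reload the provenance document and pull the workflow back out.
reloaded = ET.fromstring(serialized)
extracted = reloaded.find("workflow")
module_ids = [m.get("id") for m in extracted.findall("module")]
print(module_ids)
```

Because the workflow is stored whole rather than summarized, the extracted element carries the modules and their connections unchanged, which is what makes rerunning the original processing possible.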
How is Provenance Executed?
The approach we used to document provenance combines both the data-oriented model and the process-oriented model. In the process-oriented model, binary provenance describes how a piece of software was compiled. It consists of two parts: a description of the environment and a description of the binary itself. The environment description includes the operating system, environment variables, compiler used, and libraries installed. The binary description includes configuration flags and/or modifications made to configuration files or makefiles. Our goal is to provide the user with the ability to reproduce not only the binary, but also the environment in which it was run.
A fundamental difference between executables is the hardware platform on which they were compiled. Differences in floating-point performance across architectures can have a profound impact on the outcome of a calculation and have been widely publicized in the popular media (Halfhill, 1995). The XSD captures not only the architecture, but also the specific processor and the flags that are enabled on it.
Capturing pertinent details about the operating system is complicated, especially for Linux distributions, since each distribution contains many individually updated components. Essential information, such as the operating system name, version, distribution, kernel name, and kernel version, must be captured. For example, an application running on Ubuntu Dapper Drake would have the following operating system metadata: Linux, 6.06, Ubuntu Desktop, #1 PREEMPT, 2.6.15-27-386; whereas an application built on the Mac OS X Leopard platform would have the following operating system metadata: Mac OS, 10.5.1, n/a, Darwin, 9.1.0.
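Part of this hardware and operating system metadata can be gathered automatically. The following is a minimal sketch using Python's standard platform module; the dictionary keys are hypothetical labels rather than names from the XSD. The distribution name and version, as well as processor flags, are not reported by uname and would have to come from another source.

```python
import platform


def os_metadata():
    """Collect basic operating system and hardware details for the current host.

    The dictionary keys are hypothetical labels, not names from the XSD.
    """
    u = platform.uname()
    return {
        "kernel_name": u.system,            # e.g. "Linux" or "Darwin"
        "kernel_version": u.release,        # kernel release string
        "kernel_build": u.version,          # kernel build information
        "architecture": u.machine,          # e.g. "x86_64"
        "processor": platform.processor(),  # may be empty on some systems
    }


meta = os_metadata()
for key, value in meta.items():
    print(f"{key}: {value}")
```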
The compiler used and the libraries linked during compilation are a crucial aspect of the environment. In addition to the compiler name and version, a list of the updates that have been applied is also captured. This section of the provenance metadata also records which flags were used when the compiler was invoked, with architecture and optimization flags being of special interest. Libraries used for compilation are described in the same way as the binary itself, and the description is recursive: if a library is in turn linked to other libraries, those libraries are also captured in the library’s provenance.
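The recursive nature of library provenance maps naturally onto a tree. The sketch below is one possible in-memory representation, not the structure defined by the XSD; the class name, field names, and the library names and versions are all placeholders chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Library:
    """A library and the libraries it was itself linked against."""
    name: str
    version: str
    linked: List["Library"] = field(default_factory=list)


def flatten(lib, depth=0):
    """Walk the library tree depth-first, yielding (depth, name, version)."""
    yield depth, lib.name, lib.version
    for dep in lib.linked:
        yield from flatten(dep, depth + 1)


# Placeholder names and versions, for illustration only.
libz = Library("libz", "1.0")
libimg = Library("libimg", "2.3", linked=[libz])
app_libs = Library("libapp", "0.9", linked=[libimg])

rows = list(flatten(app_libs))
for depth, name, version in rows:
    print("  " * depth + f"{name} {version}")
```

Walking the tree depth-first lists every library together with how far removed it is from the binary, so indirectly linked libraries are not lost from the record.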
Binaries can also be configured prior to compilation. Some packages are distributed in a format for use with the GNU build system, or Autotools (Vaughan, 2000). Modification of the configure script or the makefile can yield substantially different results after compilation. The provenance XSD captures flags passed to the configure script as well as modifications to configure scripts and makefiles.
The concept of provenance can extend to knowledge of the behavior of executables, such as a description of their function. The Brain Surface Extractor (BSE) (Shattuck and Leahy, 2002), the Brain Extraction Tool (BET) (Smith, 2002), and MRI Watershed (Dale et al., 1999) are all brain extraction algorithms. However, their function may not be evident to a naive user, especially since they are commonly referred to by their abbreviations. This information, in addition to a short description of the executable, is also captured in the XSD and may be added to provenance XML files.
Executable provenance need only be collected once, when a binary is compiled or when a script is written. At present it must be recorded manually and then appended to the provenance XML. The LONI Pipeline is currently being extended to store and display executable provenance, eliminating the need for manual file editing in the future.
Provenance in Neuroimaging
Allan J. MacKenzie-Graham, John D. Van Horn, Roger P. Woods, Karen L. Crawford, Arthur W. Toga
Neuroimaging Data Provenance Using the LONI Pipeline Workflow Environment
Allan J. MacKenzie-Graham, Arash Payan, Ivo D. Dinov, John D. Van Horn, Arthur W. Toga