Neuroimaging is a crucial tool for both research and clinical neuroscience. A significant challenge in neuroimaging, and in fact all biological sciences, concerns devising ways to manage the enormous amounts of data generated using current techniques. This challenge is compounded by expansion of collaborative efforts in recent years and the necessity of not only sharing data across multiple sites, but making that data available and useful to the scientific community at large.
The need for solutions that facilitate the process of tool and data exchange has been recognized by the scientific community and numerous efforts are underway to achieve this goal (Murphy et al., 2006). To be meaningful, the tools employed and data considered must be adequately described and documented. The metadata describing the origin and subsequent processing of biological images is often referred to as “provenance” (also “lineage” or “pedigree”) (Simmhan et al., 2005).
Provenance in neuroimaging has often been discussed, but few solutions have been suggested. Recently, Simon Miles and Luc Moreau (University of Southampton), and Mike Wilde and Ian Foster (University of Chicago/Argonne National Laboratory) proposed a provenance challenge to determine the state of available provenance systems (Moreau et al., 2007). The challenge consisted of collecting provenance information from a simple neuroimaging workflow (Zhao et al., 2005) and documenting each system’s ability to respond to a set of predefined queries. Some of these existing provenance systems have previously been proposed as mechanisms for capturing provenance in neuroimaging, though they have not been widely adopted (Zhao et al., 2006). The main difficulty appears to be the need for a system that captures provenance information accurately, completely, and with minimal user intervention. By minimizing the burden on the provenance compiler, a comprehensive system would greatly improve compliance and free the user to focus on performing neuroimaging research rather than collecting provenance.
Provenance can be used for determining data quality, for interpretation, and for interoperability (Simmhan et al., 2005; Zhao et al., 2004). In the biological sciences, provenance describing how data was obtained is crucial for assessing the quality and usefulness of information, as well as for enabling data analysis in an appropriate context. It is therefore imperative that the provenance of biological images be easily captured and readily accessible. In multiple sclerosis research, for example, increasingly complex analysis workflows are being developed to extract information from large cross-sectional or longitudinal studies (Liu et al., 2005). The same is true of studies of Alzheimer’s disease (Fleisher et al., 2005; Mueller et al., 2005; Rusinek et al., 2003), autism (Langen et al., 2007), depression (Drevets, 2000), schizophrenia (Narr et al., 2007), and even normal populations (Mazziotta et al., 1995). The implementation of the complex workflows associated with these studies requires the establishment of quality-control practices to ensure the accuracy, reproducibility, and reusability of the results. In effect, this requires provenance.
Provenance can be divided into two subtypes: data provenance and processing provenance. Data provenance is the metadata that describes the subject, how an image of that subject was collected, who acquired the image, what instrument was used, what settings were used, and how the sample was prepared. However, most scientific image data is not obtained directly from such measurements, but rather derived from other data by the application of computational processes. Processing provenance is the metadata that defines what processing an image has undergone; for example, how the image was skull-stripped, what form of inhomogeneity correction was used, and how it was aligned to a standard space. Even data that is presented as “raw” often has been subjected to reconstruction software or converted from the scanner’s native image format to a more commonly used and easily shared file format (Van Horn et al., 2004). A complete provenance model would capture both kinds of information, making the history of a data product transparent and enabling the free sharing of data across the neuroimaging community.
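The distinction between the two subtypes can be illustrated with a small data structure. The following is only a minimal sketch, not the schema described in this report: the class names, field names, tool names, and example values are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class DataProvenance:
    """Acquisition metadata (hypothetical fields): subject, operator, instrument, settings."""
    subject_id: str
    acquired_by: str
    instrument: str
    settings: Dict[str, object]


@dataclass
class ProcessingStep:
    """One computational step applied to an image (hypothetical fields)."""
    name: str
    program: str
    version: str
    parameters: Dict[str, object]


@dataclass
class ImageProvenance:
    """Pairs the acquisition record with an ordered list of processing steps."""
    data: DataProvenance
    processing: List[ProcessingStep] = field(default_factory=list)

    def add_step(self, step: ProcessingStep) -> None:
        self.processing.append(step)

    def history(self) -> List[str]:
        # Names of the steps, in the order they were applied.
        return [step.name for step in self.processing]


# Illustrative values only; none of these come from a real study or tool.
record = ImageProvenance(
    DataProvenance("subj-001", "operator A", "example MRI scanner", {"TR_ms": 2000})
)
record.add_step(ProcessingStep("format conversion", "convert_tool", "1.0", {}))
record.add_step(ProcessingStep("skull-stripping", "strip_tool", "2.1", {"threshold": 0.5}))
```

Keeping the acquisition record separate from the ordered list of steps mirrors the division above: the former is fixed at collection time, while the latter grows with each derived data product.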
Some data provenance is typically captured at the site where the data is collected, in the headers of image files or in databases that record image acquisition (Erberich et al., 2007; Martone et al., 2003). An abbreviated form of this kind of provenance is often reported in method descriptions or even in the image files themselves (Bidgood et al., 1997). However, this data is seldom propagated with the images, since it is commonly removed or ignored in the course of file conversion for further processing.
Processing provenance can be collected about any resource in the data processing system and may include multiple levels of detail. Two major models for collecting processing provenance have been described: a process-oriented model (Zhao et al., 2004) and a data-oriented model (Simmhan et al., 2005). The process-oriented model collects lineage information from the deriving processes, and provenance is inferred from the processing and by inspection of the input and output data. This mechanism is well suited for situations where individual data products are tracked within comprehensive frameworks and where the deriving processes can easily be reapplied to the original data to reproduce the data product. In the data-oriented model, lineage information is explicitly gathered about the set of data. This method is better suited for situations where data sharing occurs across heterogeneous environments and intermediate data products may not be available for reproduction. This would be the case, for example, when data is shared between two laboratories.
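The contrast between the two models can be sketched in a few lines of code. This is an illustrative toy under stated assumptions, not an implementation of either cited system: the file names, process names, and the `infer_lineage` helper are hypothetical.

```python
# Process-oriented view: the workflow log is the record, and the lineage
# of a data product is inferred by inspecting inputs and outputs.
process_log = [
    {"process": "convert", "inputs": ["raw.dcm"], "output": "raw.img"},
    {"process": "skull_strip", "inputs": ["raw.img"], "output": "brain.img"},
    {"process": "align", "inputs": ["brain.img", "atlas.img"], "output": "aligned.img"},
]


def infer_lineage(product, log):
    """Walk the log backwards from `product` and list the processes that led to it.

    The returned order is correct for a simple linear chain like the one above;
    a branching workflow would need a proper topological sort.
    """
    by_output = {entry["output"]: entry for entry in log}
    lineage = []
    pending = [product]
    while pending:
        item = pending.pop()
        entry = by_output.get(item)
        if entry is None:
            continue  # an original input, not produced by any logged process
        lineage.append(entry["process"])
        pending.extend(entry["inputs"])
    return list(reversed(lineage))


# Data-oriented view: the lineage is stored explicitly with the data product,
# so it survives even when the log and intermediate files are unavailable.
data_record = {
    "file": "aligned.img",
    "lineage": ["convert", "skull_strip", "align"],
}
```

In the first case the lineage is only recoverable while the log and its framework are at hand; in the second it travels with the file, which is why the data-oriented model suits exchange between laboratories.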
The analysis of raw data in neuroimaging has become a computationally rich process with many intricate steps run on increasingly large datasets (Liu et al., 2005). Many software packages exist that provide either complete analyses or specific steps in an analysis. These packages often have diverse input and output requirements, utilize different file formats, run in particular environments, and have limited support for certain types of data. The combination of these packages to achieve more sensitive and accurate results has become a common tactic in brain mapping studies, but requires much work to ensure valid interoperation between programs. To address this issue, we have developed an XML schema (XSD) that guides the production and validation of XML files capturing data and processing provenance, and we have incorporated it into a simple, easy-to-use environment. In this report we describe this XSD and a simple tool for documenting data provenance, and we demonstrate that minor differences in compilation can lead to measurable differences in results, reinforcing the need for comprehensive provenance collection. The details and assessment of this provenance schema are also presented.
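To give a sense of what an XML provenance record might look like, the sketch below builds one with Python's standard library. It does not reproduce the XSD described in this report: every element and attribute name (`provenance`, `dataProvenance`, `processStep`, `compiler`, and so on), the tool names, and the example values are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET


def build_provenance_xml(subject_id, steps):
    """Build a provenance document with hypothetical element names."""
    root = ET.Element("provenance")

    # Data provenance: who or what the image describes.
    data = ET.SubElement(root, "dataProvenance")
    ET.SubElement(data, "subject").text = subject_id

    # Processing provenance: one element per step, in order of application.
    processing = ET.SubElement(root, "processingProvenance")
    for step in steps:
        el = ET.SubElement(processing, "processStep", name=step["name"])
        ET.SubElement(el, "program", version=step["version"]).text = step["program"]
        # Recording the build environment is an assumption here, motivated by
        # the observation that compilation differences can affect results.
        ET.SubElement(el, "compiler").text = step["compiler"]
    return root


# Illustrative values only.
steps = [
    {"name": "skull-stripping", "program": "strip_tool", "version": "2.1",
     "compiler": "example-cc 4.0"},
    {"name": "alignment", "program": "align_tool", "version": "1.3",
     "compiler": "example-cc 4.1"},
]
root = build_provenance_xml("subj-001", steps)
xml_text = ET.tostring(root, encoding="unicode")
```

The standard library can produce and parse such a file but cannot validate it against an XSD; validation would need a third-party library with XML Schema support, such as lxml.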