Astro-WISE (mwebaze2009a, mwebaze2011a) is a Python framework that astronomers can use to collect and integrate provenance in their tools and experiments. Thus, capturing provenance with Astro-WISE requires instrumenting the code.

Astro-WISE uses an object-oriented programming approach for capturing provenance. It defines descriptors that can be used within classes to declare attributes. These descriptors specify that the framework should trace the lineage of the attributes. To do so, the descriptors must declare dependencies on other traced classes or on file accesses and parameters. Astro-WISE enforces the immutability of the attributes and assigns a unique object identifier and version to them, which allows some basic evolution tracking. Instead of comparing the source code for versioning, Astro-WISE compares the bytecode (mwebaze2011a), and each class keeps a reference to its predecessors. Each class in Astro-WISE must define a make method that specifies file targets, their dependencies, and the transformation commands that create the derived product.
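
The sketch below, with hypothetical names rather than Astro-WISE's actual API, only illustrates the descriptor-based style described above: assignments to descriptor-declared attributes are intercepted and recorded as lineage, and a make method states how the derived product is built.

class TracedAttribute:
    """Descriptor that records every assignment as a lineage event."""
    def __set_name__(self, owner, name):
        self.name = "_" + name

    def __get__(self, obj, objtype=None):
        return getattr(obj, self.name)

    def __set__(self, obj, value):
        obj.lineage.append((self.name[1:], value))  # record the dependency
        setattr(obj, self.name, value)

class BiasFrame:
    raw_frame = TracedAttribute()  # depends on another traced object
    overscan = TracedAttribute()   # a traced parameter

    def __init__(self):
        self.lineage = []

    def make(self):
        # The transformation that creates the derived product.
        self.raw_frame = "raw.fits"
        self.overscan = 300

frame = BiasFrame()
frame.make()
print(frame.lineage)  # [('raw_frame', 'raw.fits'), ('overscan', 300)]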

Astro-WISE supports the re-computation of targets by checking which dependencies have changed. It also supports comparing the dependencies of two targets as lineage trees to identify what differs between them.

Astro-WISE presents the dependencies of a target file as a textual lineage tree in a web visualization tool. This tree is equivalent to a data-view graph with nodes representing data. In addition to the visualization tool, Astro-WISE provides Python functions as an extended query language over the database. These functions are internally translated into SQL queries for the relational database.

@inproceedings{mwebaze2011a,
  address = {Stockholm, Sweden},
  author = {Mwebaze, Johnson and Boxhoorn, Danny and Valentijn, Edwin},
  booktitle = {IEEE International Conference on e-Science},
  pages = {263--270},
  publisher = {IEEE},
  title = {{D}ynamic {P}ipeline {C}hanges in {S}cientific {D}ata {P}rocessing},
  year = {2011}
}

@inproceedings{mwebaze2009a,
  address = {Indianapolis, USA},
  author = {Mwebaze, Johnson and Boxhoorn, Danny and Valentijn, Edwin},
  booktitle = {International Conference on Network-Based Information Systems},
  pages = {475--480},
  publisher = {IEEE},
  title = {{A}stro-{WISE}: {T}racing and using lineage for scientific data processing},
  year = {2009}
}

Becker and Chambers (becker1988a) propose a facility for S that captures provenance for comprehension and for validating reported results. They modified the S interpreter to support execution provenance collection; thus, their approach applies the overriding strategy. During provenance collection, they produce an audit file that contains all evaluated statements together with the objects that each statement read or assigned.

For analysis, they read the audit file and produce a data structure that allows users to run lineage queries interactively to understand what happened during the data analysis session. Becker and Chambers (1988) provide functions for running these queries. They also support plotting the relationships between statements as a process-centric graph.

The provenance can also be used to generate an executable script that incorporates the statements needed to reproduce a specified result. Becker and Chambers (1988) handle random number generation by including artificial statements in the audit file that record the seed.

@article{becker1988a,
  author = {Becker, Richard A and Chambers, John M},
  journal = {SIAM Journal on Scientific and Statistical Computing},
  number = {4},
  pages = {747--760},
  publisher = {SIAM},
  title = {{A}uditing of data analyses},
  volume = {9},
  year = {1988}
}

Bochner, Gude, and Schreiber (bochner2008a) propose a library that collects provenance from Python scripts to check compliance with applicable regulations. The library requires users to annotate what they want to capture and how; thus, it applies the instrumentation strategy for gathering execution provenance. These annotations can also be used to gather definition and deployment provenance dynamically.

During user-defined provenance collection, the library connects to remote storage services and stores provenance as XML documents. Hence, it supports provenance querying with XQuery and XPath.
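
Since the library's API is not detailed here, the sketch below only illustrates how such XML provenance documents could be queried from Python with XPath expressions. The document shape (provenance, process, input, and output elements) is an assumption for illustration, not taken from the paper.

import xml.etree.ElementTree as ET

# Assumed document shape; the real library defines its own XML schema.
doc = ET.fromstring("""
<provenance>
  <process name="clean_data">
    <input file="raw.csv"/>
    <output file="clean.csv"/>
  </process>
  <process name="plot">
    <input file="clean.csv"/>
  </process>
</provenance>
""")

# XPath predicate: every process that read raw.csv.
for process in doc.findall(".//process"):
    if process.find("input[@file='raw.csv']") is not None:
        print(process.get("name"))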

@inproceedings{bochner2008a,
  address = {Salt-Lake City, USA},
  author = {Bochner, Carsten and Gude, Roland and Schreiber, Andreas},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {229--240},
  publisher = {Springer},
  title = {{A} python library for provenance recording and querying},
  year = {2008}
}

Core Provenance Library (CPL) (macko2012a) is a library with implementations for C, C++, Perl, Java, Python, and R that programmers and scientists can use to integrate provenance in their tools and experiments.

CPL provides functions for initializing the library; connecting to the provenance storage backend; creating provenance objects; accessing existing objects; creating dataflow links for objects that use data from other objects; creating control-flow links for objects that influence others without passing data; attaching properties to objects; and finishing the execution to store the provenance. It also provides similar functions for shared objects, such as files, and functions to collect deployment provenance.
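
The sketch below illustrates that call sequence with a recording stub; the method names are illustrative stand-ins, not CPL's actual bindings.

class FakeCPLSession:
    """Stub standing in for a CPL session; records calls instead of persisting them."""
    def __init__(self):
        self.log = []

    def __getattr__(self, name):
        def record(*args, **kwargs):
            self.log.append((name, args, kwargs))
            return len(self.log)  # pretend to return an object handle
        return record

cpl = FakeCPLSession()
cpl.attach("mysql://localhost/provenance")        # connect to a storage backend
raw = cpl.create_object("raw.csv", type="file")   # create a provenance object
clean = cpl.create_object("clean.csv", type="file")
cpl.data_flow(source=raw, target=clean)           # clean.csv uses data from raw.csv
cpl.control_flow(source="cleaner", target=clean)  # influence without passing data
cpl.add_property(clean, "rows", "10000")          # attach a property
cpl.detach()                                      # finish and store the provenance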

When CPL creates a provenance object, it assigns two identifiers to it: an object ID that identifies it within the machine, and the MAC address of the machine, which enables sharing objects with provenance over the network. In addition to these identifiers, CPL annotates all provenance records with a trial identifier to track provenance evolution.

CPL can store provenance either in a relational database or in a graph database: it supports MySQL and PostgreSQL as relational backends and 4store as a graph backend. For analysis, it provides functions for accessing the provenance and supports provenance queries in SPARQL and SQL.

@inproceedings{macko2012a,
  address = {Boston, MA, USA},
  author = {Macko, Peter and Seltzer, Margo},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--6},
  publisher = {USENIX},
  title = {{A} General-{P}urpose {P}rovenance {L}ibrary},
  year = {2012}
}

CXXR (runnalls2012a, silles2010a) is a project that aims to reimplement fundamental parts of the standard R interpreter in C++ and to enhance it with provenance tracking for R data objects. The goal of provenance tracking is to provide an understanding of object derivations by providing functions to obtain a variable's ancestors (i.e., the objects it depends on), descendants (i.e., the objects that depend on it), and pedigree (i.e., the sequence of commands that led to its binding). Since it changes the standard R interpreter, it applies the overriding strategy.

For provenance tracking, it attaches read and write monitors to R global environments. Reading, creating, or overwriting variable bindings in these environments triggers the monitors, which collect the timestamp, the expression, and the child bindings of each variable binding. A binding is a child of another when it uses its parent's value during its evaluation. CXXR adds every new variable binding to a set of bindings that the interpreter has seen; this way, it avoids repeating provenance collection in loops.

CXXR considers top-level commands to be provenance processes and variable bindings to be provenance entities. Hence, it does not consider the results of function calls as provenance entities. However, for some functions whose behavior is not fully defined (e.g., reading user input or file content), CXXR creates the resulting value as a binding. Yet, as CXXR tracks only top-level provenance in the global environment, it is possible to evade provenance tracking through local environments in stateful function calls.

For provenance analysis, CXXR allows users to call functions during program execution, as it does not store provenance. Silles and Runnalls (2010) describe two provenance retrieval functions: provenance(x) returns the timestamp of the binding x, the expression responsible for its current state, and a list of both ancestors and descendants; pedigree(x) returns the full sequence of commands that influenced the current binding of x, with all ancestors sorted by binding creation.

@inproceedings{runnalls2011a,
  address = {Miami Beach, FL, USA},
  author = {Runnalls, Andrew R and Silles, Chris A},
  booktitle = {Joint Statistical Meetings},
  pages = {1--9},
  publisher = {AMSTAT},
  title = {C{X}XR: {A}n ideas hatchery for future {R} development},
  year = {2011}
}

@inproceedings{silles2010a,
  address = {Troy, NY, USA},
  author = {Silles, Chris A and Runnalls, Andrew R},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {64--72},
  proceedings = {LNCS 6378},
  publisher = {Springer},
  title = {{P}rovenance-awareness in {R}},
  year = {2010}
}

@phdthesis{silles2014a,
  author = {Silles, Christopher Anthony},
  school = {University of Kent},
  title = {{P}rovenance-aware {C}XX{R}},
  year = {2014}
}

@article{runnalls2011b,
  author = {Runnalls, Andrew R},
  journal = {Computational Statistics},
  number = {3},
  pages = {427--442},
  publisher = {Springer},
  title = {{A}spects of {C}XX{R} internals},
  volume = {26},
  year = {2011}
}

@inproceedings{runnalls2012a,
  address = {Santa Barbara, CA, USA},
  author = {Runnalls, Andrew and Silles, Chris},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {237--239},
  publisher = {Springer},
  title = {{P}rovenance tracking in {R}},
  year = {2012}
}

Datatrack (eichinski2016a) is a package that collects provenance from R scripts to support their management. It provides wrapper functions for accessing files that allow users to instrument the code, describing dependencies and annotating data objects.

Instead of using the default R functions for opening files, users need to use the Datatrack ones to capture provenance. Thus, it uses inclusive, internal, executable annotations. These functions receive as arguments the file name, a list of dependencies, parameters, and optional annotations that target provenance. Users can use arbitrary dependencies and annotations to describe the experiment. All dependencies are treated as data dependencies and connected by their names. The wrapper functions also collect the date the data object was accessed and a stack trace of function calls. In addition to the execution provenance, Datatrack also records the script parameters and deployment information, such as the operating system version, the R version, and the loaded packages with their versions.

Datatrack stores provenance and versioning metadata in a CSV file under the project directory.

For analysis, Datatrack produces a data-centric graph with all files and user-declared dependencies. This graph presents nodes representing data and edges representing dependencies. A single data node can present multiple trial versions.

@inproceedings{eichinski2016a,
  address = {Baltimore, Maryland, USA},
  author = {Eichinski, Philip and Roe, Paul},
  booktitle = {IEEE International Conference on e-Science},
  pages = {1--8},
  publisher = {IEEE},
  title = {{D}atatrack: {A}n {R} package for managing data in a multi-stage experimental workflow},
  year = {2016}
}

Earth System Science Server (ES3) (frew2008a; valeur2005a) captures provenance for comprehending Earth science data products. It collects execution provenance from binary files through diverse strategies. For scripts, it supports either monitoring the interpreter binary or capturing directly from scripts through alternative plugins that support instrumentation. Overall, ES3 applies three strategies for execution provenance collection: passive monitoring, overriding, and instrumentation.

ES3 uses a probulator to monitor the execution transparently. The probulator comprises two applications: a logger that instruments, monitors, and logs the execution; and a transmitter that sends the collected provenance to storage.

The logger uses plugins for capturing provenance. Frew et al. (2008) propose two plugins for ES3. The default one uses strace to trace system calls in a process and captures the process name, its arguments, the sequence of child processes, accessed files, and the standard input, output, and error streams. Since system call traces can be overwhelming, users can configure filter files with patterns that can be safely ignored during collection, such as shared library accesses. Hence, the default plugin uses both the passive monitoring and the instrumentation strategies. The second plugin is specific to IDL. It preprocesses IDL scripts, replacing certain calls to IDL built-in functions and inserting calls to ES3 for logging. Thus, the second plugin uses the overriding strategy for provenance collection. Both plugins produce files with the collected provenance as results.
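
As a rough illustration of the default plugin's approach (a sketch, not ES3's actual implementation), the snippet below invokes strace from Python to record the file-access system calls of a command, following child processes; the traced command and the ENOENT filter are arbitrary choices standing in for ES3's filter files. It requires strace, so it only runs on Linux.

import subprocess

# Trace file-open and exec system calls of a command, following forked
# children (-f) and writing the raw trace to es3.log (-o).
subprocess.run(
    ["strace", "-f", "-e", "trace=open,openat,execve", "-o", "es3.log", "ls"],
    check=False,
)

# Drop "file not found" probes, a crude analogue of ES3's filter files
# for patterns that can be safely ignored.
with open("es3.log") as log:
    for line in log:
        if "ENOENT" not in line:
            print(line.rstrip())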

After logging the provenance, the transmitter parses the files produced by the loggers, assigns a universally unique identifier to every provenance object, converts these objects into standard ES3 execution reports in XML, and transfers the XML documents to a web server that stores them in an XML database.

For analysis, ES3 supports querying the ancestors and descendants of provenance objects with XQuery. Additionally, it supports producing provenance graphs in GraphML format. The visualizations produced by ES3 combine data files, arguments, and transformations as nodes.

@inproceedings{frew2010a,
  address = {Troy, NY, USA},
  author = {Frew, James and Jan{\'e}e, Greg and Slaughter, Peter},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {27--33},
  publisher = {Springer},
  title = {{A}utomatic {P}rovenance {C}ollection and {P}ublishing in a {S}cience {D}ata {P}roduction {E}nvironment -- {E}arly {R}esults},
  year = {2010}
}

@inproceedings{frew2004a,
  address = {Palo Alto, CA},
  author = {Frew, James},
  booktitle = {Earth Science Technology Conference},
  link = {https://esto.ndc.nasa.gov/conferences/estc2004/papers/a4p3.pdf},
  pages = {1--5},
  publisher = {NASA},
  title = {{E}arth {S}ystem {S}cience {S}erver ({E}S3): {L}ocal {I}nfrastructure for {E}arth {S}cience {P}roduct {M}anagement},
  year = {2004}
}

@phdthesis{valeur2005a,
  author = {Valeur, H{\aa}var},
  link = {http://urn.ub.uu.se/resolve?urn=urn:nbn:no:ntnu:diva-1341},
  pages = {91},
  publisher = {Institutt for datateknikk og informasjonsvitenskap},
  school = {Norwegian University of Science and Technology, Trondheim},
  title = {{T}racking the lineage of arbitrary processing sequences},
  year = {2005}
}

@inproceedings{frew2011a,
  address = {Portland, OR, USA},
  author = {Frew, James and Jan{\'e}e, Greg and Slaughter, Peter},
  booktitle = {Scientific and Statistical Database Management},
  link = {http://ipaw2012.bren.ucsb.edu/images/3/39/68090244.pdf},
  pages = {244--252},
  publisher = {Springer},
  title = {{P}rovenance-enabled automatic data publishing},
  year = {2011}
}

@article{frew2008a,
  author = {Frew, James and Metzger, Dominic and Slaughter, Peter},
  doi = {10.1002/cpe.1247},
  journal = {Concurrency and Computation: Practice and Experience},
  link = {http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.559.9467\&rep=rep1\&type=pdf},
  number = {5},
  pages = {485--496},
  publisher = {Wiley Online Library},
  title = {{A}utomatic capture and reconstruction of computational provenance},
  volume = {20},
  year = {2008}
}

@inproceedings{frew2008b,
  address = {Salt Lake City, UT, USA},
  author = {Frew, James and Slaughter, Peter},
  booktitle = {International Provenance and Annotation Workshop},
  link = {https://pdfs.semanticscholar.org/e16f/7aa1b63e29255a2fe8f9fb1159fb50f90e61.pdf},
  pages = {200--207},
  publisher = {Springer},
  title = {{E}s3: {A} demonstration of transparent provenance for scientific computation},
  year = {2008}
}

Earth System Science Workbench (ESSW) (frew2001a) uses provenance to describe scientific experiments and to provide better comprehension and data management for them. It proposes wrapping experiments in instrumented Perl scripts for provenance collection; thus, it uses the instrumentation strategy.

ESSW provides a set of wrapper functions in Perl. These wrappers manipulate science objects and log metadata. ESSW provides functions to create science objects based on files, functions to link them to other objects, and functions to store them in an XML document during execution. ESSW transforms the links between these objects into lineage graphs. Hence, users are required to annotate their code, specifying what provenance to collect and how.

ESSW parses the XML document created by a trial and stores the science objects and relationships in a relational database (MySQL). For file objects, ESSW stores their MD5 hashes in the relational database and their content in a separate content database.

After storing the science objects in the database, it is possible to perform SQL queries over them for analysis. Additionally, ESSW provides a web interface for querying metadata of science objects and visualizing their lineage graph. It uses GraphViz to produce a static combined view that presents both data files and experiment processes as nodes.

@inproceedings{frew2001a,
  address = {Fairfax, VA, USA},
  author = {Frew, James and Bose, Rajendra},
  booktitle = {Scientific and Statistical Database Management},
  doi = {10.1109/SSDM.2001.938550},
  pages = {180--189},
  publisher = {IEEE},
  title = {{E}arth system science workbench: {A} data management infrastructure for earth science products},
  year = {2001}
}

IncPy (guo2010a, guo2011b) collects provenance to support cache invalidation. It modifies the Python interpreter to enhance Python with automatic provenance collection and memoization of long-running function executions. When the modified interpreter invokes a function that has a memoized result, it inspects the function's provenance to check whether the call has the same arguments and dependencies as the memoized result. The interpreter uses memoized results only when it is safe to do so. Since IncPy modifies how the interpreter handles functions for memoization, it collects provenance through the overriding strategy.

For each function execution, IncPy collects the function name, function definition, passed arguments, global variables, input files, code dependencies (e.g., functions executed inside the cached one), output files, return values, terminal output, and function duration. As the function definition, it collects the function's bytecode so that spacing, comments, and other cosmetic changes do not break memoization. It uses cPickle to serialize entire arguments, global variables, and return values recursively.

It uses the function name, function definition, passed arguments, global variables, input files, and code dependencies to identify the memoized value. Thus, if an argument, global variable, or input file changes, IncPy will not use the memoized results. If the user changes the definition of the memoized function or of any function that has the memoized function in its stack (e.g., a function called by the memoized one), IncPy automatically deletes the previous cache entries. When IncPy uses memoized values for a function, the function just copies output files, prints the cached terminal output, and returns the cached return values.
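
A minimal sketch of this memoization style under simplified assumptions (it ignores global variables, input files, and purity checks, which IncPy also tracks): the cache key combines the function's bytecode with its pickled arguments, so cosmetic edits keep the cache valid while substantive changes invalidate it.

import hashlib
import pickle

_cache = {}

def memoize(func):
    """Cache results keyed on the function's bytecode and its arguments."""
    def wrapper(*args, **kwargs):
        key = hashlib.sha1(
            func.__code__.co_code +  # bytecode: immune to comments and spacing
            pickle.dumps((args, sorted(kwargs.items())))
        ).hexdigest()
        if key not in _cache:
            _cache[key] = func(*args, **kwargs)
        return _cache[key]
    return wrapper

@memoize
def slow_square(x):
    return x * x

print(slow_square(4))  # computed
print(slow_square(4))  # served from the cache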

IncPy only memoizes deterministic pure functions (i.e., functions that do not change global state and always produce the same results for the same input). It determines whether a function is pure by tracking where variables were defined and by collecting the dependencies of global variables and arguments. Note that a call can be considered pure for some arguments and impure for others. IncPy considers calls that access random number generators or the system clock to be non-deterministic.

Since it captures provenance within the Python interpreter, IncPy cannot track impurity, dependencies, or non-determinism in C/C++ extensions and external executables. Additionally, it does not handle non-determinism from network accesses. Thus, it allows users to annotate which functions they always or never want to memoize.

IncPy stores the memoization when the function is about to exit, which allows users to continue interrupted executions. At the function exit moment, it collects the function duration and checks whether the function took more than one second to run: IncPy only caches functions that take longer than one second. Additionally, it tracks the time it takes to store the cache on disk. If storing takes longer than the call itself, it assumes that recomputing the function is faster than using the cached results, and it removes the cache entry.

@inproceedings{guo2010a,
  address = {Troy, NY, USA},
  author = {Guo, Philip J and Engler, Dawson R},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {1--10},
  publisher = {Springer},
  title = {{T}owards {P}ractical {I}ncremental {R}ecomputation for {S}cientists: {A}n {I}mplementation for the {P}ython {L}anguage},
  year = {2010}
}

@phdthesis{guo2012c,
  address = {Stanford University},
  author = {Guo, Philip Jia},
  school = {Stanford University},
  title = {{S}oftware tools to facilitate research programming},
  year = {2012}
}

@inproceedings{guo2011b,
  address = {Toronto, ON, Canada},
  author = {Guo, Philip J and Engler, Dawson},
  booktitle = {ACM International Symposium on Software Testing and Analysis},
  pages = {287--297},
  publisher = {ACM},
  title = {{U}sing automatic persistent memoization to facilitate data analysis scripting},
  year = {2011}
}

Lancet (stevens2013a) is a Python library that supports defining declarative specifications for experiments and collects provenance of executions for reproducing the experiments. Thus, users need to instrument and rewrite their code to collect provenance with Lancet.

With Lancet, users declare arguments, commands, and launchers. Arguments express what users aim to achieve, in platform- and tool-independent specifications. An argument specification can define multiple argument sets for parameter sweeping. Commands express how to achieve it, in platform-independent but tool-dependent specifications. A command specification handles the interface to an external tool and supports defining how to run the tool on multiple platforms. Finally, launchers express where to execute the task, in platform-dependent but tool-independent notation. A launcher specification binds arguments to commands and launches jobs on the desired platform. Lancet provides launchers to run locally or in parallel on a Grid Engine.
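
The sketch below mirrors this division of responsibilities with hypothetical stand-in classes; it is not Lancet's verified API, only an illustration of how arguments, commands, and launchers separate concerns.

from itertools import product

class Args:
    """What to run: tool- and platform-independent argument sets (a sweep)."""
    def __init__(self, **sweeps):
        keys = list(sweeps)
        self.sets = [dict(zip(keys, values)) for values in product(*sweeps.values())]

class Command:
    """How to run it: tool-dependent command-line construction."""
    def __init__(self, executable):
        self.executable = executable

    def build(self, args):
        return [self.executable] + [f"--{key}={value}" for key, value in args.items()]

class LocalLauncher:
    """Where to run it: platform-dependent job submission (here, just printed)."""
    def launch(self, command, arguments):
        for args in arguments.sets:
            print("would run:", " ".join(command.build(args)))

LocalLauncher().launch(Command("simulate"), Args(rate=[0.1, 0.2], seed=[1, 2]))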

When the launchers execute, Lancet records the Python version, the Lancet version, operating system information, and other useful metadata as deployment provenance. In addition, Lancet records the launcher representation along with the argument data as definition provenance. The launcher representation is enough to recreate the launcher for reproducibility.

Lancet stores the collected provenance in a file and allows users to include annotations with library versions, comments, and metadata descriptions in this file. Lancet also offers a function that helps write version control information to the file, maintaining an explicit log of all parameters used in the experiment history. For provenance analysis, users can inspect the provenance file.

@article{stevens2013a,
  author = {Stevens, Jean-Luc Richard and Elver, Marco and Bednar, James A},
  journal = {Frontiers in Neuroinformatics},
  link = {http://journal.frontiersin.org/article/10.3389/fninf.2013.00044/full},
  number = {44},
  pages = {44},
  publisher = {Frontiers},
  title = {{A}n automated and reproducible workflow for running and analyzing neural simulations using {L}ancet and {I}Python {N}otebook},
  volume = {7},
  year = {2013}
}

Magni (oxvig2016a) is a library that captures provenance from Python scripts for reproducibility through the instrumentation strategy.

It requires users to instrument the code with functions specifying which data it should collect. In addition to the user-defined data, when Magni creates a database for a trial, it collects the git revision number, the datetime, information about the Conda environment (optionally), information about Magni itself, and information about the platform. Additionally, it captures the main source code and the stack trace.

Magni supports storing the provenance either in a JSON file or in an HDF5 database.

@inproceedings{oxvig2016a,
  address = {Austin, TX, USA},
  author = {Oxvig, Christian Schou and Arildsen, Thomas and Larsen, Torben},
  booktitle = {Python in Science Conference},
  pages = {45--50},
  publisher = {SciPy},
  title = {{S}toring {R}eproducible {R}esults from {C}omputational {E}xperiments using {S}cientific {P}ython {P}ackages},
  year = {2016}
}

Michaelides et al. (michaelides2016a) capture provenance from scripts expressed visually in Blockly for reproducibility. They store the provenance in the Intermediate Notation for Provenance and Workflow Reproducibility (INPWR), a PROV-based format that can be turned back into Blockly visual scripts for re-execution.

They apply the overriding strategy by augmenting the functions that execute blocks in the StarJR interpreter. During collection, they record the executed Blockly blocks as tasks with their start and end times, type (e.g., sequence or if expression), value, and value type. Since every executed block produces a provenance task, this approach unwinds loop structures. In line with the reproducibility goal, Michaelides et al. (2016) replace user input blocks with literal values that represent the input provided.

In addition to transforming INPWR into executable and editable Blockly, Michaelides et al. (2016) also use the PROV-Template system to convert INPWR into PROV.

@inproceedings{michaelides2016a,
  address = {McLean, VA, USA},
  author = {Michaelides, Danius T and Parker, Richard and Charlton, Chris and Browne, William J and Moreau, Luc},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {83--94},
  publisher = {Springer},
  title = {{I}ntermediate {N}otation for {P}rovenance and {W}orkflow {R}eproducibility},
  year = {2016}
}

noWorkflow (murta2014a; pimentel2016a, pimentel2016b, pimentel2015a, pimentel2017a) captures provenance from Python scripts for comprehension. It requires no changes to the scripts for provenance collection. noWorkflow collects deployment, definition, and execution provenance.

noWorkflow collects environment variables and uses importlib to collect imported modules as deployment provenance. It traverses the AST to collect the script and its functions as definition provenance. For execution provenance, noWorkflow applies the overriding and passive monitoring strategies: it overrides the open function to capture all file accesses, and it traces the execution with a profiler to obtain executed functions with their parameters and return values. Optionally, noWorkflow also supports reading the bytecode and defining a tracer to collect variables and dependencies.
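
As a generic illustration of the passive monitoring strategy (a sketch, not noWorkflow's code), Python's standard profiling hook can observe every function call and return without any change to the monitored functions.

import sys

def profiler(frame, event, arg):
    # Invoked by the interpreter on every Python function call and return.
    if event == "call":
        print("call:", frame.f_code.co_name, dict(frame.f_locals))
    elif event == "return":
        print("return:", frame.f_code.co_name, "->", arg)

def double(x):
    return 2 * x

sys.setprofile(profiler)  # passive monitoring: double() itself is untouched
double(21)
sys.setprofile(None)      # stop monitoring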

In addition to the trial provenance, noWorkflow tracks trial evolution by associating each trial's provenance with an identifier and using this identifier to record which trials were based on it. noWorkflow supports restoring previous trials and visualizing the history.

noWorkflow stores files in a content database structured by SHA-1 hash codes and provenance in an SQLite database; thus, it supports SQL queries for analysis. In addition to SQL queries, noWorkflow provides a series of command-line tools for listing and comparing trials, activations, modules, and environment variables. It also provides a command to export trial provenance to Prolog with predefined rules, allowing users to run Prolog queries. Additionally, noWorkflow provides a web visualization tool that presents an activation graph, which shows the sequence of activations in a trial. For fine-grained provenance visualization, noWorkflow exports the dataflow as a dot file that presents all variable dependencies.

noWorkflow integrates with IPython, defining diverse IPython magics for provenance collection and querying in IPython cells. This integration also implements methods for visualization in Jupyter; thus, it is possible to load noWorkflow objects to visualize activation and dataflow graphs.

@inproceedings{murta2014a,
  address = {Cologne, Germany},
  author = {Murta, Leonardo and Braganholo, Vanessa and Chirigati, Fernando and Koop, David and Freire, Juliana},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {71--83},
  publisher = {Springer},
  title = {no{W}orkflow: capturing and analyzing provenance of scripts},
  year = {2014}
}

@inproceedings{pimentel2015a,
  address = {Edinburgh, Scotland},
  author = {Pimentel, Jo{\~a}o Felipe Nicolaci and Braganholo, Vanessa and Murta, Leonardo and Freire, Juliana},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--6},
  publisher = {USENIX},
  title = {{C}ollecting and analyzing provenance on interactive notebooks: when {I}Python meets no{W}orkflow},
  year = {2015}
}

@inproceedings{pimentel2016a,
  address = {McLean, VA, USA},
  author = {Pimentel, Jo{\~a}o Felipe and Freire, Juliana and Braganholo, Vanessa and Murta, Leonardo},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {16--28},
  publisher = {Springer},
  title = {{T}racking and analyzing the evolution of provenance from scripts},
  year = {2016}
}

@article{pimentel2017a,
  author = {Pimentel, Jo{\~a}o Felipe and Murta, Leonardo and Braganholo, Vanessa and Freire, Juliana},
  journal = {Proceedings of the VLDB Endowment},
  number = {12},
  pages = {1841--1844},
  publisher = {VLDB Endowment},
  title = {no{W}orkflow: a tool for collecting, analyzing, and managing provenance from python scripts},
  volume = {10},
  year = {2017}
}

@inproceedings{pimentel2016b,
  address = {McLean, VA, USA},
  author = {Pimentel, Jo{\~a}o Felipe and Freire, Juliana and Murta, Leonardo and Braganholo, Vanessa},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {199--203},
  publisher = {Springer},
  title = {{F}ine-grained provenance collection over scripts through program slicing},
  year = {2016}
}

Provenance Curious (huq2013a, huq2013b, huq2013c) collects definition provenance from Python scripts for comprehending the experiment. It traverses the AST and applies graph-rewriting rules to generate a summarized dataflow graph. Then, it applies the post-mortem strategy to infer provenance through probabilistic models.
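
As a rough sketch of AST-based dataflow extraction in this spirit (not Provenance Curious's implementation), Python's ast module can emit a dataflow edge for every name an assignment reads.

import ast

source = """
raw = load('data.csv')
clean = normalize(raw)
result = model(clean, raw)
"""

tree = ast.parse(source)
for node in ast.walk(tree):
    if isinstance(node, ast.Assign):
        target = node.targets[0].id  # variable being defined
        # Names read by the right-hand side (including called functions).
        inputs = [n.id for n in ast.walk(node.value) if isinstance(n, ast.Name)]
        for name in inputs:
            print(f"{name} -> {target}")  # dataflow edge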

After extracting the graph, it applies graph-rewriting rules to reduce the number of nodes and edges, and it supports user customizations for grouping processes. Provenance Curious stores the resulting graph in an SQLite database and provides an inference engine that allows users to input values and debug the model to check whether it represents the provenance. It also generates GraphML files for visualization and distribution.

@inproceedings{huq2013a,
  address = {Genoa, Italy},
  author = {Huq, Mohammad Rezwanul and Apers, Peter MG and Wombacher, Andreas},
  booktitle = {International Conference on Extending Database Technology},
  pages = {765--768},
  publisher = {ACM},
  title = {{P}rovenance{C}urious: a tool to infer data provenance from scripts},
  year = {2013}
}

@article{huq2013c,
  address = {Washington, DC, USA},
  author = {Huq, Mohammad Rezwanul and Apers, Peter MG and Wombacher, Andreas},
  journal = {IEEE Transactions on Geoscience and Remote Sensing},
  number = {11},
  pages = {5113--5130},
  publisher = {IEEE},
  title = {{A}n inference-based framework to manage data provenance in {G}eoscience {A}pplications},
  volume = {51},
  year = {2013}
}

@phdthesis{huq2013b,
  author = {Huq, Mohammad Rezwanul},
  school = {University of Twente},
  title = {{A}n inference-based framework for managing data provenance},
  year = {2013}
}

pypet (meyer2016a) is a Python library that supports defining declarative parameter specifications for experiments and manages the experiment execution to find the best parameter combination. Thus, users need to instrument and rewrite their code to collect provenance with pypet.

With pypet, users declare trajectories with parameters and use an experiment environment to execute a function with those parameters. It collects the execution results for each input combination. pypet also integrates with Sumatra and standalone version control systems to track the evolution and the definition provenance of experiments.
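
A minimal sketch along the lines of pypet's documented quickstart (exact method names may vary between pypet versions): a trajectory declares the parameters, and the environment runs the function once per explored combination.

from pypet import Environment, cartesian_product

def multiply(traj):
    # Store the result of one run for the current parameter combination.
    traj.f_add_result('z', traj.x * traj.y)

env = Environment(trajectory='example', filename='results.hdf5')
traj = env.trajectory
traj.f_add_parameter('x', 1)
traj.f_add_parameter('y', 1)
# Explore the cartesian product of the parameter ranges.
traj.f_explore(cartesian_product({'x': [1, 2, 3], 'y': [4, 5]}))
env.run(multiply)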

pypet stores the collected provenance in HDF5 files.

@article{meyer2016a,
  author = {Meyer, Robert and Obermayer, Klaus},
  journal = {Frontiers in Neuroinformatics},
  link = {https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4996826/},
  pages = {1--16},
  publisher = {Frontiers Media SA},
  title = {pypet: {A} Python {T}oolkit for {D}ata {M}anagement of {P}arameter {E}xplorations},
  volume = {10},
  year = {2016}
}

@article{meyer2015a,
  author = {Meyer, Robert and Obermayer, Klaus},
  journal = {BMC Neuroscience},
  number = {Suppl 1},
  pages = {P184},
  publisher = {BioMed Central},
  title = {pypet: a python toolkit for simulations and numerical experiments},
  volume = {16},
  year = {2015}
}

RDataTracker (lerner2014a, lerner2018a) is an R library that collects data provenance for comprehension in R scripts or console sessions. RDataTracker traces the execution and collects variable and statement dependencies. It combines the passive monitoring and overriding strategies for execution provenance collection. RDataTracker stores provenance in PROV-JSON files.

The result of provenance collection in RDataTracker is a data derivation graph that presents procedural and data nodes. Procedural nodes represent the start and end of procedural blocks and operational steps (i.e., statements). Data nodes, in turn, represent simple data, files, URLs, and errors. Procedural nodes appear sequentially linked in the graph, while data nodes appear with input and output edges to procedural nodes. The graph browser supports collapsing and expanding some procedural nodes, such as abstraction units and procedural blocks.

@inproceedings{lerner2014b,
  address = {Cologne, Germany},
  author = {Lerner, Barbara and Boose, Emery},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {1--3},
  publisher = {Springer},
  title = {P{O}ST{E}R: {R}Data{T}racker and {D}DG {E}xplorer},
  year = {2014}
}

@article{lerner2018a,
  author = {Lerner, Barbara and Boose, Emery and Perez, Luis},
  journal = {Informatics},
  number = {1},
  pages = {12},
  publisher = {Multidisciplinary Digital Publishing Institute},
  title = {{U}sing {I}ntrospection to {C}ollect {P}rovenance in {R}},
  volume = {5},
  year = {2018}
}

@inproceedings{lerner2014a,
  address = {Cologne, Germany},
  author = {Lerner, Barbara and Boose, Emery},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--4},
  publisher = {USENIX},
  title = {R{D}ata{T}racker: collecting provenance in an interactive scripting environment},
  year = {2014}
}

Sacred (greff2015a) is a Python library for configuring, organizing, and logging experiments. Sacred requires users to instrument the code, specifying config parameters and main functions.

Sacred provides decorators for instrumenting config, main, and other functions. The main function decorator also serves as a command-line argument parser that allows users to vary configurations. Before the execution, Sacred collects the source code, the configured parameters, the module dependencies, resources and artifacts (files), and information about the host. To support randomness reproducibility, it also defines and captures random seeds for the random and numpy modules. After the execution, it collects the stdout, the status code, the start and stop times, the main function result, and fail traces.
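
A minimal sketch of this decorator style, following Sacred's documented usage; the experiment name and parameters here are illustrative.

from sacred import Experiment

ex = Experiment('my_experiment')

@ex.config
def config():
    # Captured as configurable parameters; overridable from the command line.
    learning_rate = 0.01
    epochs = 10

@ex.automain  # marks the main function and installs the command-line interface
def main(learning_rate, epochs):
    print(f"training for {epochs} epochs at rate {learning_rate}")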

Sacred stores the provenance in MongoDB, in a relational database, or in JSON files. It has a simple integrated version control system that stores all file contents in MongoDB collections and keeps references to them through MD5 hashes and names. For analysis, Sacred provides a web tool and a desktop tool that present a log of executions with their configurations and results.

@inproceedings{greff2015a,
  address = {Lille, France},
  author = {Greff, Klaus and Schmidhuber, J{\"u}rgen},
  booktitle = {AutoML Workshop},
  pages = {1--6},
  publisher = {International Machine Learning Society},
  title = {{I}ntroducing {S}acred: {A} Tool to {F}acilitate {R}eproducible {R}esearch},
  year = {2015}
}

SisGExp (cruz2016a) is a web application that allows scientists to manage experiments by registering R scripts with input data. It applies the instrumentation strategy for provenance collection.

To collect provenance with SisGExp, users need to register R scripts, specifying input data, output targets, and other settings. The system also records who registered the experiment. After registering the experiments, users can set the scripts to run with Kepler to collect execution time and output data. It uses a generic Meta-Workflow for the script execution.

SisGExp stores the provenance data in a PostgreSQL database and allows users to run SQL queries or to visualize logs in the web application.

@inproceedings{cruz2016a,
  address = {McLean, VA, USA},
  author = {da Cruz, Sergio Manuel Serra and do Nascimento, Jos{\'e} Antonio Pires},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {214--217},
  publisher = {Springer},
  title = {{S}is{G}Exp: {R}ethinking {L}ong-{T}ail {A}gronomic {E}xperiments},
  year = {2016}
}

SPADE (tariq2012a) uses the LLVM framework to add provenance instrumentation at compilation time for comprehension. It modifies function calls in programs compiled by an LLVM compiler, inserting instrumentation for provenance collection; thus, it applies the overriding strategy. This approach collects function calls, arguments, and return values as execution provenance.

This approach can be used for scripting languages that use LLVM compilers (DROPBOX, 2016). However, it only accesses primitive types, since LLVM has no information about language-specific data structures. While this limitation does not prevent lineage collection, it may hinder comprehension in scripting languages that use language-specific data structures to wrap primitive types.

As the result of this approach, SPADE (tariq2012a) produces a PROV graph that presents function calls as activities. The instrumented program either prints the result to the standard output or launches a server and sends the results to SPADE through TCP sockets. Since the provenance of all function calls can be overwhelming, SPADE supports user-declared filters that discard provenance records during collection.

@inproceedings{gehani2012a,
  author = {Gehani, Ashish and Tariq, Dawood},
  booktitle = {International Middleware Conference},
  pages = {101--120},
  publisher = {Springer-Verlag New York, Inc.},
  title = {S{P}AD{E}: support for provenance auditing in distributed environments},
  year = {2012}
}

@inproceedings{tariq2012a,
  address = {Boston, MA, USA},
  author = {Tariq, Dawood and Ali, Maisem and Gehani, Ashish},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--5},
  publisher = {USENIX},
  title = {{T}owards {A}utomated {C}ollection of {A}pplication-{L}evel {D}ata {P}rovenance},
  year = {2012}
}

@inproceedings{moore2013a,
  address = {Lombard, IL, USA},
  author = {Moore, Scott and Gehani, Ashish and Shankar, Natarajan},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  publisher = {USENIX},
  title = {{D}eclaratively {P}rocessing {P}rovenance {M}etadata},
  year = {2013}
}

@inproceedings{gehani2014a,
  address = {Cologne, Germany},
  author = {Gehani, Ashish and Tariq, Dawood},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  publisher = {USENIX},
  title = {{P}rovenance-only {I}ntegration},
  year = {2014}
}

@inproceedings{gehani2011a,
  author = {Gehani, Ashish and Tariq, Dawood and Baig, Basim and Malik, Tanu},
  booktitle = {IEEE International Symposium on Policies for Distributed Systems and Networks},
  pages = {149--152},
  publisher = {IEEE},
  title = {{P}olicy-based integration of provenance metadata},
  year = {2011}
}

@inproceedings{gehani2016a,
  author = {Gehani, Ashish and Kazmi, Hasanat and Irshad, Hassaan},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {26--33},
  publisher = {USENIX Association},
  title = {{S}caling spade to \textquotedblleft big provenance\textquotedblright },
  year = {2016}
}

StarFlow (angelino2010a; angelino2011a) collects provenance from Python scripts for understanding file dependencies, abstracting workflows, and running functions in parallel. It assumes that users may want to understand experiments even before running them, and it considers dynamic analysis helpful but not fundamental for understanding. Thus, it combines dynamic runtime analysis, static analysis, and annotations to collect provenance.

StarFlow allows users to annotate functions to represent function inputs as dependencies of function outputs. It supports specifying dependencies statically, through named arguments, or dynamically, through decorators that specify which argument represents each dependency.
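
A hypothetical sketch of such annotations (the decorator name and signature are invented for illustration; StarFlow's real API may differ).

def depends(reads, writes):
    """Record that a function's output files depend on its input files."""
    def decorate(func):
        func.reads, func.writes = reads, writes  # attach the declared dependencies
        return func
    return decorate

@depends(reads=['input1.dat'], writes=['intermediate.dat'])
def step1():
    ...

@depends(reads=['intermediate.dat'], writes=['output.dat'])
def step2():
    ...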

During static analysis, StarFlow uses the Python ast module to extract most control flow, functional dependencies, and annotations as definition provenance. In addition to control flow and functional dependencies, StarFlow also observes all import statements to collect module dependencies as deployment provenance. StarFlow considers the dependency provenance self-contained in the scripts; thus, sharing the code also shares the dependencies.

During dynamic analysis, StarFlow applies the passive monitoring strategy by setting a Python tracing function that walks the function stack to collect execution provenance. The tracing function extracts function calls with their stacks and identifies which files each function accesses. StarFlow supports logging the execution provenance, comparing it to the results of the static analysis to check for consistency, or creating dependencies not captured by annotations. StarFlow also applies the overriding strategy for execution provenance collection by overriding the open function with an enriched version that collects provenance.

StarFlow uses the static dependency information to produce a pipeline graph by collating functions according to their annotations. For instance, suppose there are three annotated functions: step1, step2, and step3. The function step1 has an annotation that specifies that it reads input1.dat and writes intermediate.dat. The function step2 reads intermediate.dat and writes output.dat. Finally, step3 reads intermediate.dat and input2.dat and writes graph.png. With this information, StarFlow produces an executable pipeline where step1 executes before step2 and step3, and these two steps execute in parallel. After creating this pipeline graph, StarFlow can distribute function executions to a cluster and coordinate the transfer of dependencies. Additionally, StarFlow can use the dependency information to determine which functions should be re-executed. For instance, if only input2.dat changes, StarFlow just needs to re-execute step3.
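
The sketch below (not StarFlow code) shows how such annotations translate into an execution order for the three steps above, using Python's standard topological sorter.

from graphlib import TopologicalSorter  # Python 3.9+

# Edges derived from the annotations: a step depends on the steps that
# produce the files it reads.
graph = {
    'step1': set(),
    'step2': {'step1'},  # reads intermediate.dat, written by step1
    'step3': {'step1'},  # reads intermediate.dat (and input2.dat)
}

sorter = TopologicalSorter(graph)
sorter.prepare()
while sorter.is_active():
    ready = sorter.get_ready()  # steps whose dependencies are satisfied
    print("run in parallel:", ready)
    sorter.done(*ready)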

Since StarFlow runs over Python scripts and enables parallelization through static analysis, it supports defining abstract workflows as scripts that create other scripts with static annotations. Then, it provides some special functions for coordinating the execution of these generated scripts. Users can use abstract workflows for parameter sweeping.

Executing the StarFlow pipeline in parallel is optional. Since it uses Python scripts, users can determine the execution order themselves through common function calls. In this case, users can use the collected provenance for understanding. When provenance is used only for understanding, function annotations are optional as well, since the dynamic analysis is able to collect all executed functions as dependencies.

StarFlow stores provenance on disk in a serializable format: it supports storing CSV, XML, and RDF consistent with the OPM format. For analysis, StarFlow provides a set of Python command-line tools for navigating dependencies downstream and upstream, determining the script pipeline.

Angelino et al. (2011) integrate StarFlow with PASS through the DBAPI provided by the second version of PASS. This integration combines system provenance (i.e., files and processes) with script-specific provenance (i.e., function calls and annotations). Additionally, it allows StarFlow provenance to be stored in the PASS database.

@inproceedings{angelino2010a,
  address = {Troy, NY, USA},
  author = {Angelino, Elaine and Yamins, Daniel and Seltzer, Margo},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {236--250},
  proceedings = {LNCS 6378},
  publisher = {Springer},
  title = {{S}tar{F}low: {A} script-centric data analysis environment},
  year = {2010}
}

@inproceedings{angelino2011a,
  address = {Heraklion, Crete, Greece},
  author = {Angelino, Elaine and Braun, Uri and Holland, David A and Margo, Daniel W},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--6},
  publisher = {USENIX},
  title = {{P}rovenance {I}ntegration {R}equires {R}econciliation},
  year = {2011}
}

Sumatra (davison2012a) uses provenance to support reproducibility. It provides commands for re-executing trials and comparing previous trial results. Sumatra collects deployment, definition and execution provenance before and after each trial. Additionally, it keeps the evolution provenance and supports tagging trials.

When a user runs a program or script with Sumatra, it collects hardware information, the operating system version, and the binary executable version as deployment provenance; input files as definition provenance; and output files, terminal output, program arguments, and execution duration as execution provenance. Sumatra applies the post-mortem strategy for execution provenance collection, as it identifies files that existed before the execution and compares them to new files. For Python scripts, Sumatra also collects imported modules with their versions as deployment provenance and the script definition as definition provenance.

Sumatra uses both a version control system and a relational database (SQLite) to store provenance content. It stores metadata in the SQLite database and files in the version control system. Using version control systems gives Sumatra the ability to track provenance evolution.

Since Sumatra's goal is to support reproducibility, it provides a web application and command-line tools that list trials, summarize them, compare them, and support describing previous executions. Additionally, Sumatra provides a command-line tool for repeating trials with the same arguments; this command checks whether the collected data has changed before running.

@article{davison2014a,
  author = {Davison, Andrew P and Mattioni, Michele and Samarkanov, Dmitry and Tele{\'n}czuk, B},
  journal = {Book},
  publisher = {CRC Press},
  title = {{S}umatra: a toolkit for reproducible research},
  volume = {57},
  year = {2014}
}

@article{davison2012a,
  author = {Davison, Andrew},
  journal = {Computing in Science \& Engineering},
  number = {4},
  pages = {48--56},
  publisher = {AIP Publishing},
  title = {{A}utomated capture of experiment context for easier reproducibility in computational research},
  volume = {14},
  year = {2012}
}

Variolite (kery2017a) is an Atom plugin that allows scientists to manage variants of their scripts by selecting the parts they want to change. It applies the post-mortem strategy to collect the output of a variant execution and associate it with the variant version.

Variolite wraps the execution of the scripts to record the parameters used and the inputs and outputs of the run. It saves this data in JSON files separate from the code. Variolite lets users decide whether they want to store copies of the results or just pointers. When the user runs the code, Variolite creates a commit with the variant and the provenance. Users can also tag versions to describe what their executions were about.

Variolite allows users to navigate previous versions by checking their outputs and restoring their versions. It keeps track of all branches of variants in the code. Variolite keeps revision trees for both the files and the variant boxes; file variants use references to variant boxes.

@inproceedings{kery2017a,
  address = {Denver, USA},
  author = {Kery, Mary Beth and Horvath, Amber and Myers, Brad},
  booktitle = {Conference on Human Factors in Computing Systems},
  pages = {1--12},
  publisher = {ACM},
  title = {{V}ariolite: {S}upporting {E}xploratory {P}rogramming by {D}ata {S}cientists},
  year = {2017}
}

@inproceedings{kery2017c,
  author = {Kery, Mary Beth},
  booktitle = {IEEE Symposium on Visual Languages and Human-Centric Computing},
  pages = {321--322},
  publisher = {IEEE},
  title = {{T}ools to support exploratory programming with data},
  year = {2017}
}

Gavish and Donoho (gavish2011a) propose using provenance to create a Verifiable Computational Result (VCR) identifiable by a Verifiable Result Identifier (VRI), which serves as a link to a repository. This repository has the goals of storing provenance and supporting data and metadata analysis. Results created under the same conditions should carry the same VRI; thus, users can apply provenance for repeatability checking.

To capture provenance in scripts, they propose a plugin with implementations for R, Matlab, and Python. The plugin includes four commands for provenance tracking and transfer to a repository: verifiable, chronicled, repository, and loadvcr. Users need to instrument their scripts with these commands.

The verifiable command assigns a VRI to a variable, turning it into a VCR (i.e., a monitored object with a verifiable computational result). The chronicled command records the function activation tree with function durations. The loadvcr command imports a VCR generated by a previous computation. Finally, the repository command indicates the provenance repository URL. During the execution, the VCR plugin sends the provenance to the repository.
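
The sketch below stubs out the four commands to show how an instrumented script might read; the calling conventions are assumptions, since the paper names the commands but their exact signatures are not reproduced here.

# Stub implementations standing in for the VCR plugin's commands.
def repository(url):
    print("provenance repository:", url)

def verifiable(value):
    print("VRI assigned; the value is now a VCR")
    return value

def chronicled(func, *args):
    print("recording the activation tree of", func.__name__)
    return func(*args)

def loadvcr(vri):
    print("importing VCR", vri)
    return [1, 2, 3]

repository("https://vcr.example.org")  # where provenance is sent
prior = loadvcr("vcr:abc123")          # reuse a previous computational result
result = verifiable(chronicled(sum, prior))  # a monitored, verifiable result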

@article{gavish2011a,
  author = {Gavish, Matan and Donoho, David},
  journal = {Procedia Computer Science},
  pages = {637--647},
  publisher = {Elsevier},
  title = {{A} universal identifier for computational results},
  volume = {4},
  year = {2011}
}

versuchung (dietrich2015a) is a framework that allows users to orchestrate executable Python experiments, collect provenance from their dependencies, and aggregate data and metadata. The goal of provenance tracking is to support the replicability of the experiment by instrumenting it to collect input sources, output targets, and the actual data.

To collect provenance, users need to specify their experiments using the versuchung embedded DSL (inheriting classes and calling special functions). This allows them to specify input sources (e.g., a git repository, parameters), output targets (e.g., files, databases), and the resulting data. versuchung also collects the start and end times of the experiment, and it provides functions for monitoring shell commands and the machine environment, such as the network activity and the processor utilization.
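
The sketch below follows the experiment structure described above; the module paths, class names, and invocation are based on versuchung's documented style but are assumptions, not guaranteed verbatim.

import sys
from versuchung.experiment import Experiment
from versuchung.types import String
from versuchung.files import File

class SimpleExperiment(Experiment):
    inputs = {"greeting": String("Hello")}    # input source: a parameter
    outputs = {"result": File("result.txt")}  # output target: a file

    def run(self):
        # Experiment body; versuchung records inputs, outputs, and timing around it.
        self.outputs["result"].value = self.inputs["greeting"].value + ", world"

if __name__ == "__main__":
    SimpleExperiment()(sys.argv)  # run, allowing command-line parameter overrides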

versuchung calculates a hash based on the input parameters and creates a metadata file named after this hash in an artifact directory. This metadata file contains a plain-text representation of a Python data structure with all the collected information. versuchung supports storing the data in SQLite databases as well.

The framework itself can be used to load the metadata file for analysis; thus, its execution creates a chained experiment.

versuchung is bundled with dataref, a LaTeX package that allows users to describe data points and annotate them with metadata to describe experiments in the document. dataref also allows users to validate data points through assertions.

@article{dietrich2015a,
  author = {Dietrich, Christian and Lohmann, Daniel},
  journal = {SIGOPS Operating Systems Review},
  number = {1},
  pages = {51--60},
  publisher = {ACM},
  title = {{T}he dataref versuchung: {S}aving time through better internal repeatability},
  volume = {49},
  year = {2015}
}

Workflow Instrumentation for Structure Extraction (WISE) (acuna2015b; acuna2015c; acuna2015a; acuna2016a) captures execution provenance from Python scripts to support dataflow comprehension, allowing users to recreate experiments with workflow systems. WISE produces provenance graphs in two steps: automatic instrumentation and provenance summarization.

The automatic instrumentation step applies the overriding strategy by modifying the original script source code to include overridden versions of built-in methods. These overridden methods log events to produce a trace of the interactions between the script and the file system. As it modifies source code, WISE backs up all scripts before instrumenting them. WISE instruments not only the main script but also follows imports recursively.

WISE uses the overridden methods to record internal and external events. Internal events represent the script accessing a file. External events represent system calls or data operations (e.g., copying a file); they allow identifying which tools the script uses. During the collection of these events, WISE also performs the post-mortem strategy by comparing the files that existed before the events to the files found afterwards. Thus, it is possible to determine the outputs of these tools for the provenance graph. WISE requires the tools to state the filename in their arguments.

After collecting the provenance, WISE produces a graph and summarizes it. The graph presents events as nodes and files as edges. In addition to the event nodes, the graph contains three special nodes: source, library, and sink. The source node produces the initial input to the program. The library node produces all files that existed in the directory before the script execution. Finally, the sink node depends on every file that exists at the end of the execution. WISE represents two types of file dependencies: direct and indirect. Direct dependencies occur when an event uses the exact filename in the invocation; indirect dependencies occur when an event uses a substring to refer to the file. WISE combines all events that correspond to the execution of the same program to handle situations in which the user uses pipes or parallelism.

For summarization, WISE identifies nodes with the same character (invocation or dataflow) and combines them into regions to remove repetitions from the dataflow. WISE includes collectors and dispensers as extra nodes in the graph to keep the execution and dataflow structure equivalent to the previous one. For better comprehension, WISE optionally applies skeletonization as an extra step to remove details; skeletonization breaks the workflow equivalence. WISE produces a GraphML file with the final graph.

@mastersthesis{acuna2015b,
  author = {Acu{\~n}a, Ruben},
  school = {Arizona State University},
  title = {{U}nderstanding {L}egacy {W}orkflows through {R}untime {T}race {A}nalysis},
  website = {http://bioinformatics.engineering.asu.edu/WISE/},
  year = {2015}
}

@article{acuna2015c,
  author = {Acu{\~n}a, Ruben and Chomilier, Jacques and Lacroix, Zo{\'e}},
  journal = {Journal of Integrative Bioinformatics},
  number = {3},
  pages = {277--277},
  title = {{M}anaging and {D}ocumenting {L}egacy {S}cientific {W}orkflows},
  volume = {12},
  website = {http://bioinformatics.engineering.asu.edu/WISE/},
  year = {2015}
}

@inproceedings{acuna2016a,
  address = {Laguna Hills, California, USA},
  author = {Acu{\~n}a, Ruben and Lacroix, Zo{\'e}},
  booktitle = {International Conference on Semantic Computing},
  pages = {9--16},
  publisher = {IEEE},
  title = {{E}xtracting {S}emantics from {L}egacy {S}cientific {W}orkflows},
  website = {http://bioinformatics.engineering.asu.edu/WISE/},
  year = {2016}
}

@inproceedings{acuna2015a,
  address = {New York, USA},
  author = {Acu{\~n}a, Ruben and Lacroix, Zo{\'e} and Bazzi, Rida A},
  booktitle = {International Conference on Cloud Computing},
  pages = {114--121},
  publisher = {IEEE},
  title = {{I}nstrumentation and {T}race {A}nalysis for {A}d-{H}oc {P}ython {W}orkflows in {C}loud {E}nvironments},
  website = {http://bioinformatics.engineering.asu.edu/WISE/},
  year = {2015}
}

YesWorkflow (mcphillips2015a, mcphillips2015b) collects definition provenance for comprehension, based on simple user annotations embedded in comments of any programming language. It uses the annotations to build a workflow model that represents the script. The annotations specify program blocks with ports, and channels that connect the blocks. While such simple annotations provide a low entry bar for adoption, they may not represent what the scripts really do. To capture the definition provenance from these annotations, YesWorkflow parses the script.

In addition to definition provenance, YesWorkflow also collects execution provenance through the post-mortem strategy. YesWorkflow supports using URI templates in channels to define input and output files. The URI templates build on the observation that many scientists already use directory structures and file names to organize the data produced by scripts. After the script execution, YesWorkflow checks which files match the URI templates and collects them as execution provenance.
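
An illustrative Python script annotated with the tag vocabulary described in the papers (@begin, @in, @out, @uri, @end); the script content itself is invented.

# @begin clean_data
# @in raw_samples @uri file:data/{sample}.raw.csv
# @out clean_samples @uri file:clean/{sample}.csv
def clean_data(sample):
    # Ordinary script code; YesWorkflow only reads the comments around it.
    with open(f"data/{sample}.raw.csv") as src, \
         open(f"clean/{sample}.csv", "w") as dst:
        dst.write(src.read().lower())
# @end clean_data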

YesWorkflow exports the resulting provenance to PROV, Datalog, and GraphViz dot files. For graph visualization, it supports three formats: process-centric, data-centric, and combined views. The process-centric format presents blocks as nodes and channels as edges. The data-centric format presents channels as nodes and blocks as edges. Finally, the combined view presents both blocks and channels as nodes.

@article{mcphillips2015a,
  author = {McPhillips, Timothy and Song, Tianhong and Kolisnik, Tyler and Aulenbach, Steve and Belhajjame, Khalid and Bocinsky, Kyle and Cao, Yang and Chirigati, Fernando and Dey, Saumen and Freire, Juliana and others},
  doi = {10.2218/ijdc.v10i1.370},
  journal = {International Journal of Digital Curation},
  number = {1},
  pages = {298--313},
  title = {{Y}es{W}orkflow: a user-oriented, language-independent tool for recovering workflow information from scripts},
  volume = {10},
  year = {2015}
}

@inproceedings{mcphillips2015b,
  address = {Edinburgh, Scotland},
  author = {McPhillips, Timothy and Bowers, Shawn and Belhajjame, Khalid and Lud{\"a}scher, Bertram},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--7},
  publisher = {USENIX},
  title = {{R}etrospective provenance without a runtime provenance recorder},
  year = {2015}
}

YW*NW (dey2015a; pimentel2016c) combines YesWorkflow definition provenance collection with noWorkflow fine-grained execution provenance collection to abstract the workflow representation into only what users consider important. It uses Datalog inference rules to match YesWorkflow channels to noWorkflow variables; thus, it supports querying both YesWorkflow and noWorkflow data with Datalog.

@techreport{pimentel2016d,
  author = {Pimentel, Jo{\~a}o Felipe and Dey, Saumen and McPhillips, Timothy and Belhajjame, Khalid and Koop, David and Murta, Leonardo and Braganholo, Vanessa and Lud{\"a}scher, Bertram},
  link = {github.com/gems-uff/yin-yang-demo},
  title = {{Y}in \& {Y}ang {D}emo {G}it{H}ub},
  year = {2016}
}

@inproceedings{dey2015a,
  address = {Edinburgh, Scotland},
  author = {Dey, Saumen and Belhajjame, Khalid and Koop, David and Raul, Meghan and Lud{\"a}scher, Bertram},
  booktitle = {Workshop on the Theory and Practice of Provenance},
  pages = {1--7},
  publisher = {USENIX},
  title = {{L}inking prospective and retrospective provenance in scripts},
  year = {2015}
}

@inproceedings{pimentel2016c,
  address = {McLean, VA, USA},
  author = {Pimentel, Jo{\~a}o Felipe and Dey, Saumen and McPhillips, Timothy and Belhajjame, Khalid and Koop, David and Murta, Leonardo and Braganholo, Vanessa and Lud{\"a}scher, Bertram},
  booktitle = {International Provenance and Annotation Workshop},
  pages = {161--165},
  publisher = {Springer},
  title = {{Y}in \& {Y}ang: demonstrating complementary provenance from no{W}orkflow \& {Y}es{W}orkflow},
  year = {2016}
}