PROV-TEMPLATE: A Template System for PROV Documents
This document describes a templating system to generate PROV.
This is the first release of the prov-template specification.
Introduction
Generating provenance compatible with the PROV
data model [[prov-dm]] remains challenging. Indeed, all serializations of PROV, whether
RDF [[prov-o]], XML [[prov-xml]], text [[prov-n]], or JSON
[[prov-json]] have got their own syntactic quirks, which make them
difficult to generate directly. Likewise, specialized toolkits such as
ProvToolbox and ProvPy require non-trivial programming expertise.
Thus, recognizing that very often provenance follows patterns that are repeated during
the lifetime of an application, we propose a template system for PROV, with the following characteristics:
- It follows a declarative approach, according to which a pattern
of provenance graph can be declared, specifying some variables acting as placeholders for values to be specified; the
pattern can be instantiated multiple times by providing bindings for
these variables.
- It allows a decoupling of the code that instruments the
application and the provenance generation component, the latter being
handled automatically by means of pattern expansion.
- To avoid the proliferation of languages and serializations, patterns and bindings are themselves
expressed as PROV documents, allowing tools to be applied to them, to
analyse, check or validate them.
Namespaces
The following namespaces and prefixes are used throughout this document.
Table 1: Prefixes and namespaces used in this specification
Prefix Namespace IRI Definition
tmpl http://openprovenance.org/tmpl# The prov-template namespace
var http://openprovenance.org/var# The namespace for template variables
vargen http://openprovenance.org/vargen# The namespace for template gensym variables
prov http://www.w3.org/ns/prov# The PROV namespace
xsd http://www.w3.org/2001/XMLSchema# The XSD namespace
All other namespace prefixes are used in examples only. In
particular, IRIs starting with http://example.org/ represent
application-dependent IRIs [[RFC3987]].
Conventions
The key words "MUST", "MUST NOT", "REQUIRED", "SHALL", "SHALL NOT", "SHOULD", "SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this document are to be interpreted as described in [[RFC2119]].
Unless otherwise stated, examples throughout this document use the PROV-N Provenance Notation [[prov-n]].
In this specification, we use the term Qualified Name in accordance with [[prov-dm]], and use its syntax as specified in [[prov-n]] (see production prod-QUALIFIED_NAME).
In compliance with [[prov-dm]], we note that a qualified name can be mapped to an IRI by concatenating the IRI associated with the prefix and the local part.
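The mapping from qualified names to IRIs can be illustrated with a minimal Python sketch. The prefix table reproduces Table 1 plus the example prefix ex; the function name qualified_name_to_iri and the dictionary PREFIXES are illustrative assumptions, not part of this specification.

```python
# Prefixes from Table 1, plus the example prefix "ex" used throughout this document.
PREFIXES = {
    "tmpl": "http://openprovenance.org/tmpl#",
    "var": "http://openprovenance.org/var#",
    "vargen": "http://openprovenance.org/vargen#",
    "prov": "http://www.w3.org/ns/prov#",
    "xsd": "http://www.w3.org/2001/XMLSchema#",
    "ex": "http://example.org/",
}


def qualified_name_to_iri(qname, prefixes=PREFIXES):
    """Concatenate the namespace IRI of the prefix with the local part.

    Hypothetical helper: the name and signature are assumptions of this sketch.
    """
    prefix, separator, local = qname.partition(":")
    if not separator:
        raise ValueError(f"not a qualified name: {qname!r}")
    if prefix not in prefixes:
        raise KeyError(f"undeclared prefix: {prefix!r}")
    return prefixes[prefix] + local


print(qualified_name_to_iri("var:a"))
print(qualified_name_to_iri("ex:ag"))
```

Under these assumptions, var:a maps to http://openprovenance.org/var#a and ex:ag maps to http://example.org/ag.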
Template Definition and Examples
A PROV-template is a PROV document that:
- MUST contain a single bundle;
- MAY contain variables in the form of qualified names var:x and vargen:x in any position where such qualified names are allowed in PROV;
- MAY contain attributes in the prov-template namespace (prefix tmpl).
The following template contains two variables var:a and var:b.
document
prefix ex <http://example.org/>
prefix var <http://openprovenance.org/var#>
bundle ex:b
agent(var:a)
entity(var:b)
wasAttributedTo(var:b, var:a)
endBundle
endDocument
With the following values, ex:ag and ex:en, for var:a and var:b, respectively, the template expands to the following document:
document
prefix ex <http://example.org/>
prefix tmpl <http://openprovenance.org/tmpl#>
bundle ex:b
agent(ex:ag,[tmpl:order = "[0]"])
entity(ex:en,[tmpl:order = "[0]"])
wasAttributedTo(ex:en, ex:ag,[tmpl:order = "[0, 0]"])
endBundle
endDocument
The role of the attribute tmpl:order is explained below.
Multiple values are allowed for each variable, for instance ex:ag1, ex:ag2 for var:a, and ex:en1, ex:en2, ex:en3, for var:b. By default, the Cartesian product of the value sets forms the set of all possibilities to instantiate each statement in the template. We obtain:
document
prefix ex <http://example.org/>
prefix tmpl <http://openprovenance.org/tmpl#>
bundle ex:b
agent(ex:ag1,[tmpl:order = "[0]"])
agent(ex:ag2,[tmpl:order = "[1]"])
entity(ex:en1,[tmpl:order = "[0]"])
entity(ex:en2,[tmpl:order = "[1]"])
entity(ex:en3,[tmpl:order = "[2]"])
wasAttributedTo(ex:en1, ex:ag1,[tmpl:order = "[0, 0]"])
wasAttributedTo(ex:en1, ex:ag2,[tmpl:order = "[1, 0]"])
wasAttributedTo(ex:en2, ex:ag1,[tmpl:order = "[0, 1]"])
wasAttributedTo(ex:en2, ex:ag2,[tmpl:order = "[1, 1]"])
wasAttributedTo(ex:en3, ex:ag1,[tmpl:order = "[0, 2]"])
wasAttributedTo(ex:en3, ex:ag2,[tmpl:order = "[1, 2]"])
endBundle
endDocument
In the expanded document, the attribute tmpl:order indicates which combination of variable values is used to instantiate the current statement. In the wasAttributedTo statement, it indicates that two independent groups of variables are considered. There are two possible values (denoted by index 0 and 1) for the first group (i.e., variable var:a) and three possible values (denoted by index 0, 1, and 2) for the second group (i.e., variable var:b).
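As an illustration only, the six wasAttributedTo statements above can be reproduced with a short Python sketch. The function name expand_was_attributed_to is an assumption of this sketch; it hard-codes the single statement wasAttributedTo(var:b, var:a) and is not a general expander.

```python
from itertools import product

a_values = ["ex:ag1", "ex:ag2"]            # values for var:a (first group)
b_values = ["ex:en1", "ex:en2", "ex:en3"]  # values for var:b (second group)


def expand_was_attributed_to(a_values, b_values):
    """Instantiate wasAttributedTo(var:b, var:a) for every value combination.

    Hypothetical helper for this example only.
    """
    lines = []
    # product() varies its last argument fastest, so the var:b range is given
    # first: the var:a index then changes fastest, as in the listing above.
    for j, i in product(range(len(b_values)), range(len(a_values))):
        lines.append(
            f'wasAttributedTo({b_values[j]}, {a_values[i]},'
            f'[tmpl:order = "[{i}, {j}]"])'
        )
    return lines


for line in expand_was_attributed_to(a_values, b_values):
    print(line)
```

The first component of each tmpl:order value is the index into the values of var:a and the second is the index into the values of var:b.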
In some cases, the Cartesian product of possibilities is not desirable. For instance, consider the values ex:ag1, ex:ag2 for var:a, and ex:en1, ex:en2 for var:b. By default, every combination of these values would be used to instantiate each statement. Instead, here, we want var:b to be associated with ex:en1 whenever var:a is associated with ex:ag1, and with ex:en2 whenever var:a is associated with ex:ag2. Hence, we modify the template of Example REF by adding the tmpl:linked attribute, which indicates that variables var:a and var:b belong to the same group and change value in lockstep.
document
prefix ex <http://example.org/>
prefix var <http://openprovenance.org/var#>
prefix tmpl <http://openprovenance.org/tmpl#>
bundle ex:b
agent(var:a, [tmpl:linked='var:b'])
entity(var:b)
wasAttributedTo(var:b, var:a)
endBundle
endDocument
The expansion now looks like the following. We see that the value ex:en1 is used at the same time as ex:ag1.
document
prefix ex <http://example.org/>
prefix tmpl <http://openprovenance.org/tmpl#>
bundle ex:b
agent(ex:ag1,[tmpl:order = "[0]"])
agent(ex:ag2,[tmpl:order = "[1]"])
entity(ex:en1,[tmpl:order = "[0]"])
entity(ex:en2,[tmpl:order = "[1]"])
wasAttributedTo(ex:en1, ex:ag1,[tmpl:order = "[0]"])
wasAttributedTo(ex:en2, ex:ag2,[tmpl:order = "[1]"])
endBundle
endDocument
In the expanded document, the attribute tmpl:order indicates that a single group of variables is used for the wasAttributedTo statement: it has two possible values (denoted by index 0 and 1).
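The lockstep behaviour of linked variables can be sketched in Python by pairing values position by position instead of taking their Cartesian product. The function name expand_linked is an assumption of this sketch, and the error is raised as a plain ValueError carrying the name of the error that this specification defines for this situation.

```python
def expand_linked(a_values, b_values):
    """Instantiate wasAttributedTo(var:b, var:a) when var:a and var:b are linked.

    Hypothetical helper: both variables form a single group, so the k-th value
    of var:a is always used together with the k-th value of var:b.
    """
    if len(a_values) != len(b_values):
        # Linked variables must be bound to the same number of values.
        raise ValueError("IncorrectNumberOfBindingsForGroupVariable")
    return [
        f'wasAttributedTo({b}, {a},[tmpl:order = "[{k}]"])'
        for k, (a, b) in enumerate(zip(a_values, b_values))
    ]


for line in expand_linked(["ex:ag1", "ex:ag2"], ["ex:en1", "ex:en2"]):
    print(line)
```

Because a single group is involved, each tmpl:order value contains a single index.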
As we describe the expansion algorithm, it is useful to distinguish three types of variables.
A group variable is a variable that occurs in influencer or influencee position, in secondary position (e.g., plan in association, activity in delegation and derivation), or in mandatory identifier position (in entity, agent, or activity). The idea of a group variable is that it is instantiated as part of the Cartesian product of all possible groups in the current statement.
A statement-level variable is a variable that occurs in an attribute-value pair (either in attribute position or in value position), or that occurs in optional identifier position. While the Cartesian product of a statement's applicable groups dictates the number of instances of that statement, each such instance is given its own value (or values) for the statement-level variables it contains.
A bundle variable is a variable that occurs as identifier of a bundle. A bundle variable is intended to be associated with one value only.
For a template to be valid, a variable MUST NOT be both a statement-level variable and a group variable.
A bundle variable MAY also be a group variable in some statement, but a bundle variable MUST NOT be a statement-level variable.
A template may contain a statement-level variable. For instance, var:c is a statement-level variable occurring in value position, in the attribute-value pair prov:type='var:c'.
document
prefix ex <http://example.org/>
prefix var <http://openprovenance.org/var#>
bundle ex:b
agent(var:a)
entity(var:b)
wasAttributedTo(var:b, var:a,[prov:type='var:c'])
endBundle
endDocument
The group variables determine the number of instantiations of a given statement according to their groups' Cartesian product. So, following Example REF, we consider values ex:ag1, ex:ag2 for var:a, and ex:en1, ex:en2, ex:en3 for var:b. This leads to six different possibilities. One might expect var:c to be bound to six different values, one for each combination of values for var:a and var:b. However, attributes can be repeated in a statement, so var:c should instead be bound to six groups of values.
For instance, for the following groups of values
[{ex:t1},
{ex:t2a, ex:t2b},
{ex:t3},
{ex:t4},
{ex:t5a, ex:t5b, ex:t5c},
{ex:t6}], we obtain the following expansion.
document
prefix ex <http://example.org/>
prefix tmpl <http://openprovenance.org/tmpl#>
bundle ex:b
agent(ex:ag1,[tmpl:order = "[0]"])
agent(ex:ag2,[tmpl:order = "[1]"])
entity(ex:en1,[tmpl:order = "[0]"])
entity(ex:en2,[tmpl:order = "[1]"])
entity(ex:en3,[tmpl:order = "[2]"])
wasAttributedTo(ex:en1, ex:ag1,[prov:type = 'ex:t1', tmpl:order = "[0, 0]"])
wasAttributedTo(ex:en1, ex:ag2,[prov:type = 'ex:t2a', prov:type = 'ex:t2b', tmpl:order = "[1, 0]"])
wasAttributedTo(ex:en2, ex:ag1,[prov:type = 'ex:t3', tmpl:order = "[0, 1]"])
wasAttributedTo(ex:en2, ex:ag2,[prov:type = 'ex:t4', tmpl:order = "[1, 1]"])
wasAttributedTo(ex:en3, ex:ag1,[prov:type = 'ex:t5a', prov:type = 'ex:t5b', prov:type = 'ex:t5c', tmpl:order = "[0, 2]"])
wasAttributedTo(ex:en3, ex:ag2,[prov:type = 'ex:t6', tmpl:order = "[1, 2]"])
endBundle
endDocument
Algorithm
Variable Grouping
We define a group of variables, or group for short, as a set of group variables that are expected to change values in lockstep with each other. Groups are identified by a natural number. A pattern's grouping of group variables is a partitioning of the set of variables occurring in a pattern; each partition forms a group and is allocated a natural number.
A binding is an association between a variable and some values. If there is a binding for a variable, the variable is said to be bound.
If there is no binding for a variable, the variable is said to be unbound.
All group variables belonging to a given group MUST be bound to the same number of values, since their values have to change in lockstep. If this condition is not satisfied, it is an error situation (see error IncorrectNumberOfBindingsForGroupVariable).
A group variable SHALL NOT belong to more than one group.
The algorithm to deterministically create a grouping of variables for a pattern P is as follows.
Grouping function (Pattern: P) {
    List<Variable> variable_list = sort(extract_group_variables(P));        // sorted alphabetically by URI
    Hashtable<Variable,Set<Variable>> linked = extract_linked_variables(P); // includes transitive closure
    Grouping g = new Grouping();
    int count = 0;
    for (Variable v: variable_list) {
        if (!(belong v g)) {
            add v and linked.get(v) to g as group identified by count;
        }
        count++;
    }
    return g;
}
First, all group variables are extracted from the pattern and sorted by alphabetical order of their URIs. Then, all linked variables are computed, in the form of a map, associating a variable to the variables it is linked with. It is assumed that the transitive closure of this relation is computed here. Then, each variable (if not already inserted in the grouping) is added to the grouping with the variables it is linked with. Groups are deterministically identified, starting with value 0. This procedure is deterministic since it relies on the alphabetical sorting of group variables.
In the absence of tmpl:linked attribute, we have as many groups as variables.
In Example REF and Example REF, the grouping is as follows:
groups variables
0 var:a
1 var:b
In Example REF, the grouping consists of a single group:
groups variables
0 var:a, var:b
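A minimal Python sketch of the grouping procedure follows. It makes several assumptions that are not stated in this specification: variables are represented by their qualified names and sorted as plain strings, the tmpl:linked relation is treated as symmetric when its closure is computed, and the function names are illustrative. It follows the pseudocode literally, so the counter advances for every variable, whether or not a new group is created.

```python
def transitive_closure(links):
    """Map each variable to all variables reachable through tmpl:linked.

    links: dict from a variable to the set of variables it is directly linked
    with. Assumption of this sketch: linking is treated as symmetric.
    """
    adjacency = {}
    for v, others in links.items():
        for w in others:
            adjacency.setdefault(v, set()).add(w)
            adjacency.setdefault(w, set()).add(v)
    closure = {}
    for start in adjacency:
        seen, stack = {start}, [start]
        while stack:
            for nxt in adjacency[stack.pop()]:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        closure[start] = seen - {start}
    return closure


def grouping(group_variables, links):
    """Return a dict from each group variable to its group identifier."""
    linked = transitive_closure(links)
    groups = {}
    count = 0
    for v in sorted(group_variables):
        if v not in groups:
            groups[v] = count
            for w in linked.get(v, ()):
                groups[w] = count
        count += 1  # as in the pseudocode: incremented for every variable
    return groups


print(grouping({"var:a", "var:b"}, {}))
print(grouping({"var:a", "var:b"}, {"var:a": {"var:b"}}))
```

Without links, var:a and var:b fall into groups 0 and 1; with var:a linked to var:b, both fall into group 0. Since the counter is incremented on every iteration, group identifiers produced by this sketch are not necessarily consecutive when a template mixes linked and unlinked variables.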
Group Usage
A PROV statement MAY contain a set of group variables. A statement's group usage is the list of group identifiers corresponding to these variables, sorted in ascending order. The group usage for a statement without group variables is the empty list []. A list is written as integers separated by commas and enclosed in square brackets.
In Example REF, Example REF, and Example REF, the only variable in agent(var:a) is var:a; since var:a belongs to group 0, the current statement's group usage is [0].
Likewise, the only variable in entity(var:b) is var:b, which belongs to group 1; so, the group usage is [1].
Finally, the group variables in wasAttributedTo(var:b, var:a) and
wasAttributedTo(var:b, var:a, [prov:type='var:c'])
are var:a and var:b, with respective groups 0 and 1. So, the group usage is [0,1].
In Example REF, since there is a single group, group usage is [0] for every statement.
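Group usage can be computed with a one-line Python sketch, assuming a grouping is available as a dictionary from variables to group identifiers. The function name group_usage and the representation of a statement as a list of the qualified names it contains are assumptions of this sketch.

```python
def group_usage(statement_terms, groups):
    """Return the ascending list of distinct group identifiers used by a statement.

    statement_terms: qualified names occurring in the statement (hypothetical
    representation); terms that are not group variables are ignored.
    """
    return sorted({groups[t] for t in statement_terms if t in groups})


unlinked = {"var:a": 0, "var:b": 1}
linked = {"var:a": 0, "var:b": 0}

print(group_usage(["var:a"], unlinked))           # agent(var:a)
print(group_usage(["var:b"], unlinked))           # entity(var:b)
print(group_usage(["var:b", "var:a"], unlinked))  # wasAttributedTo(var:b, var:a)
print(group_usage(["var:b", "var:a"], linked))    # same statement, linked variables
```

A set is used so that two variables of the same group contribute a single identifier, which matches the group usage [0] reported above for the linked template.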
Binding Structure
The examples above have shown that group variables belonging to the same group evolve in lockstep, whereas statement-level variables should be bound to as many groups of values as there are possible instantiations of the statement they occur in.
A group variable, or a statement-level variable occurring in optional identifier position, is bound to a list of values. A bundle variable is bound to a single value (i.e. a list of length 1). A statement-level variable (not occurring in optional identifier position) is bound to a list of lists of values: this allows a given attribute to have 0, 1, or more occurrences in a given statement.
All group variables of a group g in a set of bindings B MUST be associated with lists of values having the same length; this length is given by number_of_variable_values(B,g). Otherwise, it is an error (see IncorrectNumberOfBindingsForGroupVariable).
TBD specifies how a set of bindings can be expressed as a PROV document.
Symbolically, the bindings for Example REF can be expressed as follows:
variable values
var:a [ex:ag1, ex:ag2]
var:b [ex:en1, ex:en2, ex:en3]
var:c [{ex:t1},
{ex:t2a, ex:t2b},
{ex:t3},
{ex:t4},
{ex:t5a, ex:t5b, ex:t5c},
{ex:t6}]
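The constraints on binding lengths can be checked with a small Python sketch. The exception classes reuse the error names of this specification, but the function signatures (in particular the extra groups argument to number_of_variable_values) and the dictionary representation are assumptions of this sketch. It also assumes that every group variable is bound.

```python
from math import prod


class IncorrectNumberOfBindingsForGroupVariable(Exception):
    pass


class IncorrectNumberOfBindingsForStatementVariable(Exception):
    pass


def number_of_variable_values(bindings, groups, g):
    """Common number of values bound to the variables of group g."""
    lengths = {len(bindings[v]) for v, gid in groups.items() if gid == g}
    if len(lengths) != 1:
        raise IncorrectNumberOfBindingsForGroupVariable(f"group {g}")
    return lengths.pop()


def check_statement_variable(bindings, groups, usage, variable):
    """Check that a statement-level variable has one value list per instantiation."""
    expected = prod(number_of_variable_values(bindings, groups, g) for g in usage)
    if len(bindings[variable]) != expected:
        raise IncorrectNumberOfBindingsForStatementVariable(variable)
    return expected


GROUPS = {"var:a": 0, "var:b": 1}
BINDINGS = {
    "var:a": ["ex:ag1", "ex:ag2"],
    "var:b": ["ex:en1", "ex:en2", "ex:en3"],
    "var:c": [["ex:t1"], ["ex:t2a", "ex:t2b"], ["ex:t3"],
              ["ex:t4"], ["ex:t5a", "ex:t5b", "ex:t5c"], ["ex:t6"]],
}

print(number_of_variable_values(BINDINGS, GROUPS, 1))
print(check_statement_variable(BINDINGS, GROUPS, [0, 1], "var:c"))
```

With the bindings above, group 1 has three values and a statement with group usage [0,1] has six instantiations, which is the number of value lists bound to var:c.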
Variable Value Indexing
Let us consider a set of bindings B and a statement's group usage [g1, ..., gn], where g1, ..., gn are group identifiers, with n ≥1 (meaning that the statement contains at least one group variable). A variable value index is a list [i1, ..., in] of naturals of length n, such that each 0 ≤ ij < number_of_variable_values(B, gj).
A variable value index for a group usage and a set of bindings B denotes a particular combination of variable values.
The expansion algorithm (see TBD) enumerates and sorts all possible variable value indices for a given group usage. For that, we rely on the following index order: in [i1, ..., in], i1 is the least significant integer, and in is the most significant integer.
So, [0,0] precedes [1,0], which precedes [0,1], which precedes [1,1].
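The index order can be read as counting in a mixed-radix number system whose first digit is the least significant. The following Python sketch enumerates indices this way; the function name variable_value_indices and its argument (the number of values of each group in the group usage, in order) are assumptions of this sketch, which also assumes at least one group with at least one value.

```python
def variable_value_indices(sizes):
    """List all variable value indices in increasing index order.

    sizes[j] is the number of values of the j-th group of the group usage
    (hypothetical representation).
    """
    total = 1
    for size in sizes:
        total *= size
    indices = []
    for k in range(total):
        rest, index = k, []
        for size in sizes:
            # Peel off digits starting with the least significant one.
            rest, digit = divmod(rest, size)
            index.append(digit)
        indices.append(index)
    return indices


print(variable_value_indices([2, 2]))
print(variable_value_indices([2, 3]))
```

For two groups of two values each, the sketch yields [0, 0], [1, 0], [0, 1], [1, 1], in agreement with the order stated above.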
Let us consider the bindings of Example REF.
The statement
agent(ex:ag1,[tmpl:order = "[0]"])
in the expansion of Example REF has index [0] for group usage [0], denoting value ex:ag1.
The statement
entity(ex:en3,[tmpl:order = "[2]"])
in the expansion of Example REF has index [2] for group usage [1], denoting value ex:en3.
The statement
wasAttributedTo(ex:en2, ex:ag2,[prov:type = 'ex:t4', tmpl:order = "[1, 1]"])
in the expansion of Example REF has index [1,1] for group usage [0,1], denoting values ex:ag2 (for var:a, in group 0) and ex:en2 (for var:b, in group 1).
Expansion
Expansion proceeds by processing each statement in turn, instantiating variables according to the following algorithm.
Bundle expansion(Set of bindings: B, Bundle b) {
    bid = substitution(bundle_variable(b),B);
    l = empty list of statements;
    for each statement s in bundle b {
        u = the group usage for s;
        count = 0;
        for all possible variable value index i (sorted by increasing index order) for u {
            env = the effective bindings for B and i;
            s2 = substitution_group_variable(s,env);
            s3 = substitution_statement_level_variable(s2,B,count);
            s4 = set_tmpl_order_attribute(s3,i);
            count++;
            add s4 to l
        }
    }
    return new bundle with bid and l;
}
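The statement-expansion loop can be sketched in Python under simplifying assumptions that are not part of this specification: a statement is a tuple (kind, arguments, attributes), group variables only occur among the arguments, statement-level variables only occur as attribute values, all variables are bound, and the bundle identifier is left untouched. All names below are illustrative.

```python
from itertools import product


def expand_statement(statement, bindings, groups):
    """Expand one statement into its instances, in increasing index order."""
    kind, args, attrs = statement
    usage = sorted({groups[a] for a in args if a in groups})
    # Number of values per group, read from one variable of each group.
    sizes = [len(next(bindings[v] for v in groups if groups[v] == g)) for g in usage]
    expanded = []
    # product() varies its last range fastest; feeding it the reversed sizes
    # and reversing each tuple makes the first group the least significant.
    ranges = [range(s) for s in reversed(sizes)]
    for count, rev in enumerate(product(*ranges)):
        index = list(reversed(rev))
        # Effective bindings: one value per group variable for this index.
        env = {v: bindings[v][index[usage.index(g)]]
               for v, g in groups.items() if g in usage}
        new_args = [env.get(a, a) for a in args]
        new_attrs = []
        for name, value in attrs:
            if value.startswith(("var:", "vargen:")):
                # Statement-level variable: use the count-th list of values.
                new_attrs.extend((name, val) for val in bindings[value][count])
            else:
                new_attrs.append((name, value))
        new_attrs.append(("tmpl:order", str(index)))
        expanded.append((kind, new_args, new_attrs))
    return expanded


def expand_bundle(statements, bindings, groups):
    """Expand every statement of a bundle in turn."""
    return [inst for s in statements for inst in expand_statement(s, bindings, groups)]


GROUPS = {"var:a": 0, "var:b": 1}
BINDINGS = {
    "var:a": ["ex:ag1", "ex:ag2"],
    "var:b": ["ex:en1", "ex:en2", "ex:en3"],
    "var:c": [["ex:t1"], ["ex:t2a", "ex:t2b"], ["ex:t3"],
              ["ex:t4"], ["ex:t5a", "ex:t5b", "ex:t5c"], ["ex:t6"]],
}
TEMPLATE = [
    ("agent", ["var:a"], []),
    ("entity", ["var:b"], []),
    ("wasAttributedTo", ["var:b", "var:a"], [("prov:type", "var:c")]),
]

for instance in expand_bundle(TEMPLATE, BINDINGS, GROUPS):
    print(instance)
```

The per-statement counter is what selects the list of values for a statement-level variable, so the order in which indices are enumerated determines which instance receives which attribute values.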
Table 2 summarizes the syntax and meaning of variables and attributes accepted in a PROV template.
Kinds of parameters and variables supported by the templating system
template variable definition
var:x A variable x to be replaced by its value according to the expansion algorithm. If no binding is found, the following rules are applied. If the variable occurs in attribute position, the attribute is dropped. If in optional position of a statement (see [[prov-n]], section 2.4), the variable is dropped. If in mandatory position of a statement, it is an error situation (see error UnboundMandatoryVariable).
vargen:x A variable x to be replaced by its value according to the expansion algorithm. If no binding is found, the following rules are applied. If the variable occurs in attribute position, a unique qualified name (uuid) is generated. If in optional position of a statement (see [[prov-n]], section 2.4), the variable is dropped. If in mandatory position of a statement, a unique qualified name (uuid) is generated.
template parameter definition
tmpl:linked An attribute associated with a value that MUST be a qualified name also acting as a template variable v2 (with either var or vargen namespace prefix). Its presence in a term with identifier v1 indicates that the variable v2 changes value synchronously with v1.
tmpl:label An attribute associated with a value that MUST be a qualified name var:v also acting as a template variable. If bound, variable var:v MUST be bound to xsd:string values. The expanded current term will contain a prov:label for each value.
tmpl:time An attribute associated with a value that MUST be a qualified name var:v also acting as a template variable. This attribute may only occur in a Generation, Usage, Invalidation, Start, or End term. If var:v is bound, variable var:v MUST be bound to xsd:dateTime values. The expanded current term will be provided the corresponding time information.
tmpl:startTime An attribute associated with a value that MUST be a qualified name var:v also acting as a template variable. This attribute may only occur in an Activity term. If var:v is bound, variable var:v MUST be bound to xsd:dateTime values. The expanded current activity will be provided the corresponding start time information.
tmpl:endTime An attribute associated with a value that MUST be a qualified name var:v also acting as a template variable. This attribute may only occur in an Activity term. If var:v is bound, variable var:v MUST be bound to xsd:dateTime values. The expanded current activity will be provided the corresponding end time information.
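The rules for unbound var and vargen variables can be summarized by a small Python sketch. The position labels ("attribute", "optional", "mandatory"), the DROP sentinel, and the prefix gen used for generated names are assumptions of this sketch; this specification does not say in which namespace generated names live.

```python
import uuid


class UnboundMandatoryVariable(Exception):
    pass


DROP = object()           # sentinel: the attribute or optional term is omitted
GENERATED_PREFIX = "gen"  # hypothetical prefix for generated qualified names


def resolve_unbound(variable, position):
    """Decide what replaces an unbound variable occurring at a given position."""
    if position not in ("attribute", "optional", "mandatory"):
        raise ValueError(f"unknown position: {position!r}")
    prefix = variable.split(":", 1)[0]
    if prefix not in ("var", "vargen"):
        raise ValueError(f"not a template variable: {variable!r}")
    if position == "optional":
        return DROP  # same rule for var and vargen
    if prefix == "var":
        if position == "attribute":
            return DROP
        raise UnboundMandatoryVariable(variable)
    # vargen in attribute or mandatory position: generate a fresh name.
    return f"{GENERATED_PREFIX}:{uuid.uuid4()}"


print(resolve_unbound("var:x", "attribute") is DROP)
print(resolve_unbound("vargen:x", "mandatory"))
```

Each call for a vargen variable produces a fresh name, so an expander that wants the same generated name for every occurrence of a variable within one instantiation would need to cache the result; whether that is required is not settled by the table above.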
The idea of the tmpl:order attribute is that it should help
the reconstruction of the set of bindings from an expanded document
and the corresponding template. It is an open question as to whether
this is possible with the current representation.
Dong: Regardless of the feasibility of reconstructing the original bindings, I think this attribute should be optional as it would clutter the resulted provenance, making it look untidy. One could argue that the attributes are artifacts of the templating system and not really asked for.
Should we rename tmpl:order to tmpl:index? Dong: +1
Attribute in expanded document
expanded instance parameter definition
tmpl:order An attribute added to a statement by the expansion process; its value is the index used to compute the actual binding that defined the current instantiation.
Set of Bindings
Example REF shows a symbolic representation of bindings. One should note that to be self-contained a set of bindings should also contain prefix declarations for all qualified names.
The set of bindings in Example REF can be expressed as a simple JSON dictionary as follows:
{
  "var": {
    "b": [ { "@id": "ex:en1" },
           { "@id": "ex:en2" },
           { "@id": "ex:en3" } ],
    "a": [ { "@id": "ex:ag1" },
           { "@id": "ex:ag2" } ],
    "c": [ { "@id": "ex:t1" },
           [ { "@id": "ex:t2a" },
             { "@id": "ex:t2b" } ],
           { "@id": "ex:t3" },
           { "@id": "ex:t4" },
           [ { "@id": "ex:t5a" },
             { "@id": "ex:t5b" },
             { "@id": "ex:t5c" } ],
           { "@id": "ex:t6" } ]
  },
  "vargen": {},
  "context": {
    "ex": "http://example.org/"
  }
}
At the top level, we find an entry var for var variables, an entry vargen for vargen variables, and a prefix declaration context.
For each variable (whether var or vargen), we find an entry for that variable (using its local name as key). Possible values are expressed as an array.
The syntax for values borrows from [[json-ld]] to encode values and their types, where appropriate:
- strings: "abc"
- integers: 123
- float: 3.1415
- qualified names: { "@id": "ex:t5a" } with prefix ex expected to be declared in context.
- string with language: { "@value": "bonjour", "@language": "french" }
- arbitrary xsd type: { "@type": "xsd:dateTime", "@value": "2002-05-30T09:30:10.5" } with prefix xsd known to the template expander
The element context maps prefixes to namespace URIs. The prefixes xsd and prov are reserved and therefore MUST NOT be declared in that dictionary. They have the same meaning as in the PROV-N specification.
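Reading such a JSON dictionary can be sketched in Python with the standard json module. The tagged tuples returned by decode_value, the function names, and the choice to normalize every array entry into a list of values are assumptions of this sketch; language-tagged strings are deliberately left out.

```python
import json

RESERVED_PREFIXES = {"xsd", "prov"}


def decode_value(value, context):
    """Turn one JSON value into a tagged tuple (hypothetical representation)."""
    if isinstance(value, bool):
        raise TypeError("booleans are not among the listed value kinds")
    if isinstance(value, str):
        return ("string", value)
    if isinstance(value, int):
        return ("integer", value)
    if isinstance(value, float):
        return ("float", value)
    if isinstance(value, dict) and "@id" in value:
        prefix, _, local = value["@id"].partition(":")
        return ("qualified_name", context[prefix] + local)
    if isinstance(value, dict) and "@type" in value:
        return ("typed", value["@type"], value["@value"])
    raise TypeError(f"unsupported value: {value!r}")


def load_bindings(text):
    """Parse a bindings document; each array entry becomes a list of values."""
    doc = json.loads(text)
    context = doc.get("context", {})
    if RESERVED_PREFIXES & set(context):
        raise ValueError("xsd and prov MUST NOT be declared in context")
    result = {}
    for kind in ("var", "vargen"):
        for name, entries in doc.get(kind, {}).items():
            result[f"{kind}:{name}"] = [
                [decode_value(v, context) for v in (e if isinstance(e, list) else [e])]
                for e in entries
            ]
    return result


EXAMPLE = """
{ "var": { "a": [ { "@id": "ex:ag1" }, { "@id": "ex:ag2" } ],
           "c": [ { "@id": "ex:t1" }, [ { "@id": "ex:t2a" }, { "@id": "ex:t2b" } ] ] },
  "vargen": {},
  "context": { "ex": "http://example.org/" } }
"""
print(load_bindings(EXAMPLE))
```

Normalizing single values into one-element lists gives group variables and statement-level variables the same shape, which is a convenience of this sketch rather than a requirement of the format.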
Errors
The following error situations have been identified in the specification prov-template.
UnboundMandatoryVariable: when a var:x variable occurs in mandatory position of a statement and is unbound.
IncorrectNumberOfBindingsForGroupVariable: when group variables of a given group are not bound to the same number of values.
IncorrectNumberOfBindingsForStatementVariable: when a statement-level variable is not bound to the same number of value lists as there are instantiations of the statement it occurs in.
Implementation
ProvToolbox's
executable provconvert can be called from the command line to
perform template expansion as follows:
provconvert -infile template.provn -outfile expanded.provn -bindings bindings.provn
The input file template.provn, the output file expanded.provn, and the bindings file bindings.provn may be encoded according to any serialization supported by ProvToolbox.