ProvBatch: Plugin implementation guide
This document describes the interface of plugins for ProvBatch.
Plugins are used for computing attributes, i.e. values associated to each entity. Each plugin computes one or multiple attributes. Each attribute can consist in a single value or a group of values (e.g. a computation result and a score).
Plugins can be written in Python, Nim, Rust and Bash. The plugins system is implemented as a separated package (MultiPlug).
Public Interface of ProvBatch plugins
A plugin must provide the following attributes:
ID
,VERSION
,INPUT
,OUTPUT
: constants describing the plugin itself, the data returned by it and the resources necessary for the computationcompute()
: a function which computes the value of the attribute(s) for a given entity; it may accept computation parameters
Furthermore, a plugin may provide additiional attributes:
METHOD
,IMPLEMENTATION
,REQ_SOFTWARE
,REQ_HARDWARE
,ADVICE
,PARAMETERS
: further metadata constants for plugin descriptioninitialize()
andfinalize()
: functions for setting up and deleting resources used for an entire batch computation (e.g. library or database initializations); these functions are called only once (at the beginning and end of the batch computation), whilecompute
is callled for each entity of the batch
Compute function
The plugin shall export the compute(entity, **kwargs)
function:
entity
: identifier of the input data, or name of the file with the input datakwargs
: additional optional named parameters (described in the PARAMETERS constant, see “Metadata constants section” below); note the special role of thestate
keyword argument if initialize() is defined, see “Shared resources for batch computations” below)
The return value of the method is a 2-tuple (results, logs)
where:
results
: list of values, in the order specified by theOUTPUT
constant (see below),logs
: a possibly empty strings list, containing messages to be displayed to the user or stored in a log file; suggested format: “{key}\t{message}”, where key identifies the type of log message.
Metadata constants
The plugin communicates its purpose, version and interface by defining the following constants.
Mandatory (their combination must be unique among all plugins):
ID
: name of the plugin [str, max len 256]VERSION
: a version number or name [str, max len 64]
Mandatory:
INPUT
: short description of the input (type of data, format, source) [str, max len 512]OUTPUT
: list of strings (attribute ids), describing the order or the attributes in the output; each of them must be defined inattribute_definitions.yaml
.
Required if compute() accepts named parameters:
PARAMETERS
: optional named parameters of the compute function; list of 4-tuples of strings: (name, datatype, default_value, documentation)
Optional: (string, max len 4096)
METHOD
: how the results are computed, conceptuallyIMPLEMENTATION
: how the method is implemented, technicallyREQ_SOFTWARE
: required tools or librariesREQ_HARDWARE
: required hardware resources (memory, GPUs…)ADVICE
: when should this method used instead of others
Common resources for batch computations
Sometimes common resources are needed by multiple instances of a batch computation For example, it could be necessary to parse data files, to load some data into memory, or to initialize a connection to a database. Or some statistics could be collected during the computation and output at the end.
The convention in Prenacs is that such resources are passed around (if they
exist) as a variable called state
. Due to the dynamic nature of Python
variables, this can be a single resource or a container (e.g. a dictionary)
to access multiple heterogenous resources.
Creating common resources
There are different ways to create resources which are accessed by all instances of the batch computation.
First, the (initial) value of the state
variable can be passed as one of the
parameters (called state
) of the batch computation.
Second, an initialization function can be implemented, with the signature
initialize(**kwargs)
. If this is provided, the function is called before the
first instance of the batch computation. The batch computation parameters are
passed to it as keyword arguments. The return value of the initialize
function is passed to the compute
function as a keyword argument
called state
.
Third, these two approaches can be combined: i.e. a value for a state
variable can be specified in the batch computation parameters. If an
initialize
function is provided, this reads the state
variable, along
with any other parameter. The original value of state
is then overwritten with the return value of the initialize
function
and can be accessed by the compute
functions.
Accessing common resources
The common resources are accessed by the plugins compute
function
using a keyword parameter called state
. Thus if common resources are needed,
the function will take the signature: compute(entity, state=None, **kwargs)
.
While accessing the common information (in particular when this is not read-only), it should be considered, that the computation can be started in parallel (this is the default mode in the prenacs-batch-compute script).
Finalizing common resources
In some cases, it is necessary to do finalization operations at the end of the
batch computation - e.g. close some files or connections, or write some
information, collected during the computation.
In such cases a finalize(state)
function can be implemented in the plugin.
It takes the final value of the state
variable, and it is executed
after the last instance of compute
.
Non-Python plugins
For details on how to implement plugins in Nim, Rust and Bash, please refer to the MultiPlug package user manual. This section provides some additional information which applies specifically to ProvBatch plugins.
Nim plugins
The compute function has the following signature, if there are no optional parameters:
proc compute(entity: string):
tuple[results: seq[string]], logs: seq[string]] {.exportpy.}
Any accepted optional parameter must be defined in the signature. E.g. given two optional named parameters p1 of type t1 and default value d1 and p2 of type t2 and default value d2, the signature becomes:
proc compute(entity: string, p1: t1 = d1, p2: t2 = d2):
tuple[results: seq[string]], logs: seq[string]] {.exportpy.}
Shared resources can be e.g. stored in a ref object, which is then passed back and forth to Python:
type
PluginState = ref object of RootObj
state: int
# initialize returns the state
proc initialize(): PluginState {.exportpy.} = PluginState(state = 0)
# it can have optional keyword arguments
proc initialize(s: int = 0): PluginState {.exportpy.} = PluginState(state = s)
# if initialize is provided, then compute must have a mandatory `state`
# keyword argument of the same type as the return value of initialize
proc compute(filename: string, state: PluginState, ...
# finalize is optional and takes the state as only argument
proc finalize(s: PluginState) {.exportpy.} = discard
Rust plugins
The interface of the compute function when written in Rust is:
#[pyfunction] fn compute(filename: &str) -> PyResult<(...)> { ... Ok(...)