Prenacs User manual
Implementing a plugin
To perform a computation, a plugin must be implemented. A plugin determines how one or more attributes of entities are computed. It also contains metadata, which is stored alongside the computation results in order to document the data provenance.
The plugin system is generic and implemented as a separate package, named MultiPlug. Plugins for Prenacs must implement a specific interface, which is described in the plugin implementation guide.
If a plugin that has already been used for computations changes, its version number must be incremented, so that a new plugin metadata record is created when the new computation results are loaded into the database.
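As an illustration, a minimal plugin module could look as follows. This is only a sketch: the OUTPUT and PARAMETERS constants and the compute() function are described later in this manual, while the exact names of the metadata constants (e.g. VERSION) and the exact return value convention must be checked in the plugin implementation guide:

# minimal_gc.py -- sketch of a Prenacs plugin (details to be checked
# against the plugin implementation guide)

VERSION = "1.0"  # assumed constant name; increment after any change to a
                 # plugin which has already been used for computations

OUTPUT = ["gc_content"]  # computed attributes, in output column order

PARAMETERS = {}  # description of the keyword parameters of compute()

def compute(entity, **params):
    # ``entity`` is an entity ID (ids subcommand) or an input
    # filename (files subcommand); here a FASTA filename is assumed
    with open(entity) as f:
        seq = "".join(l.strip() for l in f if not l.startswith(">"))
    gc_content = (seq.count("G") + seq.count("C")) / len(seq)
    # assumed convention: attribute values plus log messages
    return [gc_content], []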
Checking the plugin implementation
To check whether a plugin has been implemented correctly, the prenacs-check-plugin script can be used. It loads the plugin module, analyses the programming interface exposed by the plugin, and reports any errors to the user.
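For example, assuming the plugin module path is passed as the only argument:

prenacs-check-plugin minimal_gc.py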
Database setup
The library was designed and tested using MariaDB. Other database management systems can be used, in which case some adjustments may be needed. The database must already be set up and the database server must be running before the library can be used.
The following parameters are passed to the scripts which connect to the database:
--dbuser: database user to use
--dbpass: password of the database user
--dbname: database name
--dbsocket: connection socket file
--dbpfx: database tables prefix
Before the first use, the script prenacs-setup-database must be run, which creates the necessary metadata tables.
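For example, assuming the same connection arguments as in the examples shown below:

prenacs-setup-database <dbuser> <dbpass> <dbname> <dbsocket>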
Managing the attributes
Before running a computation, the attributes computed by the plugin must be added to the database. For this, the prenacs-manage-attributes script is provided. The attribute metadata must be described in a YAML file, which contains a dictionary with one key for each attribute.
Attribute metadata
The attribute metadata file contains a dictionary where each key is a string (the attribute name) and each value is a dictionary (the attribute metadata).
The following keys must be defined in each attribute metadata dictionary:
definition: a free text describing the attribute
datatype: the datatype of the attribute, expressed as described in the datatypes section of the attrtables package documentation
Furthermore, the following keys can be defined:
computation_group: multiple attributes which are usually computed together by plugins shall contain a common unique computation group identifier in this field
ontology_xref: link to an ontology term which defines the same attribute
related_ontology_terms: links to ontology terms which do not directly describe the attribute, but are related to its definition
unit: the measure unit for the attribute
remark: a free text remark
Example
For example, a file basic_seqstats.yaml could contain:
gc_content:
  definition: fraction of the total bases of a sequence, which are G or C
  datatype: Float
  computation_group: basic_seqstats
seqlen:
  definition: sequence length
  unit: bases
  datatype: Integer
  computation_group: basic_seqstats
The database can be prepared for storing the attributes defined in the file, using:
prenacs-manage-attributes <dbuser> <dbpass> <dbname> <dbsocket> \
basic_seqstats.yaml
Changing attribute metadata with unchanged datatype and computation group
If neither the datatype nor the computation_group of an attribute has changed, the prenacs-manage-attributes script can be run using the --update option. This updates the attribute definition record.
For example, the metadata of GC content given in the previous section could be given slightly differently in a file gc_content.yaml:
gc_content:
  definition: portion of bases of a sequence, which are G or C
  datatype: Float
  computation_group: basic_seqstats
Since the definition changes, but not the datatype or computation_group, this metadata could be used for updating the gc_content attribute record, as follows:
prenacs-manage-attributes <dbuser> <dbpass> <dbname> <dbsocket> \
--update gc_content.yaml
Changing attribute metadata with changed datatype or computation group
If the datatype or computation_group of an attribute has changed, the attribute must be dropped from the database before it is added again. This means that all existing data for the attribute will be lost! The attribute is dropped by running the prenacs-drop-attribute script. After that, it can be re-added by running prenacs-manage-attributes with the attribute definition file.
For example, the metadata of GC content given in the previous section could be given with a different computation_group value in a file seq_composition_stats.yaml:
gc_content:
  definition: fraction of the total bases of a sequence, which are G or C
  datatype: Float
  computation_group: seq_composition_stats
To use this definition, the existing data for GC content must first be deleted; then the attribute definition is re-added, using:
# be careful: this deletes all the data for the attribute!
prenacs-drop-attribute <dbuser> <dbpass> <dbname> <dbsocket> gc_content
prenacs-manage-attributes <dbuser> <dbpass> <dbname> <dbsocket> \
seq_composition_stats.yaml
Batch computing
The prenacs-batch-compute script is used for running a computation using a plugin. This assumes that the attributes computed by the plugin have been added to the database, as explained in the previous section.
By default the computation is run in parallel, using the multiprocessing module. If a serial computation is desired, the --mode serial option can be used. The computation can also be run on a Slurm cluster (see below).
The batch compute function must locate the entities on which the computation shall be run. The input entities can be provided as a set of entity identifiers, either as a list or in a column of a tabular file. Alternatively, a set of input files, one per entity, can be provided (specified using a glob pattern). These different options are explained in the following sections.
For performance reasons, the computation results are not loaded directly into the database; instead, they are first saved to a tabular file. By default the results are written to the standard output, but the --out option can be used to specify a different file. The first column contains the entity IDs. The following columns are the results of the computation, in the order specified by the plugin OUTPUT constant. Note that each attribute can in some cases span multiple columns.
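For instance, for a plugin computing the gc_content and seqlen attributes of the basic_seqstats example above, the results file could contain tab-separated lines such as (values purely illustrative):

entity1    0.42    1523
entity2    0.55    980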
Besides the attribute values, a computation report is output, which contains the computation metadata in YAML format. This file is necessary in order to load the computation results into the database. By default the report is written to the standard error, but a different file can be specified using the --report option. Additional metadata for the report can be provided using further options (see below).
Finally, a log file with additional information is output (by default to the standard error, but a different file can be specified using the --log option). This contains any further information about the computation returned by the plugin compute function.
The log output is tabular (tab-separated). The first column contains the entity IDs. The following columns contain the information messages returned by the plugin. Each entity attribute computation can generate zero, one or multiple lines.
Running on a Slurm cluster
The computation can be run on a computer cluster managed by Slurm.
For this to work, the prenacs-batch-compute script must be called with the option --mode slurm.
The user must provide, using the option --slurm-submitter <PATH_TO_SCRIPT>, the path to a Bash script which is passed to the sbatch command. An example is given in the library source code, prenacs/submit_array_job.sh, and this file should be used as a template.
Temporary files are used to perform a computation on a Slurm cluster. By default, these files are created in the current directory and removed after the computation is done. The directory for these files can be specified with the option --slurm-outdir.
Slurm job arrays are used to perform the computation, with one task assigned to each input entity. If any of the tasks fails, the ID of the affected entity, the reason for the failure, and the task ID in the Slurm job array are written to a tab-separated file named failed_tasks_{JOB_ID}.err, where JOB_ID is the Slurm job ID. If all tasks fail, no such file is created; instead, the user is informed by a message. The status of the job array is checked and reported to the user every 15 seconds.
Even if some tasks fail, the results of the remaining completed tasks are still collected and reported in the output file. The user can then attempt to compute the failed tasks again, using the information provided in the file failed_tasks_{JOB_ID}.err. This is especially useful for jobs that require a large amount of computation but have only a few failed tasks.
Input entities provided as a set of identifiers
If the input entities are specified as a set of entity IDs, the ids subcommand of prenacs-batch-compute is used. In this case the filename of the IDs file must be provided. The file shall be either a list of entity IDs, with one ID per line, or a tabular file in which the entity IDs are contained in a column. In the second case, the 1-based column number must also be specified to prenacs-batch-compute.
The IDs are passed to the plugin compute() function as its first argument and are used as the first column of the output.
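A hypothetical invocation (the exact argument order shall be checked in the script help; only the subcommand and the options are taken from this manual) could be:

prenacs-batch-compute ids <plugin> <idsfile> --out results.tsv --report report.yaml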
Computed entity IDs
It is also possible to use a list or tabular file containing IDs from which the entity IDs can be computed, instead of the entity IDs themselves.
To achieve this, a module is created which defines a compute_id(str)->str mapping function; the input of the function is the ID provided in the ID list and the output is the entity ID. The module filename is passed to prenacs-batch-compute using the idsproc option.
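For example, if the IDs in the list are versioned accessions while the entity IDs are unversioned (a hypothetical setup), the module could be:

# strip_version.py -- hypothetical idsproc module
def compute_id(input_id: str) -> str:
    # e.g. "ABC123.2" -> "ABC123"
    return input_id.split(".")[0]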
Input entities provided as a set of input files
If the input entities are provided as a set of files, the files subcommand of prenacs-batch-compute is used. The compute() function of the plugin then gets the filename as its first argument.
In most cases, it is desirable to compute the entity IDs which shall be used as the first column of the output. To achieve this, a module is created which defines a compute_id(str)->str mapping function; the input of the function is the filename and the output is the entity ID to use in the output. The module filename is passed to prenacs-batch-compute using the idsproc option.
If no idsproc module is provided, the first column of the output contains the input filename instead of an entity ID.
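For example, a module deriving the entity ID from the input filename (here, hypothetically, the basename without its extension) could be:

# file_id.py -- hypothetical idsproc module for the files subcommand
import os

def compute_id(filename: str) -> str:
    # e.g. "data/entity1.fas" -> "entity1"
    return os.path.splitext(os.path.basename(filename))[0]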
Incremental computations
It is possible to automatically avoid recomputation for entities for which a value has already been computed previously.
If a filename is provided using the --skip option, the entity IDs of previous computations are read from the first column of this tabular file. If the ID of an entity in the input (possibly after applying idsproc) is equal to one of the entity IDs of previous computations, then the computation is skipped.
If the --out option is used and the output file already exists, the existing output file is also used as a --skip file.
Computation parameters
The computation parameters can be provided to the plugin as a YAML file, using the --params option. The parameters are passed to the compute() function as keyword arguments. The type and meaning of the parameters must be described in the PARAMETERS constant of the plugin.
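For example, a parameter file with hypothetical parameter names could contain the following; each top-level key becomes a keyword argument of compute():

min_length: 100
alphabet: ACGT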
The computation parameters are assumed to remain constant during the computation, i.e. the same value is seen by each instance of compute().
If a parameter is named state, it is handled in a special way (see next section). For this parameter, it is not assumed that the value remains constant, and no description is necessary in the PARAMETERS plugin metadata constant.
Computation state
In contrast to the computation parameters, a computation state can be defined, which may change during the computation. When handling the computation state, it must be considered that the computation is by default run in parallel, using multiple processes.
The state is created by defining an initialize() function in the plugin. This function is run at the beginning of the batch computation, and its return value is passed as the keyword parameter state to the compute() function. If necessary, a finalize() function can be defined by the plugin, which can perform teardown operations at the end of the batch computation.
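The following sketch illustrates the mechanism; the content of the state, the return value convention of compute() and the exact signature of finalize() are assumptions, and the keyword arguments of initialize() are explained in the next section:

# excerpt of a plugin using a computation state
def initialize(**state_params):
    # run once before the batch computation; the return value
    # is passed to compute() as the ``state`` keyword parameter
    return {"offset": state_params.get("offset", 0)}

def compute(entity_id, state=None, **params):
    # with parallel execution, each worker process may see
    # its own copy of the state
    return [len(entity_id) + state["offset"]], []

def finalize(state):
    # teardown at the end of the batch computation
    # (whether finalize() receives the state is an assumption)
    pass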
State initialization parameters
It is possible to provide parameters to the initialization function. For this, a parameter in the file passed using the --params option shall be named state. This is handled in a special way: it is assumed to be a dictionary with string keys. Its content is passed to the initialize() function as keyword arguments. The value of state is then overwritten by the return value of initialize().
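For example, together with the initialize() sketch of the previous section, the parameter file (with hypothetical keys) could contain:

state:
  offset: 10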
Report file metadata
The report file contains the username of the user who started the computation. By default, this is determined using getpass.getuser(). However, a different value can be provided using the --user option. Furthermore, the name of the system on which the computation is run is added to the report. By default, this is determined using socket.gethostname(). However, a different value can be provided using the --system option.
Optionally, the reason for the computation can be stated and added to the report. The valid values are contained in the REASONS constant of the Report class of the report.py module and are:
new_entities: the computation was run on entities which did not yet exist in the database
new_attributes: the computation was run on entities which already existed in the database, in order to compute new attributes
recompute: the computation was run on entities which already existed in the database, for attributes for which a value had already been computed before
Loading the computation results
To load the results of the computation into the database, the prenacs-load-results script is used. The output files of prenacs-batch-compute (results and computation report) are passed to it. The same plugin used for the batch computation must also be provided, so that the plugin metadata can be stored in the database.
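A hypothetical invocation (the exact argument order shall be checked in the script help; only the script name and the required inputs are taken from this manual) could be:

prenacs-load-results <dbuser> <dbpass> <dbname> <dbsocket> \
  results.tsv report.yaml <plugin>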