datazimmer
To create a new project
make sure that python points to python>=3.8 and that you have pip and git
then install the package with
pip install datazimmer
run
dz init project-name
this pulls project-template
add a remote, both to git and dvc (you can run
dz build-meta
to see the available dvc remotes)
the git remote can be given with
dz init
create, register and document steps in a pipeline you will run in different environments
build metadata to exportable and serialized format with
dz build-meta
if you defined importable data from other artifacts in the config, you can import them with
load-external-data
ensure that you import envs that are served from sources you have access to
build and run pipeline steps by running
dz run
validate that the data matches the datascript description with
dz validate
Scheduling
a project as a whole has a cron expression in
zimmer.yaml
to determine the schedule of reruns
additionally, aswan projects within the dz project can have different cron expressions for scheduling new runs of the aswan projects
Test projects
TODO: document dogshow and everything else much better here
Lookahead
overlapping names convention
resolve naming confusion with colassigner, colaccessor and table feature / composite type / index base classes
abstract composite type + subclass of entity class
import ACT, inherit from it and specify
importing composite type is impossible now if it contains foreign key :(
add option to infer data type of assigned feature
can be problematic b/c pandas int/float/nan issue
create similar sets of features in a dry way
overlapping in entities
detect / signal the same type of entity
exports: postgres, postgis, superset
W3C compliance plan
test suite for compliance: https://w3c.github.io/csvw/publishing-snapshots/PR-earl/earl.html
https://github.com/w3c/csvw
https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/
https://www.w3.org/TR/tabular-metadata/
@article{tennison2015model,
title={Model for tabular data and metadata on the web},
author={Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
year={2015}
}
@article{pollock2015metadata,
title={Metadata vocabulary for tabular data},
author={Pollock, Rufus and Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
journal={W3C Recommendation},
volume={17},
year={2015}
}
Installation:
using pip
pip install datazimmer
Glossary
Namespace
The atomic unit of the knowledge system containing data and metadata
defines
tables
composite types
entity classes
code to build data based on these
represented by
a module in a data project (as datascript) - nested right below the main (src) module
a set of YAML files in
{namespace name}/**.yaml
as serialized metadata in the released sdist, automatically generated from the code
an exported .py file with basic datascript in
{namespace name}/__init__.py
in the released sdist
can import other namespaces, either to use
data (even for foreign keys in tables)
defined composite types / entity classes
Data Project
A versioned set of interconnected namespaces with metadata and different environments
defines
namespaces
different environments where (usually) the same code runs for different data
represented by
a git repository
is a DVC repository
based on a template
has fixed form tags representing the releases and data versions
Registry
A repository containing data about the releases and dependencies of projects to make importing namespaces straightforward
represented by
a git repository (either local or remote)
write access needed to the repo to release to it
contains data about
(named) projects
URI
versions
environment->dvc remote mapping
contains sdist forms of the metadata of projects released there
this makes it possible to set up a special PyPI index, so that installation and dependency resolution are outsourced
Metadata
Information about the data contained in projects
defines
for each namespace
tables
composite types
entity classes
represented
in a project repository
defined in code (datascript object)
scrutable
entitybase
compositetypebase
serialized (generated from code)
YAML files
in runtime
converted as soon as possible to dataclasses in bedrock module
some even in data output in parquet
Config
defines
name
version (this is the metadata version, the data version is determined at release)
default-environment name (the first environment in envs config by default)
validation-environments (the default-environment by default)
registry address (the SSCUB registry by default) TODO: link
imported_projects
either a list of project names to be imported, where other than name, all default values are used
or a dictionary, where the key is the project name (in the registry), and values are:
version (metadata version)
data_namespaces - the namespaces where loading the data is required
in
envs
for each environment (one empty env named complete by default)
params for all local namespaces and global params (namespace params default to these if not defined)
logged to DVC from here
environments of imported projects (where data is needed)
specific DVC remote
where to push data generated as outputs of running the code from namespaces (TODO - find a proper name, e.g. namespace processor) - identified by the name of a remote defined in DVC config
parent env (default-environment by default)
provides all missing keys of parameters or imported namespaces
represented as
zimmer.yaml
in project root
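The two accepted shapes of imported_projects described above (a plain list of names, or a dictionary keyed by project name) can be brought to one common form. The sketch below is only an illustration of that idea: normalize_imported_projects is a hypothetical helper, not part of the datazimmer API, and using None and an empty list as the "default" values is an assumption made for the example.

```python
from typing import Any, Dict, List, Optional, Union

ImportSpec = Dict[str, Any]


def _default_spec() -> ImportSpec:
    # Assumed placeholders standing in for "use the default value".
    return {"version": None, "data_namespaces": []}


def normalize_imported_projects(
    raw: Union[List[str], Dict[str, Optional[ImportSpec]], None]
) -> Dict[str, ImportSpec]:
    """Turn either form of imported_projects into {project name: spec}."""
    if raw is None:
        return {}
    if isinstance(raw, list):
        # List form: only names are given, all other values are defaults.
        return {name: _default_spec() for name in raw}
    # Dictionary form: explicitly given values override the defaults.
    return {name: {**_default_spec(), **(spec or {})} for name, spec in raw.items()}


if __name__ == "__main__":
    print(normalize_imported_projects(["dog-show"]))
    print(normalize_imported_projects({"dog-show": {"data_namespaces": ["core"]}}))
```

Normalizing early means the rest of a tool only has to handle the dictionary form.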
Environment
A complete run of the code in a project with its values for parameters and environments for imported data
defined by config
…
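The parent env fallback mentioned in the config (an environment takes its missing keys from its parent, which is the default environment unless stated otherwise) can be sketched as follows. This is a minimal illustration, not datazimmer's implementation: the parent and params keys and the resolve_params helper are assumed names, and only parameters are merged here.

```python
from typing import Any, Dict, List


def resolve_params(
    envs: Dict[str, Dict[str, Any]], name: str, default_env: str
) -> Dict[str, Any]:
    """Collect the params of env `name`, filling gaps from its parent chain."""
    chain: List[str] = []
    current = name
    # Walk up the parents until an env repeats (the default env is its own parent).
    while current not in chain:
        chain.append(current)
        current = envs[current].get("parent", default_env)
    params: Dict[str, Any] = {}
    # Apply the most distant ancestor first so that closer envs override it.
    for env_name in reversed(chain):
        params.update(envs[env_name].get("params", {}))
    return params


if __name__ == "__main__":
    envs = {
        "complete": {"params": {"top_n": 10, "min_year": 2000}},
        "top_comps": {"params": {"top_n": 3}},
    }
    print(resolve_params(envs, "top_comps", default_env="complete"))
```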
Mock Projects
The dogshow standard projects and explorer are intended to showcase all features of the package.
They have their test registry (TODO: should have sscu-budapest GitHub repo address)
dog-show
dograce
dogsuccess
dogcombine
In the code, changes between versions v0.0, v0.1 and v1.0 are noted with
comments - # add in v0.1:
at the beginning of a line or
# remove in v1.0
at the end
to test for
rebuilding the code with the same version
trying (and failing) to publish a different codebase for a previously published version
test for changed dependencies
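One way to read the comment markers above is that a line starting with # add in vX.Y: is inactive until that version, and a line ending with # remove in vX.Y disappears from that version on. The sketch below renders a source string for a given version under exactly that assumption; render_for_version is a hypothetical helper written for illustration, and the test suite of the package may process the markers differently.

```python
import re
from typing import List, Tuple

ADD_RE = re.compile(r"^(\s*)# add in v(\d+)\.(\d+): ?(.*)$")
REMOVE_RE = re.compile(r"^(.*?)\s*# remove in v(\d+)\.(\d+)\s*$")


def parse_version(version: str) -> Tuple[int, int]:
    """'v0.1' -> (0, 1)"""
    major, minor = version.lstrip("v").split(".")
    return int(major), int(minor)


def render_for_version(source: str, version: str) -> str:
    """Activate 'add' lines and drop 'remove' lines as of `version`."""
    target = parse_version(version)
    out: List[str] = []
    for line in source.splitlines():
        add = ADD_RE.match(line)
        if add:
            indent, major, minor, code = add.groups()
            # The line only exists from the marked version on.
            if (int(major), int(minor)) <= target:
                out.append(indent + code)
            continue
        rem = REMOVE_RE.match(line)
        if rem:
            code, major, minor = rem.groups()
            # The line only exists before the marked version.
            if (int(major), int(minor)) > target:
                out.append(code)
            continue
        out.append(line)
    return "\n".join(out)


if __name__ == "__main__":
    sample = "a = 1\n# add in v0.1: b = 2\nc = 3  # remove in v1.0"
    for v in ("v0.0", "v0.1", "v1.0"):
        print(v, repr(render_for_version(sample, v)))
```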
Naming Conventions and Restrictions
project names:
lower case letters and non-duplicated dashes (-) not at either end
environment and namespace names:
lower case letters and non-duplicated underscores (_) not at either end
Metadata
[a-z]*_table
table names based on singular form of entity, e.g. dog_table
see all in dogshow standard
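The rules for project, environment and namespace names above translate directly into regular expressions. The following is a small sketch of such a check; the function names are made up for this example and are not taken from the package. Note that, read literally, the rules allow only lower case letters, so digits are rejected here.

```python
import re

# Lower case letters, with single dashes only between letter groups.
PROJECT_NAME_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")
# Lower case letters, with single underscores only between letter groups.
ENV_OR_NAMESPACE_RE = re.compile(r"[a-z]+(?:_[a-z]+)*")


def is_valid_project_name(name: str) -> bool:
    return PROJECT_NAME_RE.fullmatch(name) is not None


def is_valid_env_or_namespace_name(name: str) -> bool:
    return ENV_OR_NAMESPACE_RE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("dog-show", "dog--show", "-dogshow", "DogShow"):
        print(candidate, is_valid_project_name(candidate))
    for candidate in ("top_comps", "top__comps", "_core"):
        print(candidate, is_valid_env_or_namespace_name(candidate))
```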
Rules
Notes on rules that are necessary for operation. Too strict or stupid rules need to change!
Only one registered function per namespace
unless for separate environments, like with the helper function of data loaders and environment creation
usually, an env run of a namespace processor (is it called this? TODO) reads and writes to one env, the one it corresponds to, but:
can only write to its env
can read from a different one, but then can only read from that one
this allows for registering a function that creates its environment from a base/complete set
TableNameFeatures should be the name of table feature classes if the table name is to be inferred
No composite features with the same prefix in the same table
Feature name can’t contain __ (dunder)
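The last two rules can be checked mechanically for a single table. The sketch below assumes that a table can be described by a list of feature names and a list of prefixes used for its composite features; both inputs and the feature_rule_violations helper are hypothetical and only illustrate the rules as stated, not how datazimmer enforces them.

```python
from collections import Counter
from typing import Iterable, List


def feature_rule_violations(
    feature_names: Iterable[str], composite_prefixes: Iterable[str]
) -> List[str]:
    """Return a human-readable message for every broken rule."""
    problems: List[str] = []
    for name in feature_names:
        # Rule: a feature name can't contain a double underscore.
        if "__" in name:
            problems.append(f"feature name contains '__': {name}")
    # Rule: no two composite features with the same prefix in one table.
    for prefix, count in Counter(composite_prefixes).items():
        if count > 1:
            problems.append(f"composite prefix used {count} times: {prefix}")
    return problems


if __name__ == "__main__":
    print(feature_rule_violations(["name", "dog__id"], ["owner", "owner", "vet"]))
```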
CLI (WIP)
csv format default
core namespace default
datazimmer registry repository default registry
datazimmer pull dog-show --env top_comps
datazimmer pull dogsuccess/sex_matches --parquet
datazimmer pull something --registry uri://sg-registry/...
API
datazimmer Package
sscu-budapest utilities for scientific data engineering
Functions
helper function to fill the detected env of a dataset
if project is None, raw data output path is given, otherwise imported
registers a function to the pipeline; the names of parameters will matter and will be looked up in conf/envs.yaml params
Convenience functions to use register with typical parameters
Convenience functions to use register with typical parameters
Classes
Class Inheritance Diagram
Classes in the diagram: AbstractEntity, ColAccessor (describe and access raw columns), ColAssigner (define functions that create columns in a dataframe), CompositeTypeBase, DzAswan, EntityClass, Index, Nullable, PersistentState, ReportFile, ScruTable, SourceUrl, _AtomBase
Inheritance edges (base -> derived): ColAccessor -> ColAssigner, ColAssigner -> AbstractEntity, ColAssigner -> CompositeTypeBase, _AtomBase -> EntityClass
Release Notes
v0.1.0
start datazimmer from sscutils
v0.1.1
fix build - add some cron logic
v0.1.2
revert build
v0.1.3
v0.1.4
v0.1.5
v0.2.1
init explorer frame
v0.2.2
auth improvement
v0.2.3
tighter explorer
v0.2.4
tighter explorer
v0.2.5
multitable dec
v0.2.6
minimal to dataset level
v0.2.7
scrutable extension
v0.3.0
simplification and unification
v0.3.1
incremental fixes
v0.3.10
flat index and improved secrets
v0.3.2
requirement and explorer updates
v0.3.3
init explorer and os fix
v0.3.4
nb requirement fix
v0.3.5
title detection fix
v0.3.6
explorer upgrades
v0.3.7
freeze jb version
v0.3.8
flexible pulling
v0.3.9
sql loader extension
v0.4.0
aswan integration init
v0.4.1
here goes nothing
v0.4.10
dvc realign, minor extensions and reorg
v0.4.11
stage non cached outputs
v0.4.12
private zenodo
v0.4.13
aswan reset plus scrutable pickle fix
v0.4.14
minimal alignment, pre-dependency fix
v0.4.15
move import
v0.4.2
collect dependency fix
v0.4.3
minor iterations, redundancy removal
v0.4.4
git fixes, unifications
v0.4.5
aswan alignment
v0.4.6
aswan align
v0.4.7
simplifications and raw-data introduction
v0.4.8
properly citeable
v0.4.9
init zenodo integration
v0.5.0
fix dependencies and drop explorer
v0.5.1
many many bugs fixed
v0.5.2
external dvc zimmauth
v0.5.3
properly rm dvc as dependency
v0.5.4
metaclass def bugfix