datazimmer

To create a new project

  • make sure that python points to Python >= 3.8 and that pip and git are available, then run pip install datazimmer

  • run dz init project-name

  • add a remote

    • both to git and dvc (can run dz build-meta to see available dvc remotes)

    • git remote can be given with dz init

  • create, register and document steps in a pipeline you will run in different environments

  • build metadata to exportable and serialized format with dz build-meta

    • if you defined importable data from other artifacts in the config, you can import them with load-external-data

    • ensure that you import envs that are served from sources you have access to

  • build and run pipeline steps by running dz run

  • validate that the data matches the datascript description with dz validate
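The command sequence above can be sketched as a small Python driver. This is only an illustration: it covers the four dz commands named in the list, skips the manual steps in between (adding remotes, writing and registering pipeline steps), and the function names are hypothetical. By default it only prints the commands.

```python
import shlex
import subprocess


def workflow_commands(project_name):
    """Return the dz commands from the list above, in order, as argv lists."""
    return [
        ["dz", "init", project_name],  # create the project
        ["dz", "build-meta"],          # serialize metadata, show DVC remotes
        ["dz", "run"],                 # build and run pipeline steps
        ["dz", "validate"],            # check data against the datascript
    ]


def run_workflow(project_name, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for argv in workflow_commands(project_name):
        print("$", shlex.join(argv))
        if not dry_run:
            # Assumption: dz init creates a directory named after the project
            # and the later commands are run from inside that directory.
            cwd = None if argv[1] == "init" else project_name
            subprocess.run(argv, check=True, cwd=cwd)


run_workflow("project-name")
```

Calling run_workflow("project-name", dry_run=False) would actually invoke dz, so it only makes sense once the remotes and pipeline steps are in place.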

Scheduling

  • a project as a whole has a cron expression in zimmer.yaml to determine the schedule of reruns

  • additionally, aswan projects within the dz project can have different cron expressions for scheduling new runs of the aswan projects
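To make the role of the cron expressions concrete, here is a minimal sketch of checking whether a five-field cron expression fires at a given moment. It is not how datazimmer evaluates schedules; it is a simplified matcher (all five fields must match, and */n is treated as "divisible by n"), and the example expressions are assumptions rather than values from any real zimmer.yaml.

```python
from datetime import datetime


def _field_matches(field, value):
    """Match one cron field supporting '*', '*/n', 'a-b' and comma lists."""
    for part in field.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            if value % int(part[2:]) == 0:
                return True
        elif "-" in part:
            low, high = (int(x) for x in part.split("-"))
            if low <= value <= high:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr, moment):
    """Return True if a five-field cron expression fires at `moment`."""
    minute, hour, day, month, weekday = expr.split()
    cron_weekday = (moment.weekday() + 1) % 7  # cron counts Sunday as 0
    pairs = [
        (minute, moment.minute),
        (hour, moment.hour),
        (day, moment.day),
        (month, moment.month),
        (weekday, cron_weekday),
    ]
    return all(_field_matches(field, value) for field, value in pairs)


project_cron = "0 3 * * 1"   # hypothetical: whole project, Mondays at 03:00
aswan_cron = "*/15 * * * *"  # hypothetical: an aswan project, every 15 minutes

monday_3am = datetime(2024, 1, 1, 3, 0)
print(cron_matches(project_cron, monday_3am))  # prints True
print(cron_matches(aswan_cron, monday_3am))    # prints True
```

The two expressions are independent, which mirrors the text: the project-level schedule and the schedule of each aswan project can differ.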

Test projects

TODO: document dogshow and everything else much better here

Lookahead

  • overlapping names convention

  • resolve naming confusion with colassigner, colaccessor and table feature / composite type / index base classes

  • abstract composite type + subclass of entity class

    • import ACT, inherit from it and specify

    • importing composite type is impossible now if it contains foreign key :(

  • add option to infer data type of assigned feature

    • can be problematic b/c pandas int/float/nan issue

  • create similar sets of features in a dry way

  • overlapping in entities

    • detect / signal the same type of entity

  • exports: postgres, postgis, superset

W3C compliancy plan

  • test suite for compliance: https://w3c.github.io/csvw/publishing-snapshots/PR-earl/earl.html

  • https://github.com/w3c/csvw

    • https://www.w3.org/TR/2015/REC-tabular-data-model-20151217/

    • https://www.w3.org/TR/tabular-metadata/

@article{tennison2015model,
  title={Model for tabular data and metadata on the web},
  author={Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
  year={2015}
}
@article{pollock2015metadata,
  title={Metadata vocabulary for tabular data},
  author={Pollock, Rufus and Tennison, Jeni and Kellogg, Gregg and Herman, Ivan},
  journal={W3C Recommendation},
  volume={17},
  year={2015}
}

Installation:

using pip

pip install datazimmer

Glossary

Namespace

The atomic unit of the knowledge system containing data and metadata

  • defines

    • tables

    • composite types

    • entity classes

    • code to build data based on these

  • represented by

    • a module in a data project (as datascript) - nested right below the main (src) module

    • a set of YAML files in {namespace name}/**.yaml as serialized metadata in the released sdist

      • automatically generated from the code

    • an exported .py file with basic datascript in {namespace name}/__init__.py in the released sdist

  • can import other namespaces, either to use

    • data (even for foreign keys in tables)

    • defined composite types / entity classes

Data Project

A versioned set of interconnected namespaces with metadata and different environments

  • defines

    • namespaces

    • different environments where (usually) the same code runs for different data

  • represented by

    • a git repository

      • is a DVC repository

      • based on a template

      • has fixed form tags representing the releases and data versions

Registry

A repository containing data about the releases and dependencies of projects to make importing namespaces straightforward

  • represented by

    • a git repository (either local or remote)

    • write access needed to the repo to release to it

  • contains data about

    • (named) projects

      • URI

      • versions

      • environment->dvc remote mapping

  • contains sdist forms of metadata of projects released there

    • to set up a special PyPI index so that installation and dependency resolution is outsourced

Metadata

Information about the data contained in projects

  • defines

    • for each namespace

      • tables

      • composite types

      • entity classes

  • represented by

    • in a project repository

      • defined in code (datascript object)

        • scrutable

        • entitybase

        • compositetypebase

      • serialized (generated from code)

        • YAML files

    • in runtime

      • converted as soon as possible to dataclasses in bedrock module

    • some even in data output in parquet

Config

  • defines

    • name

    • version (this is the metadata version, the data version is determined at release)

    • default-environment name (the first environment in envs config by default)

    • validation-environments (the default-environment by default)

    • registry address (the SSCUB registry by default) TODO: link

    • imported_projects

      • either a list of project names to be imported, where other than name, all default values are used

      • or a dictionary, where the key is the project name (in the registry), and values are:

        • version (metadata version)

        • data_namespaces - the namespaces where loading the data is required

    • in envs for each environment (one empty env named complete by default)

      • params for all local namespaces and global params (namespace params default to these if not defined)

        • logged to DVC from here

      • environments of imported projects (where data is needed)

      • specific DVC remote

        • where to push data generated as outputs of running the code from namespaces (TODO - find a proper name, e.g. namespace processor) - identified by the name of a remote defined in DVC config

      • parent env (default-environment by default)

        • supplies all missing keys of parameters or imported namespaces

  • represented as zimmer.yaml in project root
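The defaulting behaviour described above (list or dictionary form for imported projects, and environments inheriting missing parameters from a parent) can be sketched in plain Python. The dictionary below only mimics what a parsed zimmer.yaml might look like: the exact key spellings, the project names and the parameter names are assumptions for illustration, and the functions are not part of datazimmer.

```python
def normalize_imports(imported_projects):
    """Turn the list form into the dictionary form with empty (default) settings."""
    if isinstance(imported_projects, dict):
        return {name: dict(conf or {}) for name, conf in imported_projects.items()}
    return {name: {} for name in imported_projects}


def resolve_params(envs, env_name, default_env):
    """Fill parameters missing from an environment from its parent chain."""
    env = envs[env_name]
    own = dict(env.get("params", {}))
    if env_name == default_env:
        return own  # assumption: the default environment is the root
    parent = env.get("parent", default_env)  # parent is the default env by default
    resolved = resolve_params(envs, parent, default_env)
    resolved.update(own)  # values set in the child override the parent
    return resolved


# Hypothetical parsed config; only the overall shape follows the description.
config = {
    "name": "my-project",
    "version": "v0.0",
    "default_env": "complete",
    "imported_projects": ["other-project"],
    "envs": {
        "complete": {"params": {"top_n": 100, "min_age": 1}},
        "top_comps": {"params": {"top_n": 10}},
    },
}

print(normalize_imports(config["imported_projects"]))
print(resolve_params(config["envs"], "top_comps", config["default_env"]))
```

Here top_comps overrides top_n and takes min_age from complete, which is the "missing keys come from the parent env" rule in miniature.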

Environment

A complete run of the code in a project with its values for parameters and environments for imported data

  • defined by config

Mock Projects

The dogshow standard projects and explorer are intended to showcase all features of the package.

They have their test registry (TODO: should have sscu-budapest GitHub repo address)

  • dog-show

  • dograce

  • dogsuccess

  • dogcombine

In the code, changes between versions v0.0, v0.1 and v1.0 are noted with comments: # add in v0.1: at the beginning of a line, or # remove in v1.0 at the end

  • to test for

    • rebuilding the code with the same version

    • trying (and failing) to publish a different codebase for a previously published version

    • changed dependencies
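As an illustration of the version-marker comments, the sketch below renders one annotated source into the form it would take at a given version. It is a guess at the mechanics, not the test suite's actual implementation: the function name, the regular expressions and the sample source are all hypothetical.

```python
import re

ADD_RE = re.compile(r"^(\s*)# add in v(\d+)\.(\d+): ?(.*)$")
REMOVE_RE = re.compile(r"#\s*remove in v(\d+)\.(\d+)\s*$")


def render_version(source, version):
    """Return the source as it would look at `version`, e.g. (0, 1)."""
    out = []
    for line in source.splitlines():
        add = ADD_RE.match(line)
        if add:
            indent, major, minor, code = add.groups()
            if (int(major), int(minor)) <= version:
                out.append(indent + code)  # marker reached: activate the line
            continue  # otherwise the line does not exist yet
        remove = REMOVE_RE.search(line)
        if remove:
            if (int(remove.group(1)), int(remove.group(2))) <= version:
                continue  # marker reached: the line is dropped
            line = line[: remove.start()].rstrip()  # keep code, strip marker
        out.append(line)
    return "\n".join(out)


sample = """\
name = "dog"
# add in v0.1: size = "large"
color = "brown"  # remove in v1.0
"""

for version in [(0, 0), (0, 1), (1, 0)]:
    print(version)
    print(render_version(sample, version))
```

With this sample, v0.0 contains name and color, v0.1 adds size, and v1.0 drops color.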

Naming Conventions and Restrictions

  • project names:

    • lower case letters and non-duplicated dashes (-) not at either end

  • environment and namespace names:

    • lower case letters and non-duplicated underscores (_) not at either end
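The two naming rules above translate directly into regular expressions. The following is a minimal sketch of such checks; the constant and function names are made up for this example and are not datazimmer's own validators.

```python
import re

# Lower case letters, single dashes between them, none at either end.
PROJECT_NAME_RE = re.compile(r"[a-z]+(?:-[a-z]+)*")
# Same shape, with underscores as the separator.
ENV_OR_NAMESPACE_RE = re.compile(r"[a-z]+(?:_[a-z]+)*")


def is_valid_project_name(name):
    """Check a project name such as 'dog-show'."""
    return PROJECT_NAME_RE.fullmatch(name) is not None


def is_valid_env_or_namespace_name(name):
    """Check an environment or namespace name such as 'top_comps'."""
    return ENV_OR_NAMESPACE_RE.fullmatch(name) is not None


for candidate in ["dog-show", "dog--show", "-dog", "Dog"]:
    print(candidate, is_valid_project_name(candidate))
```

Using fullmatch rather than match ensures that trailing separators or extra characters are rejected.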

Metadata

  • [a-z]*_table table names based on the singular form of the entity, e.g. dog_table

  • see all in dogshow standard

Rules

Notes on rules that are necessary for operation. Too strict or stupid rules need to change!

  • Only one registered function per namespace

    • unless for separate environments, like with the helper function of data loaders and environment creation

  • usually, an env run of a namespace processor (is it called this? TODO) reads and writes to one env, the one it corresponds to, but:

    • can only write to its env

    • can read from a different one, but then can only read from that one

    • this allows for registering a function that creates its environment from a base/complete set

  • TableNameFeatures should be the name of table feature classes if the table name is to be inferred

  • No composite features with the same prefix in the same table

  • Feature name can’t contain __ (dunder)
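The last three rules lend themselves to simple checks. The sketch below is hypothetical throughout: the helper names are invented, and the way a table name is derived from a TableNameFeatures class name (strip the Features suffix, convert CamelCase to snake_case) is an assumption about what "inferred" means here, not the package's documented behaviour.

```python
import re
from collections import Counter


def infer_table_name(class_name):
    """Assumed inference: 'DogSizeFeatures' -> 'dog_size'."""
    suffix = "Features"
    if not class_name.endswith(suffix) or class_name == suffix:
        raise ValueError(f"expected a name like TableNameFeatures, got {class_name!r}")
    stem = class_name[: -len(suffix)]
    # Insert an underscore before every capital letter except the first.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", stem).lower()


def find_rule_violations(feature_names, composite_prefixes):
    """Collect violations of the dunder and duplicate-prefix rules for one table."""
    problems = []
    for name in feature_names:
        if "__" in name:
            problems.append(f"feature name contains '__': {name}")
    for prefix, count in Counter(composite_prefixes).items():
        if count > 1:
            problems.append(f"composite prefix used {count} times: {prefix}")
    return problems


print(infer_table_name("DogFeatures"))
print(find_rule_violations(["name", "birth__date"], ["owner", "owner"]))
```

Returning a list of problems instead of raising on the first one makes it easier to report every violation in a table at once.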

CLI (WIP)

  • csv format default

  • core namespace default

  • datazimmer registry repository default registry

datazimmer pull dog-show --env top_comps

datazimmer pull dogsuccess/sex_matches --parquet

datazimmer pull something --registry uri://sg-registry/...

API

datazimmer Package

sscu-budapest utilities for scientific data engineering

Functions

dump_dfs_to_tables(df_structable_pairs[, ...])

helper function to fill the detected env of a dataset

get_raw_data_path(leaf_name[, project])

If project is None, the raw data output path is returned; otherwise, the path of the raw data imported from that project.

parse_df(df, entity[, verbose])

register([procfun, dependencies, outputs, ...])

Registers a function to the pipeline. The names of its parameters matter: they are looked up in conf/envs.yaml params.

register_data_loader([fun, extra_deps])

Convenience functions to use register with typical parameters

register_env_creator([fun, extra_deps])

Convenience functions to use register with typical parameters

Classes

AbstractEntity()

CompositeTypeBase()

DzAswan([global_run])

EntityClass(name, identifiers, ...)

Index()

Nullable(dtype)

PersistentState()

ReportFile(filename)

ScruTable(entity[, entity_key_table_map, ...])

SourceUrl(_)

Class Inheritance Diagram

Inheritance relations (base → derived):

  • ColAccessor → ColAssigner

  • ColAssigner → AbstractEntity

  • ColAssigner → CompositeTypeBase

  • _AtomBase → EntityClass

DzAswan, Index, Nullable, PersistentState, ReportFile, ScruTable and SourceUrl appear without inheritance edges.

Release Notes

v0.1.0

start datazimmer from sscutils

v0.1.1

fix build - add some cron logic

v0.1.2

revert build

v0.1.3

v0.1.4

v0.1.5

v0.2.1

init explorer frame

v0.2.2

auth improvement

v0.2.3

tighter explorer

v0.2.4

tighter explorer

v0.2.5

multitable dec

v0.2.6

minimal to dataset level

v0.2.7

scrutable extension

v0.3.0

simplification and unification

v0.3.1

incremental fixes

v0.3.10

flat index and improved secrets

v0.3.2

requirement and explorer updates

v0.3.3

init explorer and os fix

v0.3.4

nb requirement fix

v0.3.5

title detection fix

v0.3.6

explorer upgrades

v0.3.7

freeze jb version

v0.3.8

flexible pulling

v0.3.9

sql loader extension

v0.4.0

aswan integration init

v0.4.1

here goes nothing

v0.4.10

dvc realign, minor extensions and reorg

v0.4.11

stage non cached outputs

v0.4.12

private zenodo

v0.4.13

aswan reset plus scrutable pickle fix

v0.4.14

minimal alignment, pre-dependency fix

v0.4.15

move import

v0.4.2

collect dependency fix

v0.4.3

minor iterations, redundancy removal

v0.4.4

git fixes, unifications

v0.4.5

aswan alignment

v0.4.6

aswan align

v0.4.7

simplifications and raw-data introduction

v0.4.8

properly citeable

v0.4.9

init zenodo integration

v0.5.0

fix dependencies and drop explorer

v0.5.1

many many bugs fixed

v0.5.2

external dvc zimmauth

v0.5.3

properly rm dvc as dependency

v0.5.4

metaclass def bugfix
