Quantify dataset specification

See also

The complete source code of this tutorial can be found in

Quantify dataset - specification.ipynb

Quantify dataset - specification.py

This document describes the Quantify dataset specification. Here we focus on the concepts and terminology specific to the Quantify dataset. It is based on the xarray dataset; hence, we assume basic familiarity with xarray.Dataset. If you are not familiar with it, we highly recommend first having a look at our Xarray - brief introduction for a brief overview.

Coordinates and Variables

The Quantify dataset is an xarray dataset that follows certain conventions. We define “subtypes” of xarray coordinates and variables:

Main coordinate(s)

  • Xarray Coordinates that have an attribute is_main_coord set to True.

  • Often correspond to physical coordinates, e.g., a signal frequency or amplitude.

  • Often correspond to quantities set through Settables.

  • The dataset must have at least one main coordinate.

    • Example: in some cases the idea of a coordinate does not apply; however, a main coordinate in the dataset is required. A simple “index” coordinate should be used, e.g., an array of integers.

  • See also the method get_main_coords().

Secondary coordinate(s)

  • Xarray Coordinates that have an attribute is_main_coord set to False.

  • Similar to main coordinates, but intended to be used by the secondary variables; a ubiquitous example is the coordinates that index “calibration” points.

  • See also get_secondary_coords().

Main variable(s)

  • Xarray Variables that have an attribute is_main_var set to True.

  • Often correspond to a physical quantity being measured, e.g., the signal magnitude at a specific frequency measured on a metal contact of a quantum chip.

  • Often correspond to quantities returned by Gettables.

  • See also get_main_vars().

Secondary variable(s)

  • Similar to main variables, but intended to serve as reference data for other main variables; again, the ubiquitous example is “calibration” datapoints.

  • Xarray Variables that have an attribute is_main_var set to False.

  • The “assignment” of secondary variables to main variables should be done using relationships.

  • See also get_secondary_vars().
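
To make these subtypes concrete, below is a minimal sketch of a dataset with a single main coordinate and a single main variable. It uses the examples_support helpers shown later on this page (assuming they accept the attribute fields as keyword arguments, as in the examples further below); the names and data values are made up for illustration.

import numpy as np
import xarray as xr

from quantify_core.utilities import examples_support

amp = np.linspace(0, 1, 5)  # a main coordinate, e.g., set through a Settable
dataset = xr.Dataset(
    data_vars={
        # a main variable, e.g., values returned by a Gettable
        "magnitude": (
            "dim_0",
            np.exp(-3 * amp),  # dummy measured data
            examples_support.mk_main_var_attrs(unit="V", coords=["amp"]),
        ),
    },
    coords={
        "amp": ("dim_0", amp, examples_support.mk_main_coord_attrs(unit="V")),
    },
    attrs=examples_support.mk_dataset_attrs(),
)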

Note

In this document we show exemplary datasets to highlight the details of the Quantify dataset specification. For completeness, these are always valid Quantify datasets with all the required properties.

In order to follow the rest of this specification more easily, have a look at the example below. It should give you a more concrete feeling for the details described afterwards. See Quantify dataset - examples for exemplary datasets.

Dimensions

The main variables and coordinates present in a Quantify dataset have the following required and optional xarray dimensions:

Main dimension(s) [Required]

The main dimensions comply with the following:

  • The outermost dimension of any main coordinate/variable, OR the second outermost dimension if the outermost one is a repetitions dimension.

  • Do not need to be explicitly specified in any metadata attributes; instead, utilities for extracting them are provided. See get_main_dims(), which simply applies the rule above while inspecting all the main coordinates and variables present in the dataset (a short usage sketch follows the note below).

  • The dataset must have at least one main dimension.

Note on nesting main dimensions

Nesting main dimensions is allowed in principle, and such examples are provided, but it should be considered an experimental feature.
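
As a short usage sketch (assuming a dataset like the one sketched above, whose main coordinate and variable lie along a dim_0 dimension):

from quantify_core.data.dataset_attrs import get_main_dims

main_dims = get_main_dims(dataset)  # e.g., ["dim_0"]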

Secondary dimension(s) [Optional]

Equivalent to the main dimensions but used by the secondary coordinates and variables. The secondary dimensions comply with the following:

  • The outermost dimension of any secondary coordinate/variable, OR the second outermost dimension if the outermost one is a repetitions dimension.

  • Do not need to be explicitly specified in any metadata attributes; instead, utilities for extracting them are provided. See get_secondary_dims(), which simply applies the rule above while inspecting all the secondary coordinates and variables present in the dataset.

Repetitions dimension(s) [Optional]

Repetition dimensions comply with the following:

  • Any dimension that is the outermost dimension of a main or secondary variable when its attribute QVarAttrs.has_repetitions is set to True.

  • Intuition for this xarray dimension(s): the equivalent would be to have dataset_repetition_0.hdf5, dataset_repetition_1.hdf5, etc., where each dataset was obtained from repeating exactly the same experiment. Instead we define an outer dimension for this.

  • Default behavior of (live) plotting and analysis tools can be to average the main variables along the repetitions dimension(s).

  • Can be the outermost dimension of the main (and secondary) variables.

  • Variables can lie along at most one repetitions dimension, which is then the outermost one.
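
As a sketch of this default behavior (assuming a dataset with a main variable pop_q0 whose outermost dimension is named repetitions and has has_repetitions set to True, as in the attributes example later on this page):

# average out the repetitions dimension, e.g., before live plotting or analysis
pop_q0_avg = dataset.pop_q0.mean(dim="repetitions")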

Dataset attributes

The required attributes of the Quantify dataset are defined by the following dataclass. It can be used to generate a default dictionary that is attached to a dataset under the xarray.Dataset.attrs attribute.

class QDatasetAttrs(tuid=None, dataset_name='', dataset_state=None, timestamp_start=None, timestamp_end=None, quantify_dataset_version='2.0.0', software_versions=<factory>, relationships=<factory>, json_serialize_exclude=<factory>)[source]

Bases: DataClassJsonMixin

A dataclass representing the attrs attribute of the Quantify dataset.

All attributes are required to be present, but can be None.

Example

import pendulum

from quantify_core.utilities import examples_support

examples_support.mk_dataset_attrs(
    dataset_name="Bias scan",
    timestamp_start=pendulum.now().to_iso8601_string(),
    timestamp_end=pendulum.now().add(minutes=2).to_iso8601_string(),
    dataset_state="done",
)
{
    'tuid': '20230926-194343-406-e5bcea',
    'dataset_name': 'Bias scan',
    'dataset_state': 'done',
    'timestamp_start': '2023-09-26T19:43:43.406197+02:00',
    'timestamp_end': '2023-09-26T19:45:43.406238+02:00',
    'quantify_dataset_version': '2.0.0',
    'software_versions': {},
    'relationships': [],
    'json_serialize_exclude': []
}
dataset_name: str = ''

The dataset name, usually the same as the experiment name included in the name of the experiment container.

dataset_state: Literal[None, 'running', 'interrupted (safety)', 'interrupted (forced)', 'done'] = None

Denotes the last known state of the experiment/data acquisition that served to ‘build’ this dataset. Can be used later to filter ‘bad’ datasets.

json_serialize_exclude: List[str] = ()

A list of strings corresponding to the names of other attributes that should not be json-serialized when writing the dataset to disk. Empty by default.

quantify_dataset_version: str = '2.0.0'

A string identifying the version of this Quantify dataset for backwards compatibility.

relationships: List[QDatasetIntraRelationship] = ()

A list of relationships within the dataset, specified as a list of dictionaries that comply with the QDatasetIntraRelationship.

software_versions: Dict[str, str] = ()

A mapping of other software packages (and their versions) that are relevant to log for this dataset, e.g., the git tag or commit hash of a lab repository.

Example

import pendulum

from quantify_core.utilities import examples_support

examples_support.mk_dataset_attrs(
    dataset_name="My experiment",
    timestamp_start=pendulum.now().to_iso8601_string(),
    timestamp_end=pendulum.now().add(minutes=2).to_iso8601_string(),
    software_versions={
        "lab_fridge_magnet_driver": "v1.4.2",  # software version/tag
        "my_lab_repo": "9d8acf63f48c469c1b9fa9f2c3cf230845f67b18",  # git commit hash
    },
)
{
    'tuid': '20230926-194343-423-7ca250',
    'dataset_name': 'My experiment',
    'dataset_state': None,
    'timestamp_start': '2023-09-26T19:43:43.423773+02:00',
    'timestamp_end': '2023-09-26T19:45:43.423819+02:00',
    'quantify_dataset_version': '2.0.0',
    'software_versions': {
        'lab_fridge_magnet_driver': 'v1.4.2',
        'my_lab_repo': '9d8acf63f48c469c1b9fa9f2c3cf230845f67b18'
    },
    'relationships': [],
    'json_serialize_exclude': []
}
timestamp_end: Optional[str] = None

Human-readable timestamp (ISO8601) as returned by pendulum.now().to_iso8601_string(). Specifies when the experiment/data acquisition ended.

timestamp_start: Optional[str] = None

Human-readable timestamp (ISO8601) as returned by pendulum.now().to_iso8601_string(). Specifies when the experiment/data acquisition started.

tuid: Optional[str] = None

The time-based unique identifier of the dataset. See quantify_core.data.types.TUID.

Additionally, in order to express relationships between coordinates and/or variables, the following template is provided:

class QDatasetIntraRelationship(item_name=None, relation_type=None, related_names=<factory>, relation_metadata=<factory>)[source]

Bases: DataClassJsonMixin

A dataclass representing a dictionary that specifies a relationship between dataset variables.

A prominent example is calibration points, contained within one or several variables, that are necessary to correctly interpret the data of another variable.

Examples

This is how the attributes of a dataset containing a q0 main variable and q0_cal secondary variables would look. The q0_cal variable corresponds to calibration datapoints. See Quantify dataset - examples for examples with more context.

from quantify_core.data.dataset_attrs import QDatasetIntraRelationship
from quantify_core.utilities import examples_support

attrs = examples_support.mk_dataset_attrs(
    relationships=[
        QDatasetIntraRelationship(
            item_name="q0",
            relation_type="calibration",
            related_names=["q0_cal"],
        ).to_dict()
    ]
)
item_name: str = None

The name of the coordinate/variable to which we want to relate other coordinates/variables.

related_names: List[str] = ()

A list of names related to the item_name.

relation_metadata: Dict[str, Any] = ()

A free-form dictionary to store additional information relevant to this relationship.

relation_type: str = None

A string specifying the type of relationship.

Reserved relation types:

"calibration" - Specifies a list of main variables used as calibration data for the main variables whose name is specified by the item_name.

from quantify_core.data.dataset_attrs import QDatasetAttrs

# tip: to_json and from_dict, from_json  are also available
dataset.attrs = QDatasetAttrs().to_dict()
dataset.attrs
{
    'tuid': None,
    'dataset_name': '',
    'dataset_state': None,
    'timestamp_start': None,
    'timestamp_end': None,
    'quantify_dataset_version': '2.0.0',
    'software_versions': {},
    'relationships': [],
    'json_serialize_exclude': []
}

Tip

Note that xarray automatically exposes the entries of the dataset attributes as Python attributes, and similarly for the xarray coordinates and data variables.

dataset.quantify_dataset_version, dataset.tuid
('2.0.0', None)

Main coordinates and variables attributes

Similar to the dataset attributes (xarray.Dataset.attrs), the main coordinates and variables each have their own required attributes attached to them as a dictionary under the xarray.DataArray.attrs attribute.

class QCoordAttrs(unit='', long_name='', is_main_coord=None, uniformly_spaced=None, is_dataset_ref=False, json_serialize_exclude=<factory>)[source]

Bases: DataClassJsonMixin

A dataclass representing the attrs attribute of main and secondary coordinates.

All attributes are required to be present, but can be None.

Examples

from quantify_core.utilities import examples_support

examples_support.mk_main_coord_attrs()
{
    'unit': '',
    'long_name': '',
    'is_main_coord': True,
    'uniformly_spaced': True,
    'is_dataset_ref': False,
    'json_serialize_exclude': []
}
examples_support.mk_secondary_coord_attrs()
{
    'unit': '',
    'long_name': '',
    'is_main_coord': False,
    'uniformly_spaced': True,
    'is_dataset_ref': False,
    'json_serialize_exclude': []
}
is_dataset_ref: bool = False

Flags if it is an array of quantify_core.data.types.TUIDs of other datasets.

is_main_coord: bool = None

When set to True, flags the xarray coordinate as a main coordinate; otherwise (False), it corresponds to a secondary coordinate.

json_serialize_exclude: List[str] = ()

A list of strings corresponding to the names of other attributes that should not be json-serialized when writing the dataset to disk. Empty by default.

long_name: str = ''

A long name for this coordinate.

uniformly_spaced: Optional[bool] = None

Indicates if the values are uniformly spaced.

unit: str = ''

The units of the values.

dataset.amp.attrs
{
    'unit': 'V',
    'long_name': 'Amplitude',
    'is_main_coord': True,
    'uniformly_spaced': True,
    'is_dataset_ref': False,
    'json_serialize_exclude': []
}
class QVarAttrs(unit='', long_name='', is_main_var=None, uniformly_spaced=None, grid=None, is_dataset_ref=False, has_repetitions=False, json_serialize_exclude=<factory>)[source]

Bases: DataClassJsonMixin

A dataclass representing the attrs attribute of main and secondary variables.

All attributes are required to be present, but can be None.

Examples

from quantify_core.utilities import examples_support

examples_support.mk_main_var_attrs(coords=["time"])
{
    'unit': '',
    'long_name': '',
    'is_main_var': True,
    'uniformly_spaced': True,
    'grid': True,
    'is_dataset_ref': False,
    'has_repetitions': False,
    'json_serialize_exclude': [],
    'coords': ['time']
}
examples_support.mk_secondary_var_attrs(coords=["cal"])
{
    'unit': '',
    'long_name': '',
    'is_main_var': False,
    'uniformly_spaced': True,
    'grid': True,
    'is_dataset_ref': False,
    'has_repetitions': False,
    'json_serialize_exclude': [],
    'coords': ['cal']
}
grid: Optional[bool] = None

Indicates if the variable's data are located on a grid, which does not need to be uniformly spaced along all dimensions. In other words, specifies if the corresponding main coordinates are the ‘unrolled’ points (also known as ‘unstacked’) corresponding to a grid.

If True, then it is possible to use quantify_core.data.handling.to_gridded_dataset() to convert the variables to a ‘stacked’ version.
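
A minimal usage sketch (assuming dataset holds main variables with grid set to True, whose main coordinates are the ‘unrolled’ points of a grid):

import quantify_core.data.handling as dh

# 'stacks' the flat data back onto the grid spanned by the main coordinates
gridded_dataset = dh.to_gridded_dataset(dataset)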

has_repetitions: bool = False

Indicates that the outermost dimension of this variable is a repetitions dimension. This attribute is intended to allow easy programmatic detection of such a dimension. It can be used, for example, to average along this dimension before automatic live plotting or analysis.

is_dataset_ref: bool = False

Flags if it is an array of quantify_core.data.types.TUIDs of other datasets. See also Dataset for a “nested MeasurementControl” experiment.

is_main_var: bool = None

When set to True, flags this xarray data variable as a main variable; otherwise (False), it corresponds to a secondary variable.

json_serialize_exclude: List[str] = ()

A list of strings corresponding to the names of other attributes that should not be json-serialized when writing the dataset to disk. Empty by default.

long_name: str = ''

A long name for this variable.

uniformly_spaced: Optional[bool] = None

Indicates if the values are uniformly spaced. This does not apply to ‘true’ main variables; however, because xarray does not yet support writing a MultiIndex to disk, some coordinate variables have to be stored as main variables instead.

unit: str = ''

The units of the values.

dataset.pop_q0.attrs
{
    'unit': '',
    'long_name': 'Population Q0',
    'is_main_var': True,
    'uniformly_spaced': True,
    'grid': True,
    'is_dataset_ref': False,
    'has_repetitions': True,
    'json_serialize_exclude': []
}

Storage format

The Quantify dataset is written to disk and loaded back using xarray-supported facilities. Internally, we write to and load from disk using:

display_source_code(dh.write_dataset)
display_source_code(dh.load_dataset)
def write_dataset(path: Union[Path, str], dataset: xr.Dataset) -> None:
    """
    Writes a :class:`~xarray.Dataset` to a file with the `h5netcdf` engine.

    Before writing the
    :meth:`AdapterH5NetCDF.adapt() <quantify_core.data.dataset_adapters.AdapterH5NetCDF.adapt>`
    is applied.

    To accommodate for complex-type numbers and arrays ``invalid_netcdf=True`` is used.

    Parameters
    ----------
    path
        Path to the file including filename and extension
    dataset
        The :class:`~xarray.Dataset` to be written to file.
    """  # pylint: disable=line-too-long
    _xarray_numpy_bool_patch(dataset)  # See issue #161 in quantify-core
    # Only quantify_dataset_version=>2.0.0 requires the adapter
    if "quantify_dataset_version" in dataset.attrs:
        dataset = da.AdapterH5NetCDF.adapt(dataset)
    dataset.to_netcdf(path, engine="h5netcdf", invalid_netcdf=True)
def load_dataset(
    tuid: TUID, datadir: str = None, name: str = DATASET_NAME
) -> xr.Dataset:
    """
    Loads a dataset specified by a tuid.

    .. tip::

        This method also works when specifying only the first part of a
        :class:`~quantify_core.data.types.TUID`.

    .. note::

        This method uses :func:`~.load_dataset` to ensure the file is closed after
        loading as datasets are intended to be immutable after performing the initial
        experiment.

    Parameters
    ----------
    tuid
        A :class:`~quantify_core.data.types.TUID` string. It is also possible to specify
        only the first part of a tuid.
    datadir
        Path of the data directory. If ``None``, uses :meth:`~get_datadir` to determine
        the data directory.
    Returns
    -------
    :
        The dataset.
    Raises
    ------
    FileNotFoundError
        No data found for specified date.
    """
    return load_dataset_from_path(_locate_experiment_file(tuid, datadir, name))

Note that we use the h5netcdf engine, which is more permissive than the default NetCDF engine, in order to accommodate arrays of complex numbers.
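
For illustration, a full round trip to disk could look like the sketch below (the file path is hypothetical; load_dataset_from_path() is the helper used internally by load_dataset() shown above):

from pathlib import Path

import quantify_core.data.handling as dh

path = Path("/tmp") / "quantify_dataset.hdf5"  # hypothetical location

dh.write_dataset(path, dataset)  # adapts the attributes and writes via h5netcdf
dataset_loaded = dh.load_dataset_from_path(path)  # loads the dataset back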

Note

Furthermore, in order to support a variety of attribute types (e.g. the None type) and shapes (e.g. nested dictionaries) in a seamless dataset round trip, some additional tooling is required. See the source code below, which implements the two-way conversion adapter used by the functions shown above.

display_source_code(dadapters.AdapterH5NetCDF)
class AdapterH5NetCDF(DatasetAdapterBase):
    """
    Quantify dataset adapter for the ``h5netcdf`` engine.

    It has the functionality of adapting the Quantify dataset to a format compatible
    with the ``h5netcdf`` xarray backend engine that is used to write and load the
    dataset to/from disk.

    .. warning::

        The ``h5netcdf`` engine has minor issues when performing a two-way trip of the
        dataset. The ``type`` of some attributes are not preserved. E.g., list- and
        tuple-like objects are loaded as numpy arrays of ``dtype=object``.
    """

    @classmethod
    def adapt(cls, dataset: xr.Dataset) -> xr.Dataset:
        """
        Serializes to JSON the dataset and variables attributes.

        To prevent the JSON serialization for specific items, their names should be
        listed under the attribute named ``json_serialize_exclude`` (for each ``attrs``
        dictionary).

        Parameters
        ----------
        dataset
            Dataset that needs to be adapted.

        Returns
        -------
        :
            Dataset in which the attributes have been replaced with their JSON strings
            version.
        """

        return cls._transform(dataset, vals_converter=json.dumps)

    @classmethod
    def recover(cls, dataset: xr.Dataset) -> xr.Dataset:
        """
        Reverts the action of ``.adapt()``.

        To prevent the JSON de-serialization for specific items, their names should be
        listed under the attribute named ``json_serialize_exclude``
        (for each ``attrs`` dictionary).

        Parameters
        ----------
        dataset
            Dataset from which to recover the original format.

        Returns
        -------
        :
            Dataset in which the attributes have been replaced with their python objects
            version.
        """

        return cls._transform(dataset, vals_converter=json.loads)

    @staticmethod
    def attrs_convert(
        attrs: dict,
        inplace: bool = False,
        vals_converter: Callable[Any, Any] = json.dumps,
    ) -> dict:
        """
        Converts to/from JSON string the values of the keys which are not listed in the
        ``json_serialize_exclude`` list.

        Parameters
        ----------
        attrs
            The input dictionary.
        inplace
            If ``True`` the values are replaced in place, otherwise a deepcopy of
            ``attrs`` is performed first.
        """
        json_serialize_exclude = attrs.get("json_serialize_exclude", [])

        attrs = attrs if inplace else deepcopy(attrs)
        for attr_name, attr_val in attrs.items():
            if attr_name not in json_serialize_exclude:
                attrs[attr_name] = vals_converter(attr_val)
        return attrs

    @classmethod
    def _transform(
        cls, dataset: xr.Dataset, vals_converter: Callable[Any, Any] = json.dumps
    ) -> xr.Dataset:
        dataset = xr.Dataset(
            dataset,
            attrs=cls.attrs_convert(
                dataset.attrs, inplace=False, vals_converter=vals_converter
            ),
        )

        for var_name in dataset.variables.keys():
            # The new dataset generated above has already a deepcopy of the attributes.
            _ = cls.attrs_convert(
                dataset[var_name].attrs, inplace=True, vals_converter=vals_converter
            )

        return dataset
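
As a usage sketch, the adapter can also be applied and reverted manually (assuming dataset is a Quantify dataset held in memory):

from quantify_core.data import dataset_adapters as dadapters

dataset_adapted = dadapters.AdapterH5NetCDF.adapt(dataset)  # attrs -> JSON strings
dataset_recovered = dadapters.AdapterH5NetCDF.recover(dataset_adapted)  # back to Python objects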