lgdo.lh5 package¶
Routines for reading and writing LEGEND Data Objects in HDF5 files.
Currently, the primary on-disk format for LGDO objects is LEGEND HDF5 (LH5) files. I/O
is done via the class store.LH5Store. LH5 files can also be
browsed easily in Python like any HDF5 file using
h5py.
Subpackages¶
- lgdo.lh5._serializers package
  - Subpackages
    - lgdo.lh5._serializers.read package
      - Submodules
        - lgdo.lh5._serializers.read.array module
        - lgdo.lh5._serializers.read.composite module
        - lgdo.lh5._serializers.read.encoded module
        - lgdo.lh5._serializers.read.ndarray module
        - lgdo.lh5._serializers.read.scalar module
        - lgdo.lh5._serializers.read.utils module
        - lgdo.lh5._serializers.read.vector_of_vectors module
    - lgdo.lh5._serializers.write package
Submodules¶
lgdo.lh5.core module¶
- lgdo.lh5.core.read(name, lh5_file, start_row=0, n_rows=9223372036854775807, idx=None, use_h5idx=False, field_mask=None, obj_buf=None, obj_buf_start=0, decompress=True)¶
Read LH5 object data from a file.
Note
Use the idx parameter to read out particular rows of the data. The use_h5idx flag controls whether only those rows are read from disk or if the rows are indexed after reading the entire object. Reading individual rows can be orders of magnitude slower than reading the whole object and then indexing the desired rows. The default behavior (use_h5idx=False) is to use slightly more memory for a much faster read. See legend-pydataobj/issues/#29 for additional information.
- Parameters:
name (str) – Name of the LH5 object to be read (including its group path).
lh5_file (str | File | Sequence[str | File]) – The file(s) containing the object to be read out. If a list of files, array-like object data will be concatenated into the output object.
start_row (int) – Starting entry for the object read (for array-like objects). For a list of files, only applies to the first file.
n_rows (int) – The maximum number of rows to read (for array-like objects). The actual number of rows read will be returned as one of the return values (see below).
idx (_SupportsArray[dtype[Any]] | _NestedSequence[_SupportsArray[dtype[Any]]] | bool | int | float | complex | str | bytes | _NestedSequence[bool | int | float | complex | str | bytes] | None) – NumPy-style “fancy indexing” for the read, used to select only some rows, e.g. after applying some cuts to particular columns. Only selection along the first axis is supported, so tuple arguments must be one-tuples. If n_rows is not false, idx will be truncated to n_rows before reading. To use with a list of files, pass in a list of idx’s (one for each file) or use a long contiguous list (e.g. built from a previous identical read). If used in conjunction with start_row and n_rows, idx will be sliced to obey those constraints, where n_rows is interpreted as the (max) number of selected values (in idx) to be read out. Note that the use_h5idx parameter controls some behavior of the read and that the default behavior (use_h5idx=False) prioritizes speed over a small memory penalty.
use_h5idx (bool) – True will directly pass the idx parameter to the underlying h5py call such that only the selected rows are read directly into memory, which conserves memory at the cost of speed. There can be a significant penalty to speed for larger files (1 - 2 orders of magnitude longer time). False (default) will read the entire object into memory before performing the indexing. The default is much faster but requires additional memory, though a relatively small amount in the typical use case. It is recommended to leave this parameter as its default.
field_mask (Mapping[str, bool] | Sequence[str] | None) – For tables and structs, determines which fields get read out. Only applies to immediate fields of the requested objects. If a dict is used, a default dict will be made with the default set to the opposite of the first element in the dict. This way if one specifies a few fields as False, all but those fields will be read out, while if one specifies just a few fields as True, only those fields will be read out. If a list is provided, the listed fields will be set to True, while the rest will default to False.
obj_buf (LGDO | None) – Read directly into memory provided in obj_buf. Note: the buffer will be expanded to accommodate the data requested. To maintain the buffer length, send in n_rows = len(obj_buf).
obj_buf_start (int) – Start location in obj_buf for read. For concatenating data to array-like objects.
decompress (bool) – Decompress data encoded with LGDO’s compression routines right after reading. The option has no effect on data encoded with HDF5 built-in filters, which is always decompressed upstream by HDF5.
- Returns:
(object, n_rows_read) – object is the read-out object; n_rows_read is the number of rows successfully read out. Essential for arrays when the amount of data is smaller than the object buffer. For scalars and structs n_rows_read will be 1. For tables it is redundant with table.loc. If obj_buf is None, only object is returned.
- Return type:
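The field_mask rule described in the parameters above can be illustrated with a minimal sketch. The helper name build_field_mask is hypothetical and not part of lgdo, and treating None as "read every field" is an assumption; only the rule that the default is the opposite of the first dict entry, and that listed fields become True, comes from the parameter description.

```python
from collections import defaultdict


def build_field_mask(field_mask):
    """Hypothetical helper: expand a field_mask into a default dict."""
    if field_mask is None:
        # Assumption: no mask means every field is read.
        return defaultdict(lambda: True)
    if isinstance(field_mask, dict):
        # Default is the opposite of the first value in the dict
        # (an empty dict falls back to reading everything).
        default = not next(iter(field_mask.values()), False)
        return defaultdict(lambda: default, field_mask)
    # A list or tuple: listed fields are True, all others False.
    return defaultdict(bool, {name: True for name in field_mask})


# Excluding one field keeps all the others ...
exclude_mask = build_field_mask({"waveform": False})
# ... while listing fields selects only those.
include_mask = build_field_mask(["energy"])
```

Looking up an unlisted field in exclude_mask yields True, while the same lookup in include_mask yields False.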
- lgdo.lh5.core.read_as(name, lh5_file, library, **kwargs)¶
Read LH5 data from disk straight into a third-party data format view.
This function is nothing more than a shortcut chained call to read() and to LGDO.view_as().
- Parameters:
- Return type:
See also
- lgdo.lh5.core.write(obj, name, lh5_file, group='/', start_row=0, n_rows=None, wo_mode='append', write_start=0, **h5py_kwargs)¶
Write an LGDO into an LH5 file.
If the obj LGDO has a compression attribute, its value is interpreted as the algorithm to be used to compress obj before writing to disk. The type of compression can be:
- string, kwargs dictionary, hdf5plugin filter
interpreted as the name of a built-in or custom HDF5 compression filter ("gzip", "lzf", hdf5plugin filter object etc.) and passed directly to h5py.Group.create_dataset().
- WaveformCodec object
If obj is a WaveformTable and obj.values holds the attribute, compress values using this algorithm. More documentation about the supported waveform compression algorithms at lgdo.compression.
If the obj LGDO has an hdf5_settings attribute holding a dictionary, it is interpreted as a set of keyword arguments to be forwarded directly to h5py.Group.create_dataset() (exactly like the first format of compression above). This is the preferred way to specify HDF5 dataset options such as chunking etc. If compression options are specified, they take precedence over those set with the compression attribute.
Note
The compression LGDO attribute takes precedence over the default HDF5 compression settings. The hdf5_settings attribute takes precedence over compression. These attributes are not written to disk.
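The precedence rules in the note above can be sketched as a simple dictionary merge. This is an illustration only: the function name resolve_dataset_kwargs is hypothetical, it handles just the string and kwargs-dictionary forms of compression, and it does not reproduce how lgdo resolves these settings internally.

```python
def resolve_dataset_kwargs(h5py_kwargs, compression=None, hdf5_settings=None):
    """Hypothetical helper: merge dataset options by increasing priority."""
    # Lowest priority: default keyword arguments passed to the write call.
    settings = dict(h5py_kwargs)
    # The compression attribute overrides the defaults.
    if isinstance(compression, str):
        settings["compression"] = compression
    elif isinstance(compression, dict):
        settings.update(compression)
    # hdf5_settings overrides everything else.
    if hdf5_settings:
        settings.update(hdf5_settings)
    return settings


resolved = resolve_dataset_kwargs(
    {"compression": "gzip", "shuffle": True},
    compression="lzf",
    hdf5_settings={"chunks": True},
)
```

Here the "lzf" attribute replaces the default "gzip", and the chunking option from hdf5_settings is added on top.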
Note
HDF5 compression is skipped for the encoded_data.flattened_data dataset of VectorOfEncodedVectors and ArrayOfEncodedEqualSizedArrays.
- Parameters:
obj (LGDO) – LH5 object. If object is array-like, writes n_rows starting from start_row in obj.
name (str) – name of the object in the output HDF5 file.
group (str | Group) – HDF5 group name or h5py.Group object in which obj should be written.
start_row (int) – first row in obj to be written.
n_rows (int | None) – number of rows in obj to be written.
wo_mode (str) –
- write_safe or w: only proceed with writing if the object does not already exist in the file.
- append or a: append along axis 0 (the first dimension) of array-like objects and array-like subfields of structs. Scalar objects get overwritten.
- overwrite or o: replace data in the file if present, starting from write_start. Note: overwriting with write_start = end of array is the same as append.
- overwrite_file or of: delete file if present prior to writing to it. write_start should be 0 (it is ignored).
- append_column or ac: append columns from a Table obj only if there is an existing Table in the lh5_file with the same name and size. If the sizes don’t match, or if there are matching fields, it errors out.
write_start (int) – row in the output file (if already existing) to start overwriting from.
**h5py_kwargs – additional keyword arguments forwarded to h5py.Group.create_dataset() to specify, for example, an HDF5 compression filter to be applied before writing non-scalar datasets. Note: compression is ignored if compression is specified as an obj attribute.
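The long and short spellings of wo_mode listed above can be summarized in a small lookup table. The names WO_MODE_ALIASES and normalize_wo_mode are hypothetical, and raising ValueError for an unknown mode is an assumption made for this sketch, not documented behavior.

```python
# Short aliases and the long mode names they stand for, as listed above.
WO_MODE_ALIASES = {
    "w": "write_safe",
    "a": "append",
    "o": "overwrite",
    "of": "overwrite_file",
    "ac": "append_column",
}


def normalize_wo_mode(wo_mode):
    """Hypothetical helper: map a short alias to its long mode name."""
    mode = WO_MODE_ALIASES.get(wo_mode, wo_mode)
    if mode not in WO_MODE_ALIASES.values():
        # Assumption: unknown modes are rejected.
        raise ValueError(f"unknown wo_mode: {wo_mode!r}")
    return mode
```

With this table, normalize_wo_mode("of") and normalize_wo_mode("overwrite_file") refer to the same mode.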
lgdo.lh5.datatype module¶
- lgdo.lh5.datatype._lgdo_datatype_map: dict[str, LGDO] = {<class 'lgdo.types.array.Array'>: '^array<\\d+>\\{.+\\}$', <class 'lgdo.types.arrayofequalsizedarrays.ArrayOfEqualSizedArrays'>: '^array_of_equalsized_arrays<1,1>\\{.+\\}$', <class 'lgdo.types.encoded.ArrayOfEncodedEqualSizedArrays'>: '^array_of_encoded_equalsized_arrays<1,1>\\{.+\\}$', <class 'lgdo.types.encoded.VectorOfEncodedVectors'>: '^array<1>\\{encoded_array<1>\\{.+\\}\\}$', <class 'lgdo.types.fixedsizearray.FixedSizeArray'>: '^fixedsize_array<\\d+>\\{.+\\}$', <class 'lgdo.types.scalar.Scalar'>: '^real$|^bool$|^complex$|^bool$|^string$', <class 'lgdo.types.struct.Struct'>: '^struct\\{.*\\}$', <class 'lgdo.types.table.Table'>: '^table\\{.*\\}$', <class 'lgdo.types.vectorofvectors.VectorOfVectors'>: '^array<1>\\{array<1>\\{.+\\}\\}$'}¶
Mapping between LGDO types and the regular expressions defining the corresponding datatype strings.
- lgdo.lh5.datatype.datatype(expr)¶
Return the LGDO type corresponding to a datatype string.
- Return type:
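The lookup from datatype string to LGDO type can be sketched with the regular expressions shown in the map above. This sketch uses type names as plain strings instead of the lgdo classes, and the names DATATYPE_PATTERNS and match_datatype are hypothetical. Checking the more specific patterns first and raising ValueError when nothing matches are both assumptions, since the generic array pattern would otherwise also match nested arrays.

```python
import re

# Patterns adapted from the map above, most specific first (assumption).
DATATYPE_PATTERNS = [
    ("VectorOfEncodedVectors", r"^array<1>\{encoded_array<1>\{.+\}\}$"),
    ("VectorOfVectors", r"^array<1>\{array<1>\{.+\}\}$"),
    ("ArrayOfEncodedEqualSizedArrays",
     r"^array_of_encoded_equalsized_arrays<1,1>\{.+\}$"),
    ("ArrayOfEqualSizedArrays", r"^array_of_equalsized_arrays<1,1>\{.+\}$"),
    ("FixedSizeArray", r"^fixedsize_array<\d+>\{.+\}$"),
    ("Array", r"^array<\d+>\{.+\}$"),
    ("Scalar", r"^real$|^bool$|^complex$|^string$"),
    ("Struct", r"^struct\{.*\}$"),
    ("Table", r"^table\{.*\}$"),
]


def match_datatype(expr):
    """Hypothetical helper: return the type name matching a datatype string."""
    expr = expr.strip()
    for type_name, pattern in DATATYPE_PATTERNS:
        if re.search(pattern, expr):
            return type_name
    raise ValueError(f"unknown datatype string: {expr!r}")
```

For instance, the strings printed by show(), such as table{t0,dt,values} or array<1>{real}, resolve to "Table" and "Array" respectively.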
- lgdo.lh5.datatype.get_nested_datatype_string(expr)¶
Matches the content of the outermost curly brackets.
- Return type:
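A minimal sketch of what "the content of the outermost curly brackets" means, using a greedy regular expression. The function name nested_datatype_sketch is hypothetical, and returning None when the string has no brackets is an assumption about the edge case.

```python
import re


def nested_datatype_sketch(expr):
    """Hypothetical helper: return the text inside the outermost {...}."""
    # The greedy group spans from the first "{" to the final "}".
    match = re.search(r"\{(.*)\}$", expr.strip())
    return match.group(1) if match else None


inner = nested_datatype_sketch("array<1>{array<1>{real}}")
```

Applied to a nested type, the inner datatype string keeps its own brackets, so the call can be repeated to walk down one level at a time.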
lgdo.lh5.exceptions module¶
lgdo.lh5.iterator module¶
- class lgdo.lh5.iterator.LH5Iterator(lh5_files, groups, base_path='', entry_list=None, entry_mask=None, field_mask=None, buffer_len=3200, friend=None)¶
Bases: Iterator
A class for iterating through one or more LH5 files, one block of entries at a time. This also accepts an entry list/mask to enable event selection, and a field mask.
This class can be used either for random access:
>>> lh5_obj, n_rows = lh5_it.read(entry)
to read the block of entries starting at entry. In case of multiple files or the use of an event selection, entry refers to a global event index across files and does not count events that are excluded by the selection.
This can also be used as an iterator:
>>> for lh5_obj, entry, n_rows in LH5Iterator(...):
...     pass  # do the thing!
This is intended for cases where you are reading a large quantity of data but want to limit your memory usage (particularly when reading in waveforms!). The lh5_obj that is read by this class is reused in order to avoid reallocation of memory; this means that if you want to hold on to data between reads, you will have to copy it somewhere!
- Parameters:
lh5_files (str | list[str]) – file or files to read from. May include wildcards and environment variables.
groups (str | list[str]) – HDF5 group(s) to read. If a list is provided for both lh5_files and groups, they must be the same size. If a file is wildcarded, the same group will be assigned to each file found.
entry_list (list[int] | list[list[int]] | None) – list of entry numbers to read. If a nested list is provided, expect one top-level list for each file, containing a list of local entries. If a list of ints is provided, use global entries.
entry_mask (list[bool] | list[list[bool]] | None) – mask of entries to read. If a list of arrays is provided, expect one for each file. Ignored if an entry list is provided.
field_mask (dict[str, bool] | list[str] | tuple[str] | None) – mask of which fields to read. See LH5Store.read() for more details.
buffer_len (int) – number of entries to read at a time while iterating through files.
friend (Iterator | None) – a “friend” LH5Iterator that will be read in parallel with this. The friend should have the same length and entry list. A single LH5 table containing columns from both iterators will be returned.
- _abc_impl = <_abc._abc_data object>¶
- read(entry)¶
Read the next local chunk of events, starting at entry. Return the LH5 buffer and number of rows read.
- reset_field_mask(mask)¶
Replaces the field mask of this iterator and any friends with mask.
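The warning about the reused lh5_obj buffer can be illustrated without any LH5 files. The generator below is a plain-Python stand-in, not the LH5Iterator implementation: iter_blocks and its list buffer are hypothetical, and only the (object, entry, n_rows) shape of each step mirrors the iteration shown above.

```python
def iter_blocks(data, buffer_len):
    """Yield (buffer, entry, n_rows) while reusing a single buffer."""
    buf = [None] * buffer_len
    for entry in range(0, len(data), buffer_len):
        block = data[entry:entry + buffer_len]
        n_rows = len(block)
        # Overwrite the start of the same buffer; the tail may hold stale rows.
        buf[:n_rows] = block
        yield buf, entry, n_rows


kept_refs = []    # keeps references to the shared buffer (wrong)
kept_copies = []  # keeps a copy of the valid rows (right)
for buf, entry, n_rows in iter_blocks(list(range(7)), buffer_len=3):
    kept_refs.append(buf)
    kept_copies.append(buf[:n_rows])
```

After the loop, every element of kept_refs is the same object holding only the last block (plus stale rows from the previous one), while kept_copies holds the data of each block. This is also why n_rows matters when the last block is shorter than the buffer.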
lgdo.lh5.store module¶
This module implements routines for reading and writing LEGEND Data Objects in HDF5 files.
- class lgdo.lh5.store.LH5Store(base_path='', keep_open=False)¶
Bases: object
Class to represent a store of LEGEND HDF5 files. The two main methods implemented by the class are read() and write().
Examples
>>> from lgdo import LH5Store
>>> store = LH5Store()
>>> obj, _ = store.read("/geds/waveform", "file.lh5")
>>> type(obj)
lgdo.waveformtable.WaveformTable
- Parameters:
- get_buffer(name, lh5_file, size=None, field_mask=None)¶
Returns an LH5 object appropriate for use as a pre-allocated buffer in a read loop. Sets size to size if object has a size.
- Return type:
- gimme_file(lh5_file, mode='r')¶
Returns an h5py file object from the store or creates a new one.
- gimme_group(group, base_group, grp_attrs=None, overwrite=False)¶
Returns an existing h5py group from a base group or creates a new one.
See also
- Return type:
Group
- read(name, lh5_file, start_row=0, n_rows=9223372036854775807, idx=None, use_h5idx=False, field_mask=None, obj_buf=None, obj_buf_start=0, decompress=True)¶
Read LH5 object data from a file in the store.
See also
- read_n_rows(name, lh5_file)¶
Look up the number of rows in an Array-like object called name in lh5_file.
Return None if it is a Scalar or a Struct.
- Return type:
int | None
- write(obj, name, lh5_file, group='/', start_row=0, n_rows=None, wo_mode='append', write_start=0, **h5py_kwargs)¶
Write an LGDO into an LH5 file.
See also
lgdo.lh5.tools module¶
- lgdo.lh5.tools.load_dfs(f_list, par_list, lh5_group='', idx_list=None)¶
Build a pandas.DataFrame from LH5 data.
Given a list of files (can use wildcards), a list of LH5 columns, and optionally the group path, return a pandas.DataFrame with all values for each parameter.
See also
- Returns:
dataframe – contains columns for each parameter in par_list, and rows containing all data for the associated parameters concatenated over all files in f_list.
- Return type:
- lgdo.lh5.tools.load_nda(f_list, par_list, lh5_group='', idx_list=None)¶
Build a dictionary of numpy.ndarrays from LH5 data.
Given a list of files, a list of LH5 table parameters, and an optional group path, return a dictionary of NumPy arrays with all values for each parameter.
- Parameters:
f_list (str | list[str]) – A list of files. Can contain wildcards.
par_list (list[str]) – A list of parameters to read from each file.
lh5_group (str) – group path within which to find the specified parameters.
idx_list (list[ndarray[Any, dtype[_ScalarType_co]] | list | tuple] | None) – for fancy-indexed reads. Must be one index array for each file in f_list.
- Returns:
par_data – A dictionary of the parameter data keyed by the elements of par_list. Each entry contains the data for the specified parameter concatenated over all files in f_list.
- Return type:
- lgdo.lh5.tools.ls(lh5_file, lh5_group='', recursive=False)¶
Return a list of LH5 groups in the input file and group, similar to ls or h5ls. Supports wildcards in group names.
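As a rough illustration of wildcard matching on group names, the sketch below filters a list of names with shell-style patterns from the standard library. It is an assumption that ls() follows similar shell-style rules; match_groups and the group names used here are hypothetical and no file is read.

```python
from fnmatch import fnmatchcase


def match_groups(group_names, pattern):
    """Hypothetical helper: keep group names matching a shell-style pattern."""
    return [name for name in group_names if fnmatchcase(name, pattern)]


groups = ["ch000/raw", "ch001/raw", "ch000/dsp"]
raw_groups = match_groups(groups, "ch*/raw")
```

Here raw_groups contains only the two names ending in /raw.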
- lgdo.lh5.tools.show(lh5_file, lh5_group='/', attrs=False, indent='', header=True, depth=None, detail=False)¶
Print a tree of LH5 file contents with LGDO datatype.
- Parameters:
lh5_file (str | Group) – the LH5 file.
lh5_group (str) – print only contents of this HDF5 group.
attrs (bool) – print the HDF5 attributes too.
indent (str) – indent the diagram with this string.
header (bool) – print lh5_group at the top of the diagram.
depth (int | None) – maximum tree depth of groups to print
detail (bool) – whether to print additional information about how the data is stored
Examples
>>> from lgdo import show
>>> show("file.lh5", "/geds/raw")
/geds/raw
├── channel · array<1>{real}
├── energy · array<1>{real}
├── timestamp · array<1>{real}
├── waveform · table{t0,dt,values}
│   ├── dt · array<1>{real}
│   ├── t0 · array<1>{real}
│   └── values · array_of_equalsized_arrays<1,1>{real}
└── wf_std · array<1>{real}
lgdo.lh5.utils module¶
Implements utilities for LEGEND Data Objects.
- lgdo.lh5.utils.expand_path(path, substitute=None, list=False, base_path=None)¶
Expand (environment) variables and wildcards to return absolute paths.
- Parameters:
path (str) – name of path, which may include environment variables and wildcards.
list (bool) – if True, return a list. If False, return a string; if False and a unique file is not found, raise an exception.
substitute (dict[str, str] | None) – use this dictionary to substitute variables. Environment variables take precedence.
base_path (str | None) – name of base path. Returned paths will be relative to base.
- Returns:
path or list of paths – Unique absolute path, or list of all absolute paths
- Return type:
- lgdo.lh5.utils.expand_vars(expr, substitute=None)¶
Expand (environment) variables.
Note
Malformed variable names and references to non-existing variables are left unchanged.
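The behavior described for variable expansion can be approximated with the standard library. This is a sketch under stated assumptions, not the lgdo implementation: expand_vars_sketch is a hypothetical name, and the order of operations (environment first, then the substitute dictionary) is chosen here so that environment variables take precedence as stated for expand_path(). The variable names are made up for the example.

```python
import os
from string import Template


def expand_vars_sketch(expr, substitute=None):
    """Hypothetical helper: expand $VAR and ${VAR} references in expr."""
    # Environment variables first, so they take precedence.
    # Undefined or malformed references are left unchanged.
    expanded = os.path.expandvars(expr)
    if substitute:
        # safe_substitute also leaves unknown references untouched.
        expanded = Template(expanded).safe_substitute(substitute)
    return expanded


os.environ["LH5_SKETCH_DIR"] = "/data"
os.environ.pop("LH5_SKETCH_TIER", None)
os.environ.pop("LH5_SKETCH_MISSING", None)

from_env = expand_vars_sketch("$LH5_SKETCH_DIR/run.lh5")
from_dict = expand_vars_sketch("$LH5_SKETCH_TIER/run.lh5", {"LH5_SKETCH_TIER": "raw"})
untouched = expand_vars_sketch("$LH5_SKETCH_MISSING/run.lh5")
```

The first call resolves from the environment, the second from the dictionary, and the third leaves the unknown reference in place.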
- lgdo.lh5.utils.fmtbytes(num, suffix='B')¶
Returns a formatted string for printing a human-readable number of bytes.
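A common way to produce such a string is to divide by 1024 until the value fits the current unit. The sketch below shows that idea; the function name fmtbytes_sketch, the unit prefixes, the 1024 base, and the one-decimal format are all assumptions and may differ from what fmtbytes() actually prints.

```python
def fmtbytes_sketch(num, suffix="B"):
    """Hypothetical helper: format a byte count with a binary-scaled prefix."""
    for unit in ["", "k", "M", "G", "T", "P", "E", "Z"]:
        if abs(num) < 1024.0:
            return f"{num:3.1f} {unit}{suffix}"
        num /= 1024.0
    # Anything larger than the listed prefixes.
    return f"{num:.1f} Y{suffix}"


small = fmtbytes_sketch(512)
medium = fmtbytes_sketch(2048)
```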
- lgdo.lh5.utils.get_buffer(name, lh5_file, size=None, field_mask=None)¶
Returns an LGDO appropriate for use as a pre-allocated buffer.
Sets size to size if object has a size.
- Return type:
- lgdo.lh5.utils.get_h5_group(group, base_group, grp_attrs=None, overwrite=False)¶
Returns an existing h5py group from a base group or creates a new one. Can also set (or replace) group attributes.