An HDF5 dataset is a leaf of the HDF5 hierarchy: the object that holds the actual stored data. It is a multidimensional array whose elements all share one type (usually numeric, though strings and compound types are also allowed), with a name, a shape, a dtype, and optional metadata attached as attributes. Datasets play the role that files play in a Linux filesystem; groups play the role of directories.

We create a dataset by calling create_dataset on the file object or on a group:

with h5py.File('./hdf5_data.h5', 'w') as hdf:
    hdf.create_dataset('dataset1', data=matrix_1)

This writes a NumPy array into the file under the name dataset1. Alongside the data, the dataset stores its shape (for example (1000, 1000)) and its dtype (say float64, written <f8 in NumPy's notation: little-endian, 8-byte floating point; HDF5 itself names this type H5T_IEEE_F64LE).
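To make the stored pieces concrete, here is a minimal sketch that writes a dataset and then inspects its name, shape, dtype, and one attribute. The array contents, the scratch path, the group name group1, and the units attribute are all assumptions made for illustration; they do not come from the original example.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical stand-in for matrix_1; the 1000 x 1000 float64 shape is an assumption.
matrix_1 = np.random.random((1000, 1000))

# Hypothetical scratch location so the sketch does not touch existing files.
path = os.path.join(tempfile.mkdtemp(), "hdf5_data.h5")

with h5py.File(path, "w") as hdf:
    dset = hdf.create_dataset("dataset1", data=matrix_1)
    dset.attrs["units"] = "arbitrary"      # hypothetical metadata attribute

    # create_dataset also works on a group, which nests the dataset one level down.
    group = hdf.create_group("group1")     # hypothetical group name
    group.create_dataset("dataset2", data=matrix_1[:10])

    # Copy out plain Python values while the file is still open.
    name = dset.name
    shape = dset.shape
    dtype = dset.dtype
    units = dset.attrs["units"]
    nested_name = group["dataset2"].name

print(name, shape, dtype, units, nested_name)
```

The name attribute is the dataset's full path inside the file, so a dataset created on a group reports the group as its parent, much like a file inside a directory.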

When we read a dataset back, h5py returns a wrapper object, an h5py.Dataset instance, that represents the on-disk dataset without loading it into memory yet:

with h5py.File('./hdf5_data.h5', 'r') as hdf:
    dataset1 = hdf.get('dataset1')   # handle to the on-disk dataset; None if the name is missing
    print(type(dataset1))            # <class 'h5py._hl.dataset.Dataset'>, exposed publicly as h5py.Dataset
    my_array = np.array(dataset1)    # forces the bytes to be read off disk; must run while the file is open
    print(type(my_array))            # <class 'numpy.ndarray'>

Wrapping the handle in np.array(...) materializes the bytes. This two-step pattern (open a handle, then materialize only when we actually need the values) is what lets HDF5 do partial reads: if the dataset is a 100-GB array and we only need a 100-MB slice, we can slice the handle (dataset1[1000:2000]) and get back a NumPy array holding just that slice. Only the relevant part of the file is read from disk (whole chunks, if the dataset uses chunked storage).
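The partial read described above can be sketched as follows. The file path and the small 5000 x 4 array are assumptions chosen so the example runs quickly; the same slicing works on datasets far larger than memory.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical scratch file and toy data (5000 rows, 4 columns).
path = os.path.join(tempfile.mkdtemp(), "partial_read.h5")
values = np.arange(5000 * 4, dtype="float64").reshape(5000, 4)

with h5py.File(path, "w") as hdf:
    hdf.create_dataset("dataset1", data=values)

with h5py.File(path, "r") as hdf:
    dataset1 = hdf["dataset1"]       # handle only; the array data is not read yet
    full_shape = dataset1.shape      # shape is metadata, available without reading the data
    handle_is_dataset = isinstance(dataset1, h5py.Dataset)
    block = dataset1[1000:2000]      # reads rows 1000..1999 into a NumPy array

# block is an ordinary in-memory array, so it stays usable after the file closes.
print(type(block).__name__, block.shape, full_shape)
```

Slicing the handle returns a NumPy array, so no extra np.array(...) call is needed for the slice.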

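Datasets can be created with compression (gzip, lzf, or szip) by passing the compression argument to create_dataset, trading write/read speed for disk space. gzip is the most portable choice, lzf ships with h5py, and szip is only available if the underlying HDF5 build includes it.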
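As a rough illustration of that trade-off, the sketch below writes the same array twice, once uncompressed and once with gzip, and compares file sizes. The all-zeros array is an assumption picked because it compresses extremely well; real data will usually shrink far less. The file names and the compression level 4 are likewise illustrative.

```python
import os
import tempfile

import h5py
import numpy as np

# Hypothetical, highly compressible data: real arrays will compress less.
data = np.zeros((1000, 1000), dtype="float64")

tmp_dir = tempfile.mkdtemp()
path_plain = os.path.join(tmp_dir, "plain.h5")   # hypothetical file names
path_gzip = os.path.join(tmp_dir, "gzip.h5")

with h5py.File(path_plain, "w") as hdf:
    hdf.create_dataset("dataset1", data=data)

with h5py.File(path_gzip, "w") as hdf:
    # compression_opts is the gzip level (0-9); 4 is an arbitrary middle value.
    hdf.create_dataset("dataset1", data=data, compression="gzip", compression_opts=4)

with h5py.File(path_gzip, "r") as hdf:
    dset = hdf["dataset1"]
    compression = dset.compression     # name of the filter applied to this dataset
    chunks = dset.chunks               # compression requires chunked storage; h5py picks a chunk shape
    round_trip_ok = np.array_equal(dset[...], data)   # decompression is transparent on read

size_plain = os.path.getsize(path_plain)
size_gzip = os.path.getsize(path_gzip)
print(compression, chunks, size_plain, size_gzip)
```

Reading a compressed dataset uses the same slicing syntax as an uncompressed one; the filter is applied chunk by chunk behind the scenes.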